DM³T: Harmonizing Modalities via Diffusion for MOT

Project Updates

2026-06-11

Release: repository reorganized and the full version released.

2026-03-23

The next update is planned after the CVPR 2026 conference.

2025-11-19

Updated sample release for review.

DM³T Architecture Overview: the proposed DM³T framework unifies object detection, state estimation, and data association. It harmonizes modalities via Cross-Modal Diffusion Fusion and uses a Hierarchical Tracker for robust association.

Key Features

Cross-Modal Diffusion Fusion

A cross-guided denoising mechanism in which RGB and thermal features guide each other during the diffusion process, harmonizing the multi-modal inputs.
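The idea of mutual guidance can be illustrated with a toy loop. This is only a sketch under strong assumptions and not the DM³T implementation: the learned, cross-conditioned denoiser is replaced by a fixed blend toward the other modality, and the function name `cross_guided_refine`, the `guidance` weight, and the averaging fusion rule are all hypothetical.

```python
def cross_guided_refine(rgb, thermal, steps=3, guidance=0.25):
    """Toy mutual-guidance loop (assumption: NOT the DM3T denoiser).

    Each step nudges every feature toward the other modality's current
    value, standing in for a learned, cross-conditioned denoising step.
    """
    rgb, thermal = list(rgb), list(thermal)
    for _ in range(steps):
        # Both updates read the previous step's values, so guidance is symmetric.
        next_rgb = [r + guidance * (t - r) for r, t in zip(rgb, thermal)]
        next_thermal = [t + guidance * (r - t) for r, t in zip(rgb, thermal)]
        rgb, thermal = next_rgb, next_thermal
    # Hypothetical fusion rule: average the two refined feature vectors.
    fused = [(r + t) / 2 for r, t in zip(rgb, thermal)]
    return rgb, thermal, fused


rgb_out, thermal_out, fused = cross_guided_refine([1.0, 0.0], [0.0, 1.0])
print(rgb_out, thermal_out, fused)
```

With `guidance=0.25` the gap between the two modalities halves at every step while their mean stays fixed, which is the "harmonizing" behaviour in its simplest form. A real model would predict the update with a network conditioned on the other modality.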

Diffusion Refiner

A plug-and-play module designed to enhance and refine unified feature representations through iterative denoising steps, boosting feature distinctiveness.

Hierarchical Tracker

Adaptively handles confidence estimation across multiple levels for improved tracking robustness in occluded or challenging scenes.
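One common way to use confidence at more than one level is a two-stage association in the style of ByteTrack: match high-confidence detections first, then try the low-confidence ones against the tracks still unmatched. The sketch below shows that general pattern only. It is not the DM³T Hierarchical Tracker, and the thresholds, the greedy IoU matcher, and all function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, dets, iou_thresh):
    """Greedily pair tracks {id: box} with detections [(index, box)] by IoU."""
    pairs = sorted(
        ((iou(tb, db), tid, di) for tid, tb in tracks.items() for di, db in dets),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, di in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap even less
        if tid not in matches and di not in used:
            matches[tid] = di
            used.add(di)
    return matches


def two_level_associate(tracks, detections, high=0.6, low=0.1, iou_thresh=0.3):
    """Hypothetical two-level association: high-confidence first, then low."""
    high_dets = [(i, b) for i, (b, s) in enumerate(detections) if s >= high]
    low_dets = [(i, b) for i, (b, s) in enumerate(detections) if low <= s < high]
    matches = greedy_match(tracks, high_dets, iou_thresh)
    leftover = {tid: b for tid, b in tracks.items() if tid not in matches}
    matches.update(greedy_match(leftover, low_dets, iou_thresh))
    return matches


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [((1, 1, 11, 11), 0.9), ((21, 21, 31, 31), 0.3), ((50, 50, 60, 60), 0.05)]
matches = two_level_associate(tracks, detections)
print(matches)
```

In this example track 2 is recovered by a detection with a score of only 0.3, the kind of weak response an occluded object tends to produce, while the 0.05 detection is discarded as background.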

End-to-End & Real-time

Unifies object detection, state estimation, and data association without complex post-processing, enabling online tracking with high temporal coherence.

VTMOT Benchmark Results

Method	HOTA ↑	IDF1 ↑	MOTA ↑	DetA ↑	MOTP ↑
FairMOT	37.35	45.80	37.27	34.63	72.53
CenterTrack	39.05	44.42	30.59	38.10	72.87
TransTrack	38.00	43.57	36.16	35.71	73.82
ByteTrack	38.39	45.76	33.15	32.12	73.48
OC-SORT	31.48	38.09	28.95	25.24	73.15
MixSort-OC	39.09	45.80	31.33	33.11	73.63
MixSort-Byte	39.58	46.37	31.59	34.81	73.05
PID-MOT	35.62	42.43	33.33	33.25	71.79
Hybrid-SORT	39.49	46.31	31.07	34.62	72.84
PFTrack	41.07	47.25	43.09	41.63	73.95
DM³T (Ours)	41.70	48.00	36.76	41.46	73.15
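To see at a glance which method leads on each metric, the table can be loaded into a small script. The snippet below is a convenience sketch: the numbers are copied from the rows above, and the helper name `best_per_metric` is an assumption, not part of the repository.

```python
METRICS = ("HOTA", "IDF1", "MOTA", "DetA", "MOTP")

# Values copied from the VTMOT table above, in METRICS order.
RESULTS = {
    "FairMOT": (37.35, 45.80, 37.27, 34.63, 72.53),
    "CenterTrack": (39.05, 44.42, 30.59, 38.10, 72.87),
    "TransTrack": (38.00, 43.57, 36.16, 35.71, 73.82),
    "ByteTrack": (38.39, 45.76, 33.15, 32.12, 73.48),
    "OC-SORT": (31.48, 38.09, 28.95, 25.24, 73.15),
    "MixSort-OC": (39.09, 45.80, 31.33, 33.11, 73.63),
    "MixSort-Byte": (39.58, 46.37, 31.59, 34.81, 73.05),
    "PID-MOT": (35.62, 42.43, 33.33, 33.25, 71.79),
    "Hybrid-SORT": (39.49, 46.31, 31.07, 34.62, 72.84),
    "PFTrack": (41.07, 47.25, 43.09, 41.63, 73.95),
    "DM³T (Ours)": (41.70, 48.00, 36.76, 41.46, 73.15),
}


def best_per_metric(results):
    """Return {metric: method with the highest value}; all metrics are higher-is-better."""
    return {
        metric: max(results, key=lambda name: results[name][i])
        for i, metric in enumerate(METRICS)
    }


best = best_per_metric(RESULTS)
for metric, method in best.items():
    print(f"{metric}: {method}")
```

On these numbers DM³T has the highest HOTA and IDF1, while PFTrack has the highest MOTA, DetA, and MOTP.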

Model Zoo

Pretrained Weights

Download the pretrained models needed to reproduce our VTMOT baseline and to test tracking.

BaiduYun (Pwd: q8i4)

Citation

@InProceedings{Li_2026_CVPR,
    author    = {Li, Weiran and Liu, Yeqiang and Wei, Yijie and Han, Mina and Guo, Qiannan and Li, Zhenbo},
    title     = {DM{\textasciicircum}3T: Harmonizing Modalities via Diffusion for Multi-Object Tracking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {8398-8407}
}

Contact

vranlee@cau.edu.cn
