DM3T: Harmonizing Modalities via Diffusion for Multi-Object Tracking

Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Qiannan Guo, Zhenbo Li*
China Agricultural University, Beijing Normal University

Project Updates

  • 2026-06-11
    Release: repository reorganized and the full version released.
  • 2026-03-23
    The next update is planned after the CVPR 2026 conference.
  • 2025-11-19
    Updated sample release for review.

DM3T Architecture Overview

The proposed DM3T framework unifies object detection, state estimation, and data association. It harmonizes modalities via Cross-Modal Diffusion Fusion and uses a Hierarchical Tracker for robust association.

Key Features

Cross-Modal Diffusion Fusion

A cross-guided denoising mechanism in which RGB and thermal features guide each other during the diffusion process, harmonizing the multi-modal inputs.
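
To make the idea of mutual guidance concrete, here is a toy sketch in plain Python. It is not the DM3T denoiser: the function name `cross_guided_denoise`, the `guidance` weight, and the linear update rule are all assumptions chosen for illustration. Each step simply pulls the RGB features a little toward the thermal features and vice versa.

```python
def cross_guided_denoise(rgb, thermal, steps=10, guidance=0.2):
    """Toy mutual-guidance update on two equally sized feature vectors.

    Hypothetical illustration only: at every step each modality moves a
    fraction `guidance` of the way toward the other modality's features.
    """
    if len(rgb) != len(thermal):
        raise ValueError("feature vectors must have the same length")
    rgb, thermal = list(rgb), list(thermal)
    for _ in range(steps):
        # Both updates read the same snapshot, so the guidance is symmetric.
        new_rgb = [r + guidance * (t - r) for r, t in zip(rgb, thermal)]
        new_thermal = [t + guidance * (r - t) for r, t in zip(rgb, thermal)]
        rgb, thermal = new_rgb, new_thermal
    return rgb, thermal


if __name__ == "__main__":
    fused_rgb, fused_thermal = cross_guided_denoise([1.0, 0.0], [0.0, 1.0])
    print(fused_rgb, fused_thermal)
```

With a `guidance` value between 0 and 0.5, the gap between the two vectors shrinks at every step while their sum stays constant, which is the sense in which the two inputs are "harmonized" in this toy version.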

Diffusion Refiner

A plug-and-play module designed to enhance and refine unified feature representations through iterative denoising steps, boosting feature distinctiveness.

Hierarchical Tracker

Adaptively handles confidence estimation across multiple levels for improved tracking robustness in occluded or challenging scenes.
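
One common way to associate detections across confidence levels is a two-stage scheme in the spirit of ByteTrack: match high-confidence detections first, then give the remaining tracks a chance to claim low-confidence ones. The sketch below follows that idea and is not necessarily how the DM3T Hierarchical Tracker works. The thresholds `high`, `low`, and `min_iou`, the greedy IoU matching (instead of Hungarian assignment), and all function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, dets, candidates, min_iou):
    """Pair track ids with detection indices, best IoU first."""
    scored = sorted(
        ((iou(box, dets[j]["box"]), tid, j)
         for tid, box in tracks.items() for j in candidates),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used_dets = {}, set()
    for score, tid, j in scored:
        if score < min_iou:
            break  # all remaining pairs overlap too little
        if tid in matches or j in used_dets:
            continue
        matches[tid] = j
        used_dets.add(j)
    return matches


def hierarchical_associate(tracks, dets, high=0.6, low=0.1, min_iou=0.3):
    """Two-level association: high-confidence detections, then low-confidence."""
    high_idx = [j for j, d in enumerate(dets) if d["score"] >= high]
    low_idx = [j for j, d in enumerate(dets) if low <= d["score"] < high]
    # Level 1: every track competes for the high-confidence detections.
    matches = greedy_match(tracks, dets, high_idx, min_iou)
    # Level 2: tracks still unmatched may claim low-confidence detections.
    remaining = {tid: box for tid, box in tracks.items() if tid not in matches}
    matches.update(greedy_match(remaining, dets, low_idx, min_iou))
    matched = set(matches.values())
    unmatched_high = [j for j in high_idx if j not in matched]
    return matches, unmatched_high


if __name__ == "__main__":
    tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
    dets = [
        {"box": (1, 1, 11, 11), "score": 0.9},
        {"box": (21, 21, 31, 31), "score": 0.3},
        {"box": (50, 50, 60, 60), "score": 0.8},
    ]
    print(hierarchical_associate(tracks, dets))
```

In this example, track 2 overlaps only a low-confidence detection (as might happen under partial occlusion), so it is recovered in the second level rather than lost. The unmatched high-confidence detection would typically start a new track.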

End-to-End & Real-time

Unifies object detection, state estimation, and data association without complex post-processing, enabling online tracking with high temporal coherence.

VTMOT Benchmark Results

Method         HOTA ↑   IDF1 ↑   MOTA ↑   DetA ↑   MOTP ↑
FairMOT        37.35    45.80    37.27    34.63    72.53
CenterTrack    39.05    44.42    30.59    38.10    72.87
TransTrack     38.00    43.57    36.16    35.71    73.82
ByteTrack      38.39    45.76    33.15    32.12    73.48
OC-SORT        31.48    38.09    28.95    25.24    73.15
MixSort-OC     39.09    45.80    31.33    33.11    73.63
MixSort-Byte   39.58    46.37    31.59    34.81    73.05
PID-MOT        35.62    42.43    33.33    33.25    71.79
Hybrid-SORT    39.49    46.31    31.07    34.62    72.84
PFTrack        41.07    47.25    43.09    41.63    73.95
DM3T (Ours)    41.70    48.00    36.76    41.46    73.15
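
For a quick read of the table, the snippet below finds the top method per metric. The numbers are copied from the table above; the names `RESULTS`, `METRICS`, and `best_per_metric` are only illustrative and are not part of this repository.

```python
# Columns: HOTA, IDF1, MOTA, DetA, MOTP (higher is better for all five).
METRICS = ("HOTA", "IDF1", "MOTA", "DetA", "MOTP")

RESULTS = {
    "FairMOT": (37.35, 45.80, 37.27, 34.63, 72.53),
    "CenterTrack": (39.05, 44.42, 30.59, 38.10, 72.87),
    "TransTrack": (38.00, 43.57, 36.16, 35.71, 73.82),
    "ByteTrack": (38.39, 45.76, 33.15, 32.12, 73.48),
    "OC-SORT": (31.48, 38.09, 28.95, 25.24, 73.15),
    "MixSort-OC": (39.09, 45.80, 31.33, 33.11, 73.63),
    "MixSort-Byte": (39.58, 46.37, 31.59, 34.81, 73.05),
    "PID-MOT": (35.62, 42.43, 33.33, 33.25, 71.79),
    "Hybrid-SORT": (39.49, 46.31, 31.07, 34.62, 72.84),
    "PFTrack": (41.07, 47.25, 43.09, 41.63, 73.95),
    "DM3T": (41.70, 48.00, 36.76, 41.46, 73.15),
}


def best_per_metric(results):
    """Return {metric: method with the highest score on that metric}."""
    return {
        metric: max(results, key=lambda name: results[name][i])
        for i, metric in enumerate(METRICS)
    }


if __name__ == "__main__":
    for metric, method in best_per_metric(RESULTS).items():
        print(f"{metric}: {method}")
```

On these numbers, DM3T ranks first on HOTA and IDF1, while PFTrack ranks first on MOTA, DetA, and MOTP.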

Model Zoo

Pretrained Weights

Download the pretrained models needed to reproduce our VTMOT baseline and to run tracking tests.

BaiduYun (Pwd: q8i4)

Citation

@InProceedings{Li_2026_CVPR,
    author    = {Li, Weiran and Liu, Yeqiang and Wei, Yijie and Han, Mina and Guo, Qiannan and Li, Zhenbo},
    title     = {DM{\textasciicircum}3T: Harmonizing Modalities via Diffusion for Multi-Object Tracking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {8398-8407}
}

Contact