DM3T: Harmonizing Modalities via Diffusion for Multi-Object Tracking

Supplementary Material - Project

Paper (Under Review) · Apache 2.0 · Python 3.7+ · PyTorch 1.7+

Project Updates

  • 2025-11-19
    New version, sample release for review.
  • 2025-08-05
    Initial sample release for internal testing.

Key Features

Cross-Modal Diffusion Fusion

Novel cross-guided denoising mechanism where RGB and thermal features provide mutual guidance during the diffusion process, effectively harmonizing multi-modal inputs.

Diffusion Refiner

A plug-and-play module designed to enhance and refine unified feature representations through iterative denoising steps, boosting feature distinctiveness.
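As a rough illustration of what such a refiner could look like, here is a minimal sketch; the class name DiffusionRefiner, the two-layer convolutional head, and the fixed noise level are assumptions for illustration, not the released implementation:

import torch
import torch.nn as nn

class DiffusionRefiner(nn.Module):
    # Illustrative plug-and-play refiner: each step perturbs the input
    # feature map and adds back a predicted denoising residual
    def __init__(self, channels, steps=2, noise_level=0.01):
        super().__init__()
        self.steps = steps
        self.noise_level = noise_level
        self.denoise = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, feat):
        for _ in range(self.steps):
            noisy = feat + torch.randn_like(feat) * self.noise_level
            feat = noisy + self.denoise(noisy)  # residual denoising step
        return feat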

Hierarchical Tracker

Adaptively handles detection confidence across multiple levels, improving tracking robustness in occluded or cluttered scenes.
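For intuition, the sketch below associates existing tracks with high-confidence detections first, then gives the remaining tracks a second pass over low-confidence detections (often occluded targets). The thresholds, the IoU helper, and the greedy matching are illustrative simplifications in the spirit of ByteTrack's two-stage association, not the paper's exact procedure:

import numpy as np

def iou_matrix(tracks, dets):
    # Pairwise IoU between [x1, y1, x2, y2] boxes (hypothetical helper)
    ious = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((t[2] - t[0]) * (t[3] - t[1])
                     + (d[2] - d[0]) * (d[3] - d[1]) - inter)
            ious[i, j] = inter / (union + 1e-9)
    return ious

def greedy_match(ious, thresh):
    # Greedily pair the highest-IoU (track, detection) combinations
    matches, used_t, used_d = [], set(), set()
    for flat in np.argsort(-ious, axis=None):
        i, j = np.unravel_index(flat, ious.shape)
        if ious[i, j] < thresh:
            break
        if i not in used_t and j not in used_d:
            matches.append((int(i), int(j)))
            used_t.add(i)
            used_d.add(j)
    return matches, used_t

def hierarchical_associate(tracks, dets, scores, high=0.6, low=0.1, iou_th=0.3):
    dets, scores = np.asarray(dets, float), np.asarray(scores, float)
    hi = np.where(scores >= high)[0]
    lo = np.where((scores >= low) & (scores < high))[0]

    # Level 1: trusted, high-confidence detections
    m1, used_t = greedy_match(iou_matrix(tracks, dets[hi]), iou_th)
    matches = [(ti, int(hi[dj])) for ti, dj in m1]

    # Level 2: leftover tracks vs. low-confidence detections
    rest = [i for i in range(len(tracks)) if i not in used_t]
    if rest and len(lo):
        m2, _ = greedy_match(iou_matrix([tracks[i] for i in rest], dets[lo]), iou_th)
        matches += [(rest[ti], int(lo[dj])) for ti, dj in m2]
    return matches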

End-to-End & Real-time

Unifies object detection, state estimation, and data association without complex post-processing, enabling online tracking with high temporal coherence.
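Conceptually, the online loop reduces to glue code like the sketch below; model, tracker, and every call here are placeholders rather than the released API:

def track_sequence(model, tracker, frames_rgb, frames_t):
    # Hypothetical per-frame loop over paired RGB / thermal inputs
    for x_rgb, x_t in zip(frames_rgb, frames_t):
        # One forward pass yields boxes, confidences, and per-object state
        # (e.g., motion offsets) with no separate post-processing stage
        boxes, scores, states = model(x_rgb, x_t)
        # Association runs online, frame by frame
        yield tracker.update(boxes, scores, states)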

Architecture Overview

The proposed DM3T framework. It consists of Cross-Modal Diffusion Fusion for harmonizing modalities and a Hierarchical Tracker for robust association.

Benchmark Results (VTMOT)

Method       HOTA ↑   IDF1 ↑   MOTA ↑   DetA ↑   MOTP ↑
FairMOT      37.35    45.80    37.27    34.63    72.53
CenterTrack  39.05    44.42    30.59    38.10    72.87
ByteTrack    38.39    45.76    33.15    32.12    73.48
PFTrack      41.07    47.25    43.09    41.63    73.95
DM3T (Ours)  41.70    48.00    36.76    41.46    73.15

Evaluation metrics on the VTMOT test split. HOTA and IDF1 are the primary metrics for multi-object tracking performance.

Sample C-MDF (Cross-Modal Diffusion Fusion) Preview

core.py - CrossModalDiffusionFusion

import torch
import torch.nn as nn


class CrossModalDiffusionFusion(nn.Module):
    def __init__(self, channels, steps=3, noise_level=0.01):
        super().__init__()
        self.steps = steps
        self.noise_level = noise_level

        # Each denoiser sees its own noisy features concatenated with the
        # other modality's refined features (2 * channels in, channels out)
        self.denoise_rgb = self._make_denoise_net(channels * 2, channels)
        self.denoise_t = self._make_denoise_net(channels * 2, channels)

    @staticmethod
    def _make_denoise_net(in_channels, out_channels):
        # Minimal residual-prediction head so the sample runs standalone;
        # the released denoiser may be deeper
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x_rgb, x_t):
        refined_rgb = x_rgb
        refined_t = x_t

        # Iterative cross-modal guidance
        for _ in range(self.steps):
            # Perturb both modalities to emulate a forward diffusion step
            noisy_rgb = refined_rgb + torch.randn_like(refined_rgb) * self.noise_level
            noisy_t = refined_t + torch.randn_like(refined_t) * self.noise_level

            # Cross-guidance: thermal guides the RGB denoiser, then the
            # freshly refined RGB guides the thermal denoiser
            guidance_for_rgb = torch.cat([noisy_rgb, refined_t], dim=1)
            refined_rgb = noisy_rgb + self.denoise_rgb(guidance_for_rgb)

            guidance_for_t = torch.cat([noisy_t, refined_rgb], dim=1)
            refined_t = noisy_t + self.denoise_t(guidance_for_t)

        # Harmonized output: element-wise sum of the refined modalities
        return refined_rgb + refined_t
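
A quick shape check for the module above; the channel count and feature-map size here are arbitrary:

fusion = CrossModalDiffusionFusion(channels=64, steps=3, noise_level=0.01)
x_rgb = torch.randn(2, 64, 32, 32)  # (batch, channels, height, width)
x_t = torch.randn(2, 64, 32, 32)
fused = fusion(x_rgb, x_t)
print(fused.shape)  # torch.Size([2, 64, 32, 32])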

Getting Started

Environment Setup

# Create conda environment
conda create -n dm3t python=3.8
conda activate dm3t

# Install dependencies
pip install -r requirements.txt
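
Optionally, confirm that PyTorch is installed and can see the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"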

Dataset Preparation

Organize the VTMOT dataset structure as follows:

VTMOT/
├── images/
│   ├── train/
│   │   ├── $sequence_name$/
│   │   │   ├── gt/
│   │   │   └── infrared/
│   ...
└── annotations/
    ├── train/
    └── test/

Then convert annotations:

python tools/convert_vtmot_to_coco.py
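
For reference, the converted annotations follow a COCO-style layout along the lines sketched below; the exact fields written by convert_vtmot_to_coco.py may differ, and the category name, file paths, and image sizes here are placeholders:

# Assumed COCO-style schema (illustrative only); boxes use COCO [x, y, w, h]
x, y, w, h = 100.0, 80.0, 40.0, 90.0
coco_example = {
    "images": [
        {"id": 1, "video_id": 1, "frame_id": 1,
         "file_name": "sequence_name/infrared/000001.jpg",
         "width": 640, "height": 512},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "track_id": 7,
         "bbox": [x, y, w, h], "area": w * h, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "person"}],
}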

Training

cd src
python main.py tracking \
  --exp_id exp_v1 \
  --dataset vtmot \
  --arch dla_34

Evaluation

cd trackeval
python eval.py \
  --BENCHMARK VTMOT \
  --SPLIT_TO_EVAL test