DukeMTMC – Technical Specifications

1 Time Synchronization

Each camera has its own local time starting at frame 1. Below is the total number of frames for each camera:

NumFrames = {359580, 360720, 355380, 374850, 366390, 344400, 337680, 353220}

The master camera is Camera 5. The first frame of each camera is synchronized to the master camera’s local time as follows:

StartTimes = {5543,3607,27244,31182,1,22402,18968,46766}

As an example, frame 1.jpg in Camera 1 is synchronized to frame 5543.jpg in Camera 5.

The multi-camera dataset goes from frame 47720 to frame 356648 = 1 hour and 25.89 minutes @ 59.940059 fps.
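For illustration, converting between a camera’s local frames and the master (global) timeline is a simple offset by StartTimes. A minimal Python sketch, assuming 1-based local frames as above (the function names are illustrative, not part of the dataset tools):

    START_TIMES = [5543, 3607, 27244, 31182, 1, 22402, 18968, 46766]

    def local_to_global(camera, local_frame):
        # Convert a 1-based local frame of camera (1..8) to the master timeline.
        return local_frame + START_TIMES[camera - 1] - 1

    def global_to_local(camera, global_frame):
        # Convert a master-timeline frame back to the camera's 1-based local frame.
        return global_frame - START_TIMES[camera - 1] + 1

    # Example from above: frame 1 of Camera 1 corresponds to frame 5543 of Camera 5.
    assert local_to_global(1, 1) == 5543
    assert global_to_local(5, 5543) == 5543  # Camera 5 is the master camera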

Person detections were generated using the Deformable Part Model (DPM) detector [1]. We provide 8 .mat files, one per camera, with the following data format:

[camera, frame, left, top, right, bottom, …, left, top, right, bottom, S, confidence]

There are 9 [left, top, right, bottom] quadruples, one for the main box and 8 for the parts. ‘S’ is a DPM-specific value, and the last value is the detector confidence. We thresholded the detections at -1.
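A minimal sketch of reading one of these files in Python, assuming the .mat file stores the detections as a single matrix with rows in the above layout (the file and variable names below are assumptions):

    import numpy as np
    from scipy.io import loadmat

    mat  = loadmat('camera1.mat')          # hypothetical file name
    dets = np.asarray(mat['detections'])   # assumed variable name; one row per detection

    frames     = dets[:, 1]                # frame number in the camera's local time
    main_boxes = dets[:, 2:6]              # [left, top, right, bottom] of the main box
    confidence = dets[:, -1]               # DPM detector confidence (already thresholded at -1)

    # Optionally apply a stricter confidence threshold than the provided -1 cutoff.
    strong = dets[confidence > 0]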

2 Annotations

Annotations are provided as the image coordinates of points where the center of mass of each person would project down to the ground. The timestamp of each annotation is given in the camera’s local time frame.

Annotations contain the X and Y coordinates of each click, normalized to [0, 1] to provide flexibility for image resizing during annotation.
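Recovering pixel coordinates from a normalized click is a simple scaling, sketched below; the 1920×1080 frame size is an assumption and should be replaced by the actual resolution of your frames:

    FRAME_W, FRAME_H = 1920, 1080   # assumed frame resolution

    def click_to_pixels(x_norm, y_norm, width=FRAME_W, height=FRAME_H):
        # Map a click normalized to [0, 1] back to image (pixel) coordinates.
        return x_norm * width, y_norm * height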

3 Ground Truth

Ground truth is provided in two forms: world coordinates (wx, wy) and bounding boxes (left, top, width, height). World coordinates are generated by first projecting the annotation points (feetX, feetY) from the image plane to the ground plane, then interpolating between the annotated keypoints. Bounding boxes are generated semi-automatically by creating boxes for each trajectory. We manually specified a scaling factor for the height of each trajectory to reduce mistakes in ground truth box generation.
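The image-to-ground projection mentioned above can be expressed as a planar homography. A small illustrative sketch (the homography H and its calibration source are not specified here and are purely an assumption):

    import numpy as np

    def project_to_ground(H, feet_x, feet_y):
        # Apply a 3x3 image-to-ground homography H to a feet point and
        # return ground-plane world coordinates (wx, wy).
        p = H @ np.array([feet_x, feet_y, 1.0])
        return p[0] / p[2], p[1] / p[2]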

The data format for ground truth is:

[camera, ID, frame, left, top, width, height, wx, wy, feetX, feetY]
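A sketch of reading ground truth rows in this layout and recovering both the bounding box and the world coordinates; the file name and on-disk format below are assumptions:

    import numpy as np

    gt = np.loadtxt('trainval_gt.txt')     # hypothetical file; one row per ground truth entry

    camera, person_id, frame = gt[:, 0], gt[:, 1], gt[:, 2]
    left, top, width, height = gt[:, 3], gt[:, 4], gt[:, 5], gt[:, 6]
    wx, wy                   = gt[:, 7], gt[:, 8]     # ground-plane world coordinates
    feet_x, feet_y           = gt[:, 9], gt[:, 10]    # image-plane feet position

    # Convert to [left, top, right, bottom] if that layout is more convenient.
    boxes = np.stack([left, top, left + width, top + height], axis=1)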

4 Train/Test Data

We split the dataset into one training/validation set and two test sets, test-easy and test-hard. The partition is given below:

Trainval: 0-50 min (dataset frames 47720-227540)

Test-easy: 60-85.89 min (dataset frames 263504-356648)

Test-hard: 50-60 min (dataset frames 227541-263503)
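A minimal helper that maps a dataset (global) frame to its partition, using the ranges listed above (the function name is illustrative):

    def split_of(global_frame):
        # Map a dataset (global) frame number to its partition.
        if 47720 <= global_frame <= 227540:
            return 'trainval'
        if 227541 <= global_frame <= 263503:
            return 'test-hard'
        if 263504 <= global_frame <= 356648:
            return 'test-easy'
        return None  # outside the multi-camera portion of the dataset

    assert split_of(100000) == 'trainval'
    assert split_of(250000) == 'test-hard'
    assert split_of(300000) == 'test-easy'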

5 Evaluation

The evaluation is executed on the image plane and over all frames of the test set. We report multi-camera performance with the IDP, IDR and IDF1 scores [2]. We report single-camera performance by additionally providing the existing scores from the MOTChallenge devkit. For both single- and multi-camera scenarios, trackers are ranked by their IDF1 performance. Evaluation uses a 50% bounding-box overlap threshold.

The evaluation script uses ground truth trajectories clipped to each camera’s region of interest (ROI). The script will also clip your tracker output by discarding any rows that contain data points outside the ROI. Bounding boxes whose feet position lies inside the ROI will be preserved.
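For reference, this kind of ROI filtering might look as follows, assuming the ROI is given as a polygon in image coordinates and tracker rows are [frame, ID, left, top, width, height] (both assumptions for illustration; this is not the provided evaluation script):

    import numpy as np
    from matplotlib.path import Path

    def clip_to_roi(rows, roi_polygon):
        # rows: (N, 6) array [frame, ID, left, top, width, height]
        # roi_polygon: (M, 2) array of image-plane vertices
        left, top, width, height = rows[:, 2], rows[:, 3], rows[:, 4], rows[:, 5]
        feet = np.stack([left + width / 2.0, top + height], axis=1)  # bottom-center of each box
        inside = Path(roi_polygon).contains_points(feet)
        return rows[inside]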

[1] A Discriminatively Trained, Multiscale, Deformable Part Model. P. Felzenszwalb, D. McAllester, D. Ramanan. CVPR 2008

[2] Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. E. Ristani, F. Solera, R. S. Zou, R. Cucchiara and C. Tomasi. ECCV 2016 Workshop on Benchmarking Multi-Target Tracking.