Object tracking is a field within computer vision that involves following objects as they move across several video frames. In this article, we’ll address the difference between object tracking and object detection, and see how the introduction of deep learning improved the accuracy and analytical power of object detection.
We’ll also look at some challenges of object tracking compared to static object detection, including re-identification, appearance and disappearance, and occlusion. In the Object Tracking Algorithms section, we’ll cover three commonly used object tracking algorithms that use deep learning methods: SORT, GOTURN and MDNet. Read on to find out about object tracking with deep learning in the real world.
What Is Object Tracking?
Object tracking is a field within computer vision that involves tracking objects as they move across several video frames. Objects are often people, but may also be animals, vehicles or other objects of interest, such as the ball in a game of soccer. Below are impressive results achieved by SORT, a deep learning object tracking algorithm.
Object tracking has many practical applications, including surveillance, medical imaging, traffic flow analysis, self-driving cars, people counting and audience flow analysis, and human-computer interaction.
Technically, object tracking starts with object detection: identifying objects in an image and assigning them bounding boxes. The object tracking algorithm assigns an ID to each object identified in the image and, in subsequent frames, tries to carry this ID across and identify the new position of the same object.
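The ID bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tracker: the `CentroidTracker` class, the `max_distance` threshold and the nearest-center matching rule are all assumptions made for the example, and the bounding boxes are assumed to come from some detector as `(x1, y1, x2, y2)` tuples.

```python
import math


def centroid(box):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


class CentroidTracker:
    """Hypothetical minimal tracker: carries IDs over by nearest box center."""

    def __init__(self, max_distance=50.0):
        self.max_distance = max_distance  # assumed matching threshold, in pixels
        self.next_id = 0                  # next unused object ID
        self.tracks = {}                  # object ID -> last known box

    def update(self, detections):
        """Return {object ID: box} for the detections of the current frame."""
        assigned = {}
        unmatched_ids = set(self.tracks)
        for box in detections:
            cx, cy = centroid(box)
            best_id, best_dist = None, self.max_distance
            # Look for the closest existing track that is still unclaimed.
            for track_id in sorted(unmatched_ids):
                tx, ty = centroid(self.tracks[track_id])
                dist = math.hypot(cx - tx, cy - ty)
                if dist <= best_dist:
                    best_id, best_dist = track_id, dist
            if best_id is None:
                # Nothing close enough: treat the detection as a new object.
                best_id = self.next_id
                self.next_id += 1
            else:
                unmatched_ids.remove(best_id)
            assigned[best_id] = box
        # Tracks without a detection in this frame are dropped.
        self.tracks = assigned
        return assigned
```

Calling `update` once per frame returns the same ID for an object as long as it moves less than `max_distance` between frames. Anything further away is given a fresh ID, which is exactly where the challenges discussed later in this article start to show.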
There are two main types of object tracking:
- Offline object tracking—object tracking on a recorded video where all the frames, including future activity, are known in advance.
- Online object tracking—object tracking done on a live video stream, for example, a surveillance camera. This is more challenging because the algorithm must work fast, and it is not possible to take future frames and combine them into the analysis.
Object Tracking vs Object Detection
Object detection has evolved substantially in the past two decades, with the move from traditional statistical or machine learning approaches to deep learning approaches based on Convolutional Neural Networks (CNNs). The introduction of deep learning greatly improved the accuracy and analytical power of object detection. To some, object tracking is simply an extension of object detection.
The creators of a popular algorithm called Simple Online and Realtime Tracking (SORT) argue that modern object detection algorithms can do most of the work of detecting objects and re-identifying them in subsequent frames, so that object tracking can be reduced to simple heuristics. Others have developed extensive object tracking algorithms that work in tandem with object detection, and apply deep learning techniques to carry an identified object over into the next video frames.
Challenges of Object Tracking Compared to Static Object Detection
- Re-identification—connecting an object in one frame to the same object in the subsequent frames
- Appearance and disappearance—objects can move into or out of the frame unpredictably and we need to connect them to objects previously seen in the video
- Occlusion—objects are partially or completely occluded in some frames, as other objects appear in front of them and cover them up
- Identity switches—when two objects cross each other, we need to discern which one is which
- Motion blur—objects may look different due to their own motion or camera motion
- Viewpoints—objects may look very different from different viewpoints, and we have to consistently identify the same object from all perspectives
- Scale change—objects in a video can change scale dramatically, due to camera zoom for example
- Illumination—lighting changes in a video can have a big effect on how objects look and can make it harder to consistently detect them
Object Tracking Algorithms
In this section, we’ll introduce three popular object tracking algorithms that use deep learning methods: SORT, GOTURN and MDNet.
Simple Online and Realtime Tracking (SORT)
SORT is an object tracking algorithm that relies mainly on the output of an underlying object detection engine. It can plug into any object detection algorithm. The algorithm tracks multiple objects in real time, associating the objects in each frame with those detected in previous frames using simple heuristics. For example, SORT matches bounding boxes in neighboring frames so as to maximize their intersection-over-union (IoU) overlap.
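To make the IoU heuristic concrete, here is a minimal Python sketch of IoU-based association between two frames. It is a simplification rather than the SORT implementation: the published algorithm also predicts box positions with a Kalman filter and solves the assignment with the Hungarian algorithm, whereas this example pairs boxes greedily, highest IoU first. The function names and the `min_iou` threshold are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_by_iou(prev_boxes, curr_boxes, min_iou=0.3):
    """Greedily pair previous and current boxes, highest IoU first.

    Returns a list of (prev_index, curr_index) pairs. Boxes whose best
    overlap is below `min_iou` (an assumed threshold) stay unmatched.
    """
    candidates = []
    for i, prev in enumerate(prev_boxes):
        for j, curr in enumerate(curr_boxes):
            score = iou(prev, curr)
            if score >= min_iou:
                candidates.append((score, i, j))
    candidates.sort(reverse=True)

    used_prev, used_curr, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_prev and j not in used_curr:
            pairs.append((i, j))
            used_prev.add(i)
            used_curr.add(j)
    return pairs
```

Each matched pair keeps the ID of the previous box. Unmatched current boxes would start new tracks, and unmatched previous boxes are candidates for removal.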
Generic Object Tracking Using Regression Networks (GOTURN)
GOTURN is trained by comparing pairs of cropped frames from thousands of video sequences. In the first frame (“previous frame”), the location of the object is known, and the frame is cropped to twice the size of the bounding box around the object, with the object centered. The algorithm then tries to predict the location of the same object in the second frame (“current frame”). The same double-sized bounding box is used to crop the second frame. A Convolutional Neural Network (CNN) is trained to predict the location of the bounding box in the second frame.
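The cropping step can be illustrated with a short Python sketch. It assumes that "twice the size" means twice the width and twice the height of the bounding box, and it simply clips the window at the image border, which is a simplification chosen for this example. The helper names `search_region` and `crop` are hypothetical, and frames are represented as plain lists of pixel rows to keep the example self-contained.

```python
def search_region(box, frame_width, frame_height, scale=2.0):
    """Crop window centered on `box`, `scale` times its width and height.

    `box` is (x1, y1, x2, y2) in pixels. The window is clipped to the frame,
    so it can be smaller than requested near the image border.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = scale * (x2 - x1) / 2.0
    half_h = scale * (y2 - y1) / 2.0
    left = max(0.0, cx - half_w)
    top = max(0.0, cy - half_h)
    right = min(float(frame_width), cx + half_w)
    bottom = min(float(frame_height), cy + half_h)
    return (left, top, right, bottom)


def crop(frame, region):
    """Crop a frame stored as a list of pixel rows."""
    left, top, right, bottom = (int(round(v)) for v in region)
    return [row[left:right] for row in frame[top:bottom]]


def crop_pair(previous_frame, current_frame, previous_box):
    """Cut the same window out of both frames, as in the description above."""
    height, width = len(previous_frame), len(previous_frame[0])
    region = search_region(previous_box, width, height)
    return crop(previous_frame, region), crop(current_frame, region)
```

The two crops returned by `crop_pair` are the kind of input pair the network would receive: the first shows the object centered, and the second shows where it has moved within the same window.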
Multi-Domain Network (MDNet)
Multi-Domain Network (MDNet) is a CNN architecture that won the VOT2015 challenge. The objective of MDNet is to speed up training in order to bring tracking closer to real-time results. The strategy is to split the network into two parts. The first part acts as a generic feature extractor that is trained over multiple training sets and learns to distinguish objects from their background. The second part is trained on a specific training set and learns to identify objects within video frames.
As a result, only the weights of the last few CNN layers need to be modified during training, which reduces computation time significantly.
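The idea of updating only the last part of a network can be shown with a toy Python example. This is not MDNet itself: the "shared part" here is a fixed linear feature extractor and the "specific part" is a single linear layer, both invented for illustration, and all function names are hypothetical. The point is only that the training loop reads the shared weights but writes to the head alone.

```python
def features(x, shared_weights):
    """Shared part: a fixed linear feature extractor (one row per feature)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in shared_weights]


def predict(x, shared_weights, head_weights):
    """Full model: shared features followed by a small linear head."""
    feats = features(x, shared_weights)
    return sum(h * f for h, f in zip(head_weights, feats))


def fine_tune_head(samples, shared_weights, head_weights, lr=0.05, epochs=200):
    """Gradient descent on squared error that updates the head only.

    `samples` is a list of (input, target) pairs. The shared weights are
    read but never modified, so each step only touches len(head) values.
    """
    head = list(head_weights)
    for _ in range(epochs):
        for x, target in samples:
            feats = features(x, shared_weights)
            error = sum(h * f for h, f in zip(head, feats)) - target
            # Gradient of 0.5 * error**2 with respect to each head weight.
            head = [h - lr * error * f for h, f in zip(head, feats)]
    return head
```

In a real CNN the same effect is usually obtained by freezing the early layers in the training framework, so that gradients are computed and applied only for the final layers.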