Object detection algorithms enable many advanced technologies and are a primary research focus in industries ranging from transportation to healthcare. For example, object detection algorithms are commonly implemented in sensor systems, such as Lidar, in autonomous cars to enable self-driving.
There are several object detection algorithms with different capabilities. These algorithms are mostly split into two groups according to how they perform their tasks.
The first group is composed of algorithms based on classification, which work in two stages. First, they select the interesting regions of the image, and then they classify objects within those regions using Convolutional Neural Networks (CNNs). This group, which includes solutions such as R-CNN, is usually too slow for real-time applications.
The algorithms in the second group are based on regression: they scan the whole image in one pass and make predictions to localize, identify, and classify objects within it. Algorithms in this group, such as You Only Look Once (YOLO), are faster and can be used for real-time object detection. If you want to train a deep learning algorithm for object detection, you need to understand the different solutions available to you and know which one best suits your needs. Read this article to learn why YOLO is a better overall solution for real-time object detection.
What Is YOLO Object Detection?
You Only Look Once (YOLO) is a network that uses Deep Learning (DL) algorithms for object detection. YOLO performs object detection by classifying certain objects within the image and determining where they are located within it.
For example, if you input an image of a herd of sheep into a YOLO network, it will output a vector of bounding boxes, one for each individual sheep, and classify each of them as such.
How YOLO Improves Over Previous Object Detection Methods
Previous object detection methods like the Region-based Convolutional Neural Network (R-CNN), and variations of it like Fast R-CNN, performed object detection in a multi-step pipeline. R-CNN focuses on specific regions within the image and trains each individual component separately. This process requires R-CNN to classify 2,000 regions per image, which makes it very time-consuming (47 seconds per test image), so it cannot be implemented in real time. Additionally, R-CNN uses a fixed selective search algorithm to propose regions, which means no learning occurs during this stage, so the network may generate inferior region proposals.
This makes object detection networks such as R-CNN harder to optimize and slower compared to YOLO. YOLO is much faster (45 frames per second) and easier to optimize than previous algorithms because it uses a single neural network to run all components of the task. To gain a better understanding of what YOLO is, we first have to explore its architecture and algorithm.
YOLO Architecture – Structure Design and Algorithm Operation
A YOLO network consists of three main parts. First, the algorithm, also known as the predictions vector. Second, the network. Third, the loss function.
The YOLO Algorithm
Once you input an image into a YOLO algorithm, it splits the image into an SxS grid that it uses to predict whether a given bounding box contains an object (or part of one), and then uses this information to predict a class for the object.
Before we can go into detail and explain how the algorithm functions, we need to understand how the algorithm builds and specifies each bounding box. The YOLO algorithm uses four components, plus an additional value, to predict an output:
- The center of a bounding box (bx, by)
- Width (bw)
- Height (bh)
- The Class of the object (c)
The final predicted value is confidence (pc). It represents the probability that an object exists within the bounding box, while the (bx, by) coordinates represent the center of the bounding box.
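To make this concrete, the snippet below sketches how one such prediction could be laid out as a flat vector. The numbers and class names are purely illustrative, not the output of a trained network.

```python
import numpy as np

# A minimal sketch of a single bounding-box prediction.
# All values below are illustrative, not from a trained network.
pc = 0.91            # confidence: probability an object exists in the box
bx, by = 0.48, 0.52  # box center coordinates
bw, bh = 0.30, 0.45  # box width and height
class_scores = np.array([0.05, 0.90, 0.05])  # e.g. [dog, sheep, cat]

prediction = np.concatenate(([pc, bx, by, bw, bh], class_scores))
print(prediction)  # [0.91 0.48 0.52 0.3  0.45 0.05 0.9  0.05]
```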
Typically, most of the bounding boxes will not contain an object, so we need the pc prediction. We can use a process called non-max suppression to remove unnecessary boxes with a low probability of containing an object, as well as boxes that share large overlapping areas with other boxes.
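The sketch below shows one straightforward way such a filter could work, assuming boxes in (x1, y1, x2, y2) corner format; the threshold values are illustrative defaults, not YOLO's exact post-processing settings.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Drop boxes unlikely to contain an object, then drop boxes that
    overlap a higher-scoring box too much."""
    # Keep only boxes whose confidence (pc) passes the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    # Visit survivors from highest to lowest confidence.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    selected = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in selected):
            selected.append(i)
    return selected  # indices of the boxes to keep
```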
The Network
A YOLO network is structured like a regular CNN: it contains convolutional and max-pooling layers, followed by two fully connected layers.
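Below is a heavily reduced PyTorch sketch of that layer pattern, assuming a 448x448 RGB input. The real YOLO V1 network stacks many more convolutional layers, so the channel counts and depths here are illustrative only.

```python
import torch.nn as nn

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

yolo_like = nn.Sequential(
    # Convolution + max-pooling blocks extract image features.
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2),   # 448x448 -> 224x224
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2),   # 224x224 -> 112x112
    # Two fully connected layers map the features to the predictions vector.
    nn.Flatten(),
    nn.Linear(32 * 112 * 112, 4096), nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (B * 5 + C)),  # SxS cells, B boxes x 5 values + C classes
)
```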
The Loss Function
Since the YOLO algorithm predicts multiple bounding boxes for each grid cell, we only want one of them to be responsible for the object within the image. To achieve this, we use the loss function to compute the loss for each true positive. To make the loss function more efficient, we select the bounding box with the highest Intersection over Union (IoU) with the ground truth. This method makes each bounding box predictor specialize, which improves predictions for certain aspect ratios and sizes.
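Here is a sketch of that selection step, again assuming corner-format boxes; responsible_box is an illustrative helper name, not part of any YOLO codebase.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def responsible_box(predicted_boxes, ground_truth):
    """Return the index of the predictor with the highest IoU against the
    ground truth; only that box is penalized for localization errors."""
    return max(range(len(predicted_boxes)),
               key=lambda i: iou(predicted_boxes[i], ground_truth))
```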
Comparing YOLO Versions – YOLO V1 vs YOLO V2 vs YOLO V3
The most current version of YOLO is the third iteration of the object detection network. The creators of YOLO designed each new version to improve on the previous one, mostly focusing on detection accuracy.
YOLO V1
The first version of YOLO was introduced in 2015. It used the Darknet framework and was trained on the ImageNet-1000 dataset. This setup had many limitations that restricted the usability of YOLO V1. Namely, YOLO V1 struggled to identify small objects that appeared in clusters and was inefficient at generalizing to objects in images with dimensions different from the training images. This resulted in poor localization of objects within the input image.
YOLO V2
YOLO V2 was released in 2016 under the name YOLO9000. YOLO V2 used Darknet-19, a 19-layer network augmented with 11 additional layers for object detection. YOLO V2 was designed to take on Faster R-CNN and the Single Shot MultiBox Detector (SSD), which had shown better object detection scores.
YOLO V2 upgrades over YOLO V1 include:
- Improved mean average precision (mAP)—the new higher-resolution classifier increased the input size from 224x224 in YOLO V1 to 448x448, which improved the mAP.
- Better detection of smaller objects—divides the image into a finer 13x13 grid of cells to improve localization and identification of smaller objects in the image.
- Improved detection within images of varying sizes—the network is trained on random images of different dimensions to improve its prediction accuracy on input images of varying dimensions.
- Anchor boxes—provide a single framework for classification and the prediction of bounding boxes. Anchor boxes are designed for specific datasets using k-means clustering, as sketched after this list.
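As a rough illustration of that clustering step, the sketch below groups dataset box shapes using the 1 - IoU distance described in the YOLO9000 paper. The function names and iteration scheme are assumptions for illustration; the input is assumed to be a NumPy array of float (width, height) pairs.

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU of boxes compared by width/height only, as if sharing a corner."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(box_wh, k, iters=100, seed=0):
    """Cluster (width, height) pairs from a dataset into k anchor shapes."""
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every box to the centroid with the smallest 1 - IoU distance.
        assign = np.array([np.argmin(1 - wh_iou(wh, centroids)) for wh in box_wh])
        # Move each centroid to the mean shape of its cluster.
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = box_wh[assign == c].mean(axis=0)
    return centroids
```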
YOLO V3
YOLO V3 is an incremental upgrade over YOLO V2 that uses another variant of Darknet. The YOLO V3 architecture consists of 53 layers trained on ImageNet plus another 53 layers tasked with object detection, amounting to 106 layers. While this dramatically improved the accuracy of the network, it also reduced the speed from 45 fps to 30 fps.
YOLO V3 upgrades over YOLO V2 include:
- Improved bounding box prediction—uses logistic regression to compute an objectness score for each bounding box.
- More accurate class predictions—the softmax used in YOLO V2 has been replaced with independent logistic classifiers for each class, enabling multi-label predictions (contrasted in the sketch after this list).
- Improved abilities at different scales—makes predictions at 3 different scales for every location within the input image, upsampling from previous layers to combine fine-grained and semantic information and improve the quality of the output.
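The toy comparison below illustrates why the change to logistic classifiers matters: softmax forces classes to compete, while independent logistic (sigmoid) classifiers let one object carry several labels. The logit values are illustrative.

```python
import numpy as np

def softmax(logits):
    """YOLO V2 style: scores compete and sum to 1, so labels are exclusive."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def independent_sigmoids(logits):
    """YOLO V3 style: one logistic classifier per class, so an object can
    score high for several labels at once."""
    return 1 / (1 + np.exp(-logits))

logits = np.array([2.0, 1.8, -1.0])        # illustrative class logits
print(softmax(logits))                     # sums to 1; one class dominates
print(independent_sigmoids(logits))        # each in (0, 1); multi-label friendly
```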