Many companies rely on image segmentation techniques powered by Convolutional Neural Networks (CNNs), which form the basis of deep learning for computer vision. Image segmentation involves drawing the boundaries of the objects within an input image at the pixel level. This can help achieve object detection tasks in real-world scenarios and differentiate between multiple similar objects in the same image.
Semantic segmentation can detect objects within the input image, isolate them from the background and group them based on their class. Instance segmentation takes this process a step further and can detect each individual object within a cluster of similar objects, drawing the boundaries for each of them.
In this article, you will learn what instance segmentation is, how it works as a subtype of image segmentation, and what differentiates it from the other subtype, semantic segmentation. In addition, you will learn about different instance segmentation algorithms and how they operate to achieve accurate object detection.
What is Instance Segmentation?
Instance segmentation is a subtype of image segmentation that identifies each instance of each object within the image at the pixel level. Instance segmentation, along with semantic segmentation, is one of the two granularity levels of image segmentation.
What Is Image Segmentation?
Image segmentation is a computer vision process designed to simplify image analysis by splitting the visual input into segments, each a collection of pixels (or “super-pixels”) that represents an object or part of an object. Image segmentation sorts pixels into these larger components, which eliminates the need to consider each pixel as a unit of observation.
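To make the idea of super-pixels concrete, here is a minimal sketch using the SLIC algorithm from scikit-image (the library choice and sample image are assumptions for illustration, not something the process requires):

```python
# A minimal super-pixel sketch, assuming scikit-image is available.
import numpy as np
from skimage.data import astronaut        # sample RGB image bundled with scikit-image
from skimage.segmentation import slic

image = astronaut()  # NumPy array of shape (H, W, 3)

# Group pixels into roughly 200 perceptually similar regions ("super-pixels").
segments = slic(image, n_segments=200, compactness=10)

# Every pixel now carries a segment label instead of standing alone.
print(segments.shape)             # same spatial shape as the image: (H, W)
print(np.unique(segments).size)   # number of super-pixels actually produced
```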
Object detection algorithms like YOLO use bounding boxes to indicate the parts of the image that contain an object and then classify them. This restricts their capabilities, as a bounding box provides no information about the shape of the object. For many computer vision tasks, it is not enough to simply identify the object class. These tasks require image segmentation, which indicates the shape of the object, as well as how many times a certain object appears in the image.
Image segmentation allows a granular understanding of the objects within the image. Instead of saying a certain area has sheep, for example, image segmentation can delineate where each individual sheep ends and the next one begins.
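A toy example (assumed data, plain NumPy) shows why a mask is strictly more informative than a bounding box: the box can always be derived from the mask, but not the other way around.

```python
import numpy as np

# Toy 6x6 binary mask for a single L-shaped object.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
], dtype=np.uint8)

# The tightest bounding box around the mask: just four numbers.
ys, xs = np.nonzero(mask)
box = (xs.min(), ys.min(), xs.max(), ys.max())  # (x1, y1, x2, y2)
print(box)  # (1, 1, 3, 3)

# The box covers 9 pixels, but only 5 belong to the object:
# the mask preserves the object's shape, the box does not.
print(mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1].sum())  # 5
```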
Instance Segmentation vs Semantic Segmentation
There are two levels of granularity within the segmentation process:
– Semantic segmentation—classifies the objects featured in the image, each comprising a set of pixels, into meaningful classes that correspond with real-world categories.
– Instance segmentation—identifies each instance of each object featured in the image instead of categorizing each pixel like in semantic segmentation. For example, instead of labeling five sheep as a single “sheep” region, it will identify each individual sheep (see the toy sketch after this list).
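The sketch below (assumed toy data, not the output of any particular model) contrasts the two output formats: semantic segmentation produces one label map, while instance segmentation produces one mask per object.

```python
import numpy as np

# Semantic output: a single label map; both sheep share class id 1 ("sheep").
semantic = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
])

# Instance output: one binary mask per object, so the sheep stay separate.
instance_masks = [
    np.array([[0, 1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0, 0]]),  # sheep #1
    np.array([[0, 0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1, 1]]),  # sheep #2
]

print(np.unique(semantic))   # [0 1] -> background plus a single "sheep" class
print(len(instance_masks))   # 2    -> one mask per individual sheep
```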
Instance Segmentation Deep Learning Networks
Instance segmentation is an important step toward comprehensive image recognition and object detection algorithms. Companies like Facebook are investing significant resources in the development of deep learning networks for instance segmentation to improve their users’ experience while also propelling the industry forward.
Mask R-CNN
Mask Region-based Convolutional Neural Network (Mask R-CNN) is an extension of the Faster R-CNN object detection algorithm that adds a mask prediction head, enabling instance segmentation. This allows us to form segments at the pixel level for each object and also separate each object from its background.
The framework of Mask R-CNN is based on two stages: first, it scans the image to generate proposals, which are areas with a high likelihood of containing an object; second, it classifies these proposals and creates bounding boxes and masks.
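As a minimal inference sketch, torchvision ships a pretrained Mask R-CNN whose output mirrors this two-stage design (using torchvision here is an assumption of the example; any Mask R-CNN implementation would do):

```python
import torch
import torchvision

# Pretrained Mask R-CNN; older torchvision versions use pretrained=True instead.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy RGB image; in practice, load and normalize a real photo.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    outputs = model([image])[0]

# One entry per detected instance: a box, a class label, a confidence
# score, and a pixel-level mask.
print(outputs["boxes"].shape)   # (N, 4)
print(outputs["labels"].shape)  # (N,)
print(outputs["scores"].shape)  # (N,)
print(outputs["masks"].shape)   # (N, 1, 480, 640)
```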
Facebook AI Research for Instance Segmentation
The Facebook Artificial Intelligence (AI) Research (FAIR) team has designed techniques to identify and segment each object in image inputs for use in numerous object detection deep learning applications.
These techniques are called DeepMask, SharpMask and MultiPathNet, and they each serve a different purpose in the process. DeepMask and SharpMask serve as the “eyes” of the algorithm and MultiPathNet as the “brain”.
DeepMask—locates objects within input images, but cannot classify them or describe their boundaries precisely.
SharpMask—refines the output of DeepMask by producing higher-fidelity masks, which improves the accuracy of object detection and boundaries.
MultiPathNet—takes the output of DeepMask and SharpMask and classifies it.
Let’s think of these algorithms as a person looking at the sky and seeing an object. In this scenario, DeepMask is like that person with the naked eye. They can spot the object but are unable to identify it. SharpMask is like a telescope they can use to identify the object as a bird. Finally, MultiPathNet serves as a guide they can use to classify which bird they see. Thus, instead of saying “it’s an object in the sky”, they can produce a much more definitive description: “it’s an albatross”.
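Written as hypothetical Python stubs (none of these names come from FAIR’s actual code; this only sketches how the three stages hand data to one another):

```python
def deep_mask(image):
    """Propose coarse, class-agnostic object masks ("spotting" objects)."""
    ...

def sharp_mask(image, coarse_masks):
    """Refine the coarse proposals into higher-fidelity masks."""
    ...

def multi_path_net(image, refined_masks):
    """Classify each refined mask ("it's an albatross", not just "an object")."""
    ...

def segment_and_classify(image):
    coarse = deep_mask(image)              # the "naked eye"
    refined = sharp_mask(image, coarse)    # the "telescope"
    return multi_path_net(image, refined)  # the "guide"
```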
How FAIR Algorithms Power Image Segmentation Methods
The FAIR algorithms, which build on deep learning convolutional neural networks, are designed for object detection tasks. They are able to find patterns in pixels and perform object segmentation and classification:
Pattern identification—trains CNNs to automatically learn patterns in pixels (such as shape and color) from millions of inputs, enabling generalization and classification of images (see the toy sketch after this list).
Object segmentation—identifies objects within images using the DeepMask and SharpMask techniques to generate mask predictions with high accuracy in terms of object presence and boundaries.
Object classification—classifies the output of DeepMask and SharpMask by using MultiPathNet as the “brain” that recognizes the objects the “eyes” detected.
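As a toy illustration of pattern identification (assumed PyTorch code; the filter is hand-set rather than learned, purely to show what a pixel-pattern detector looks like):

```python
import torch
import torch.nn as nn

# One 3x3 filter that responds where intensity jumps from dark (left) to bright (right).
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():
    conv.weight[:] = torch.tensor([[[[-1.0, 0.0, 1.0],
                                     [-1.0, 0.0, 1.0],
                                     [-1.0, 0.0, 1.0]]]])

# Toy image: dark left half, bright right half -> one vertical edge.
image = torch.zeros(1, 1, 5, 6)
image[..., 3:] = 1.0

with torch.no_grad():
    response = conv(image)
print(response.squeeze())  # strongest activations sit along the vertical edge
```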
FAIR Applications
The FAIR algorithms have a wide range of potential applications for computer vision technology. For example, they can be used to allow computers to recognize objects in photos, which will make it easier to search for specific images without adding explicit tags to those photos. Additionally, they can help vision-impaired people interact with content on their computers.
One of the objectives of FAIR is to allow users who suffer from vision loss to understand the content of an image they were tagged in without relying on the image’s caption. Additionally, these algorithms can automatically provide caption suggestions for users who upload images by identifying and classifying the scenery for a more detailed image description.