U-Net for Semantic Segmentation

As the field of computer vision grows, increasingly efficient tools are being built to parse visual information more finely. Deep learning, and Convolutional Neural Networks (CNNs) in particular, is the primary technology in use. The U-Net is one network architecture used for semantic segmentation of images, in applications ranging from agriculture to medicine.

What Is a U-Net?

A U-Net is a type of CNN that performs semantic segmentation of images. It first encodes an image into progressively smaller feature maps used to classify what is present, then decodes those feature maps back to image resolution so that the classified areas can be segmented pixel by pixel. Based on the architecture of a fully convolutional network, U-Nets were originally developed to perform biomedical image segmentation using smaller training sets than CNNs typically require.

U-Net Architecture

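A U-Net is made of two “mirrored” sides, a contracting path and an expansive path, which together contain 23 convolutional layers. The convolutions are unpadded, so each one trims the border of the feature map it processes. As a result, the output segmentation map is smaller than the input and covers only pixels whose full context was available in the input image, which keeps the classification valid.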
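The total of 23 can be checked with a quick count. This is a sketch that assumes four resolution levels on each side plus a two-convolution block at the bottom, as in the original U-Net paper. The text above gives only the total, and the count treats each up-convolution as a convolutional layer.

```python
# Count the convolutional layers of a U-Net.
# Assumption: four resolution levels per side plus a bottom block.
LEVELS = 4

contracting_convs = LEVELS * 2   # two 3x3 convolutions per level
bottom_convs = 2                 # two 3x3 convolutions at the bottom of the "U"
up_convs = LEVELS                # one 2x2 up-convolution per level
expansive_convs = LEVELS * 2     # two 3x3 convolutions per level
final_conv = 1                   # the 1x1 convolution that maps to classes

total = (contracting_convs + bottom_convs + up_convs
         + expansive_convs + final_conv)
print(total)  # 23
```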

Contracting path

This path resembles the typical architecture of a convolutional network. It consists of layers made of two 3×3 convolutions, each followed by a rectified linear unit (ReLU). Each layer is followed by a 2×2 max pooling operation with a stride of two, which halves the size of the feature map, and the number of feature channels doubles at each step. The convolutions start with 64 feature channels and the doubling continues until 1024 channels are present, after which the output moves to the expansive path.
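The effect of these operations on feature-map size and channel count can be traced with simple arithmetic. The sketch below is illustrative only: `contracting_path` is a hypothetical helper, the 572-pixel input size is taken from the original U-Net paper rather than from this article, and the convolutions are assumed to be unpadded so that each 3×3 convolution trims two pixels per dimension.

```python
def contracting_path(size, channels=64, levels=4):
    """Trace (spatial size, channels) through the contracting path.

    Assumes unpadded 3x3 convolutions (each removes 2 pixels per
    dimension) and 2x2 max pooling with a stride of two.
    """
    skips = []
    for _ in range(levels):
        size -= 4                      # two unpadded 3x3 convolutions
        skips.append((size, channels)) # feature map kept for the expansive path
        if size % 2:
            raise ValueError("feature map size must be even before pooling")
        size //= 2                     # 2x2 max pooling halves the size
        channels *= 2                  # channel count doubles at the next level
    size -= 4                          # two convolutions at the bottom of the "U"
    return skips, (size, channels)


skips, bottom = contracting_path(572)
print(skips)   # [(568, 64), (280, 128), (136, 256), (64, 512)]
print(bottom)  # (28, 1024)
```

Under these assumptions a 572×572 input reaches the bottom of the network as a 28×28 map with 1024 channels.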

Expansive path

This path is where a U-Net diverges from other CNNs, in that the last pooled output is not fed to a fully connected layer. Instead, it undergoes a 2×2 “up-convolution” that halves the number of feature channels and doubles the size of the feature map, bringing it back toward the resolution of the same level on the contracting path.

Each up-convolution is followed by a layer in which the upsampled output is concatenated with its counterpart from the contracting path, cropped to the same size, to restore localization information. The result is processed by two 3×3 convolutions, each followed by a ReLU. This sequence repeats on successive layers until the final layer, where a 1×1 convolution maps each resulting 64-component feature vector to the desired number of classes.
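The same kind of arithmetic can be applied to the expansive path. This is a hedged sketch: `expansive_path` is a hypothetical helper, the starting sizes are those of the original U-Net paper (a 28×28 map with 1024 channels at the bottom), and the convolutions are assumed to be unpadded, which is why the contracting-path maps must be cropped before concatenation.

```python
def expansive_path(bottom, skips):
    """Trace (size, channels, crop) after each level of the expansive path.

    `crop` is the number of border pixels removed from each side of the
    contracting-path feature map so it matches the upsampled map.
    """
    size, channels = bottom
    trace = []
    for skip_size, skip_channels in reversed(skips):
        size *= 2                          # 2x2 up-convolution doubles the size
        channels //= 2                     # ...and halves the channels
        crop = (skip_size - size) // 2     # crop the skip map to the same size
        concat_channels = channels + skip_channels
        assert concat_channels == 2 * channels  # both halves have equal depth
        size -= 4                          # two unpadded 3x3 convolutions
        trace.append((size, channels, crop))
    return trace


# Assumed values from the original paper's 572x572 configuration.
skips = [(568, 64), (280, 128), (136, 256), (64, 512)]
trace = expansive_path((28, 1024), skips)
print(trace)  # [(52, 512, 4), (100, 256, 16), (196, 128, 40), (388, 64, 88)]
```

The last entry is the 64-channel map that the final 1×1 convolution converts into per-class scores. Under these assumptions it is 388×388, smaller than the 572×572 input.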

Semantic Segmentation with U-Nets

Semantic segmentation, also called dense prediction, is the labeling of each pixel in an image with a class identifier so that elements of the image can be separated by class. This produces a finer result than object detection, which only identifies a rectangular boundary around each object, but a coarser one than instance segmentation, which distinguishes individual instances of each class in addition to their pixel boundaries.
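Per-pixel labeling can be illustrated with a tiny example. This is a minimal sketch, not any particular library's API: `label_pixels` is a hypothetical helper, the scores are made up, and the class names are assumptions chosen for illustration. Each pixel simply receives the class with the highest score.

```python
def label_pixels(scores):
    """Return a mask where each pixel holds the index of its best class.

    scores[c][y][x] is the score of class c at pixel (y, x).
    """
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Made-up scores for a 2x2 image and two classes.
scores = [
    [[0.9, 0.2], [0.6, 0.1]],  # class 0, e.g. background
    [[0.1, 0.8], [0.4, 0.9]],  # class 1, e.g. cell
]
mask = label_pixels(scores)
print(mask)  # [[0, 1], [0, 1]]
```

The mask has one class label per pixel and no notion of separate objects, which is what distinguishes semantic segmentation from instance segmentation.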

Semantic segmentation is used in a variety of applications, such as environment detection for autonomous vehicles or geo-sensing, but medical image analysis in particular benefits from the U-Net architecture. Because HIPAA regulations limit patient data pools and having experts classify images is time-intensive, the training sets available for biomedical image segmentation are small. Data augmentation compensates by manipulating the training images, for example by deforming them, which simulates real presentations and artificially expands the size of the available datasets. The network learns invariance and robustness to these variations and can classify tissue correctly despite them.
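The key constraint in augmenting segmentation data is that the image and its label mask must be transformed identically, or the labels no longer line up with the pixels. The sketch below shows this with random flips and 90-degree rotations. These are deliberately simple stand-ins for the deformations mentioned above, and `augment` and its helpers are hypothetical names.

```python
import random


def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def rot90(grid):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment(image, mask, rng):
    """Apply the same random flip and rotation to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    for _ in range(rng.randrange(4)):   # 0 to 3 quarter turns
        image, mask = rot90(image), rot90(mask)
    return image, mask


rng = random.Random(0)                  # seeded for reproducibility
image = [[10, 20], [30, 40]]
mask = [[0, 1], [1, 1]]
aug_image, aug_mask = augment(image, mask, rng)
```

Each call yields a new training pair from the same annotated image, so a small dataset can be stretched without further expert labeling.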

U-Nets also use a weighted loss, in which pixels between objects of interest are given more weight as the distance between the objects shrinks. This lets the network learn to separate objects clearly and to identify features such as cell boundaries more accurately, which in turn supports more accurate diagnosis of affected tissue, for example in cases of cancer.
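One way to express this weighting is the form given in the original U-Net paper: a class-balancing weight plus a border term that decays with the sum of the distances to the two nearest objects. The sketch below assumes that formulation. The function names are hypothetical, and the defaults `w0=10` and `sigma=5` pixels are the values reported in that paper, not values stated in this article.

```python
import math


def border_weight(d1, d2, w0=10.0, sigma=5.0):
    """Extra weight for a pixel lying between objects.

    d1 and d2 are the distances (in pixels) to the nearest and
    second-nearest object; the weight grows as the gap narrows.
    """
    return w0 * math.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))


def pixel_weight(class_weight, d1, d2, w0=10.0, sigma=5.0):
    """Total loss weight: class-balancing term plus border term."""
    return class_weight + border_weight(d1, d2, w0, sigma)


narrow_gap = pixel_weight(1.0, 1, 1)    # pixel squeezed between two cells
wide_gap = pixel_weight(1.0, 20, 20)    # pixel far from any second cell
print(narrow_gap > wide_gap)            # True
```

With these settings, a pixel in a two-pixel gap receives close to the full extra weight, while the extra weight is negligible for a pixel far from any second object.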