There are several types of Convolutional Neural Networks (CNNs) being developed, and all have the potential to greatly improve the speed and accuracy of automatic image identification. In particular, 3D CNNs are being created to improve the identification of moving and volumetric images, such as video from security cameras and medical scans of cancerous tissue, whose analysis is time-consuming and currently requires experts.
Development of 3D CNNs is still at an early stage due to their complexity, but the benefits they can deliver are worth educating yourself on. Read on to learn more about how this new field takes deep learning for computer vision to a whole new level.
What Is a CNN?
A CNN is a class of deep neural network, typically built and trained on a deep learning platform. A CNN is a network of processing layers used to reduce an image to its key features so that it can be more easily classified. The advantage of CNNs over other classification approaches is their ability to learn key features on their own, reducing the need for hand-engineered filters. These algorithms are increasingly being used for tasks such as facial recognition, image classification, video analysis, and automatic caption generation.
CNN Architecture
To maximize efficiency, a CNN operates with three types of layers:
Convolution Layer
This layer is where images are translated into processable data by kernels: small filters of learned parameters. Each kernel filters for a different feature, and multiple kernels are used in each analysis. In a convolution, a kernel slides over small areas of the image; at each position, the kernel and the image patch are multiplied element-wise and summed, and the responses are recorded in an activation map, a representation of how strongly each area matches the kernel's feature. In a 3D CNN, the kernels move through three dimensions of data (height, width, and depth) and produce 3D activation maps.
Pooling Layer
Pooling, or downsampling, is done on the activation maps created during convolution. During pooling, a filter moves across an activation map evaluating a small section at a time, similar to the convolution process. This filter takes either the average of the scanned area, a weighted average centered on the central pixel, or the maximum value, and writes that value to a new, smaller map.
The max pooling method, where the highest value from the scanned area is kept, is the most commonly used because it acts as a noise suppressant during compression. This abstraction reduces the processing power needed to evaluate each map by discarding unimportant features, and it contributes to spatial invariance, the ability to detect a feature regardless of small shifts, rotation, or tilting.
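Max pooling over a 3D activation map can be sketched as keeping only the largest value in each non-overlapping block. Again a minimal NumPy illustration with hypothetical names and a toy input:

```python
import numpy as np

def max_pool3d(act_map, size=2):
    """Non-overlapping 3D max pooling: keep only the largest response
    in each size x size x size block, suppressing weaker (noisier)
    values and shrinking the map by `size` along every axis."""
    D, H, W = (dim // size for dim in act_map.shape)
    out = np.zeros((D, H, W))
    for z in range(D):
        for y in range(H):
            for x in range(W):
                block = act_map[z*size:(z+1)*size,
                                y*size:(y+1)*size,
                                x*size:(x+1)*size]
                out[z, y, x] = block.max()
    return out

act = np.arange(64, dtype=float).reshape(4, 4, 4)  # toy activation map
pooled = max_pool3d(act)
print(pooled.shape)  # (2, 2, 2)
```

A 4x4x4 map shrinks to 2x2x2, an eightfold reduction in the values later layers must process, which is exactly why pooling cuts the processing power required.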
Fully Connected (FC) Layer
After multiple iterations of convolution and pooling, sometimes thousands, the output layers are flattened, the identified features are analyzed, and each possible class is assigned a score, a logit. This analysis is done by the fully connected layer, in which the flattened output is processed by interconnected nodes, as in a fully connected neural network (FCNN). The difference is that in a CNN the convolutional and pooling layers are independent of the FC layer. By isolating features of an image before feeding the output to the FC layer, a CNN restricts the need for higher processing power to the final steps.
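The flatten-then-score step can be sketched as a single matrix multiplication. The function name `fully_connected` and the shapes below are illustrative assumptions; a trained network would have learned `W` and `b` rather than drawing them at random:

```python
import numpy as np

def fully_connected(features, weights, bias):
    """Flatten pooled feature maps into one vector and compute class
    logits with a single dense layer: logits = W @ x + b."""
    x = features.reshape(-1)   # flatten the pooled maps to a 1D vector
    return weights @ x + bias  # one logit per class

rng = np.random.default_rng(0)
pooled = rng.random((2, 2, 2))                # toy pooled feature map
n_classes = 3
W = rng.random((n_classes, pooled.size))      # learned weights (random here)
b = rng.random(n_classes)                     # learned biases (random here)
logits = fully_connected(pooled, W, b)
print(logits.shape)  # (3,)
```

The highest logit marks the predicted class; because the input vector is already small after pooling, this final dense step is where the heavy all-to-all computation is confined.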
If you want to delve deeper into how various CNN architectures developed and how they compare, and to learn how to implement models built with CNNs, you can start your learning at Computer Vision with Deep Learning.
Uses for 3D Convolutions
The 3D activation map produced during the convolution of a 3D CNN is necessary for analyzing data where temporal or volumetric context is important. This ability to analyze a series of frames or images in context has led to the use of 3D CNNs as tools for action recognition and evaluation of medical imaging.
Human action recognition
Action recognition is the process of analyzing the position of objects in a sequence of 2D images, like a video, and classifying them in the context of the surrounding frames to either interpret or predict object movement. Action recognition is being used in the development of assistive technologies, like smart homes, automation of surveillance or security systems, and virtual reality applications, such as creating decentralized meeting spaces.
This process is complicated by the need to account for unrelated movement, like that of the camera or background objects, the computational cost of 3D convolution, and the lack of adequate datasets for training. Currently, a two-stream method, in which spatial and temporal data are analyzed independently through the convolutional and pooling layers and joined at the FC layer, shows the most promise.
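The late-fusion step of the two-stream idea can be sketched as concatenating the two streams' flattened features before a shared fully connected layer. Everything here is a toy assumption: the function name, the shapes, and the random weights stand in for two trained sub-networks:

```python
import numpy as np

def two_stream_fusion(spatial_feat, temporal_feat, W, b):
    """Late fusion: each stream's pooled features are flattened
    independently, concatenated into one joint vector, and mapped
    to action-class logits by a shared fully connected layer."""
    joint = np.concatenate([spatial_feat.reshape(-1),
                            temporal_feat.reshape(-1)])
    return W @ joint + b

rng = np.random.default_rng(1)
spatial = rng.random((4, 4))    # e.g. appearance features from RGB frames
temporal = rng.random((4, 4))   # e.g. motion features from optical flow
W = rng.random((5, spatial.size + temporal.size))  # 5 hypothetical actions
b = rng.random(5)
fused_logits = two_stream_fusion(spatial, temporal, W, b)
print(fused_logits.shape)  # (5,)
```

Keeping the streams separate until this point lets each one specialize, with the appearance stream ignoring camera motion and the motion stream ignoring static background, before the FC layer weighs both kinds of evidence together.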
Medical imaging
Similar to the way CNNs are being used to evaluate video, they can be used to analyze medical imaging, such as CT scans or MRI, for purposes of detection, diagnosis, and development of patient-specific devices. Currently, medical imaging is done by capturing slices through the depth of the tissue to be evaluated, but because the body is made of 3D structures that move, all of the images must be viewed in context to be useful. By combining these static images with volumetric or spatial context, processes such as identification of cancerous cells, evaluation of arterial health, and structural mapping of brain tissue can be initially processed by a 3D CNN, reducing the time needed for human evaluation and allowing faster patient care.
A major hurdle for CNN use in medical practice is the difficulty of training due to the requirements for obtaining datasets: images must follow HIPAA guidelines for patient privacy, and they must be labeled by experts rather than crowdsourced, as CAPTCHA-style labeling is. To work around these limitations, synthetic training data is being created through data augmentation and combined with small authentic datasets. This is feasible because medical images often contain a wider variety of actionable information per image, allowing each one to be used in training multiple kernels.
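The augmentation idea above can be sketched by generating several label-preserving variants of a single 3D scan. This is a minimal illustration with a hypothetical function name; real medical pipelines add further transforms such as elastic deformations and intensity shifts:

```python
import numpy as np

def augment_volume(volume):
    """Generate extra training samples from one 3D scan (slices,
    height, width) via simple label-preserving transforms: axis
    flips and 90-degree rotations within the slice plane."""
    samples = [volume]
    samples.append(np.flip(volume, axis=2))   # left-right flip
    samples.append(np.flip(volume, axis=1))   # up-down flip
    for k in (1, 2, 3):                       # 90, 180, 270 degree turns
        samples.append(np.rot90(volume, k, axes=(1, 2)))
    return samples

scan = np.random.rand(8, 32, 32)  # 8 slices of a toy 32x32 scan
augmented = augment_volume(scan)
print(len(augmented))  # 6
```

One authentic scan yields six training samples here, which is the sense in which augmentation stretches a small, expert-labeled dataset further.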