A capsule network is a neural network architecture developed by Geoffrey Hinton and his collaborators. A capsule network takes a distinct approach to image processing built on two ideas: equivariant mapping and the mapping of a hierarchy of parts. Equivariant mapping preserves position and pose information, while the hierarchy of parts assigns each part to a whole. Together, this information lets the network understand objects in a three-dimensional context.
In this article, we’ll explain the concept of capsule networks: what a capsule is and how a CapsNet differs from a traditional convolutional neural network (CNN).
A Brief History
Geoffrey Hinton and his team first introduced the concept of CapsNets in 2011, in a paper titled “Transforming Autoencoders”. However, it was only in 2017 that Hinton and his team published a dynamic routing mechanism for capsule networks. This mechanism was shown to decrease error rates on MNIST and to reduce the amount of training data required. The results were also of higher quality than those of a convolutional neural network (CNN) on highly overlapping digits.
What Are Capsule Neural Networks? (CapsNet)
A Capsule Neural Network (CapsNet) is a type of Artificial Neural Network (ANN). The approach, which aims to mimic biological neural organization, can be used to better model hierarchical relationships. The concept involves adding structures known as capsules to a convolutional neural network.
Traditional CNNs
In traditional CNNs, every neuron in the first layer corresponds to a pixel and feeds information about that pixel to the next layers. Each subsequent convolutional layer combines the outputs of a group of neurons, so that a single neuron deeper in the network can represent an entire region of the image.
A CNN can thus learn to represent a group of pixels that looks like, for example, the eye of a cat, particularly if we have several examples of cat eyes in our data set. The network then learns to increase the weight (importance) of that eye feature when determining whether an image shows a cat.
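As a sketch of this classic pattern (using PyTorch; the layer sizes and class count are arbitrary assumptions for illustration), stacked convolutions turn pixels into increasingly abstract feature detectors, which a final classifier weighs up:

```python
import torch
import torch.nn as nn

# A minimal CNN: each conv layer aggregates a neighborhood of the
# previous layer, so deeper neurons respond to larger image regions.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # pixels -> edges
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # edges -> parts (e.g. an "eye")
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        h = self.features(x)                 # (N, 32, 7, 7) for a 28x28 input
        return self.classifier(h.flatten(1))

logits = TinyCNN()(torch.randn(1, 1, 28, 28))  # one random 28x28 "image"
```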
The role of capsules
This approach only considers whether the object exists somewhere around a specific location in the image. It does not take into account the orientation of the object or the spatial relationships between objects. Capsules, by contrast, carry more information about individual objects. A capsule is a vector that denotes the features of an object and the probability that it is present. These features can include parameters such as “pose” (size, position, orientation), velocity, deformation, albedo (light reflection), texture, and hue.
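As a toy illustration (NumPy; the pose parameters and their values are invented for this example), a capsule’s output vector can be read in two parts: its length as the probability that the entity is present, and its direction as the pose. In a real CapsNet, a squashing nonlinearity, shown later in this article, keeps the length below 1.

```python
import numpy as np

# Hypothetical capsule output: one entry per instantiation parameter.
# The values here are arbitrary, for illustration only.
capsule = np.array([0.4, -0.1, 0.2, 0.05])  # e.g. x, y, scale, rotation

presence_probability = np.linalg.norm(capsule)   # vector length
pose_direction = capsule / presence_probability  # unit vector: the pose

print(f"P(entity present) ~ {presence_probability:.2f}")
print(f"pose (direction)  = {pose_direction.round(2)}")
```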
The Need for CapsNets in Image Classification
Although convolutional networks successfully handle computer vision tasks, including localization, classification, object detection, instance segmentation, and semantic segmentation, the need for CapsNets in image classification arises because:
CNNs must be trained on large numbers of images (or reuse parts of neural networks that have already been trained). CapsNets generalize effectively and require less training data.
CNNs don’t deal well with ambiguity. CapsNets do, so they can make sense of crowded scenes (although they currently struggle with complex backgrounds).
CNNs need additional components to automatically recognize which object a part belongs to (for example, that this arm belongs to this person). CapsNets provide this hierarchy of parts.
Drawbacks of CNNs
CNNs are not good with changes in object orientation
If you turn an image of a face upside down, the network will no longer identify the nose, the eyes, the mouth, or the spatial relationships between them. Conversely, if you swap the locations of the eyes and the nose, the CNN will still identify the image as a face, even though it is not a “real” face. CNNs can learn statistical patterns in images, but they can’t learn what actually makes something look like a face.
CNNs lose information in the pooling layers
When performing pooling, CNNs tend to lose information that is useful for tasks such as object detection and image segmentation. When the pooling layer discards spatial information about the location, rotation, scale, and other positional characteristics of an object, segmentation and detection become challenging.
Today, CNN architectures can reconstruct positional information via advanced techniques. However, these techniques are not entirely accurate, and reconstruction is an involved process. An additional issue with the pooling layer is that if the position of an object is slightly altered, the activations barely change. This invariance is helpful for image classification, but it makes it hard to precisely locate an object in an image.
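To see what pooling throws away, here is a small NumPy sketch (the 4x4 inputs are toy values made up for illustration): two images whose only difference is the feature’s exact position produce identical max-pooled outputs, so the position is unrecoverable downstream.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0  # feature in the top-left corner...
b = np.zeros((4, 4)); b[1, 1] = 1.0  # ...vs. one pixel down and right

# Both pool to the same 2x2 map: the exact position is lost.
print(max_pool_2x2(a))
print(max_pool_2x2(b))
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```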
How Capsule Networks Solve These Problems
Capsule networks work by implementing groups of neurons that encode spatial information as well as the likelihood of an object existing in an image. The length of a capsule’s vector represents the probability that a feature is present in the image, and the direction of the vector encodes its pose information.
A capsule is a group of neurons whose activity vector corresponds to the instantiation parameters of an object or object part. We use the length of the activity vector to show the likelihood that the entity is present. The orientation of the vector indicates the instantiation parameters.
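Because a raw activity vector can have any length, CapsNets pass it through a “squash” nonlinearity so that its length behaves like a probability. Here is a minimal NumPy sketch of the squash function from the 2017 dynamic-routing paper (the epsilon term is our own numerical-stability assumption):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Shrink a vector's length into [0, 1) while keeping its direction.

    v = (|s|^2 / (1 + |s|^2)) * (s / |s|)
    Short vectors shrink toward 0; long vectors approach length 1.
    """
    norm_sq = np.sum(s ** 2)
    norm = np.sqrt(norm_sq) + eps
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

print(np.linalg.norm(squash(np.array([0.1, 0.0]))))   # ~0.01: entity unlikely
print(np.linalg.norm(squash(np.array([10.0, 0.0]))))  # ~0.99: entity likely
```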
In computer graphics, rendering generates an image of an object from a set of instantiation parameters that describe its form. Capsule networks perform the opposite task: the network learns to inversely render an image, examining the image and attempting to predict its instantiation parameters. It learns this by reconstructing the object it believes it detected and comparing the reconstruction against the labeled example from the training data. It thus gets better at predicting the instantiation parameters.
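As a concrete sketch of this reconstruction step, here is a minimal PyTorch decoder; the layer sizes mirror the MNIST decoder from the 2017 dynamic-routing paper, but treat the rest (the random capsule, the stand-in image) as illustrative assumptions.

```python
import torch
import torch.nn as nn

# Decoder: maps a capsule's instantiation parameters back to pixels.
# A 16-dimensional capsule -> a 28*28 image, as in the MNIST CapsNet decoder.
decoder = nn.Sequential(
    nn.Linear(16, 512), nn.ReLU(),
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 28 * 28), nn.Sigmoid(),
)

capsule = torch.randn(1, 16, requires_grad=True)  # predicted instantiation parameters
image = torch.rand(1, 28 * 28)                    # the training image (stand-in here)

reconstruction = decoder(capsule)
# Reconstruction loss: how far the "inverse rendering" is from the input.
loss = nn.functional.mse_loss(reconstruction, image)
loss.backward()  # gradients improve the parameter predictions
```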
Addressing the “Picasso problem”
CapsNets can address the “Picasso problem” in image recognition: images that contain all the right components but not in the correct spatial relationships, for example, a “face” in which the positions of an eye and an ear are swapped. While viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part and object level, and CapsNets exploit this. It is akin to inverting the rendering of an object with several parts.
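A small NumPy sketch of what “linear at the part level” means (2D poses as rotation matrices; all values are illustrative assumptions): rotating the whole face applies the same matrix to every part’s pose, so each part’s pose relative to the face is unchanged.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

face = rotation(0.0)             # pose of the whole (identity)
eye = rotation(0.1)              # pose of a part, slightly tilted

viewpoint = rotation(np.pi / 6)  # rotate the viewpoint by 30 degrees
face_new, eye_new = viewpoint @ face, viewpoint @ eye

# The part-whole relationship is the same linear map before and after:
print(np.allclose(np.linalg.inv(face) @ eye,
                  np.linalg.inv(face_new) @ eye_new))  # True
```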
Viewpoint invariance
A traditional CNN can only identify a cat face that resembles the cat faces in its training dataset, with similar size and orientation, because the features of the cat face are tied to fixed locations within the pixel frame. For example, it may have learned a cat face where the nose is around pixel [60, 60], the mouth around [60, 30], and the eyes around [30, 80] and [80, 80]. It can thus only identify images that have similar features in similar locations.
Therefore, it needs a distinct representation for a “cat face rotated by 30 degrees” or a “small cat face”. These representations ultimately map to the same class, but it means the CNN must see many examples of each kind of transformation to be able to recognize a cat face in the future.
A capsule network, by contrast, can develop a general representation of a “cat face” and detect the transformation (size, rotation, etc.) of each of its features (mouth, snout, etc.). It can tell whether all the features are transformed or rotated in the same direction and to the same degree, and can therefore predict with more confidence that the image shows a cat face.
Because CapsNets generalize the class rather than memorizing every single viewpoint variant, they are not restricted to a specific viewpoint.
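This check that all parts “agree” is what the dynamic routing mechanism implements. Below is a compact NumPy sketch of the routing-by-agreement loop from the 2017 paper; the capsule counts, dimensions, and random predictions are placeholder assumptions, and `squash` here is a batched variant of the function shown earlier.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Batched squash over the last axis: length in [0, 1), direction kept."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / (np.sqrt(norm_sq) + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# u_hat[i, j]: what lower capsule i predicts for higher capsule j's pose.
num_lower, num_higher, dim = 6, 2, 8       # toy sizes, chosen arbitrarily
u_hat = np.random.randn(num_lower, num_higher, dim)

b = np.zeros((num_lower, num_higher))      # routing logits
for _ in range(3):                         # 3 routing iterations, as in the paper
    c = softmax(b, axis=1)                 # each lower capsule splits its vote
    s = (c[..., None] * u_hat).sum(0)      # weighted sum of predictions
    v = squash(s)                          # higher capsules' output vectors
    b += (u_hat * v).sum(-1)               # reward predictions that agree with v

print(np.linalg.norm(v, axis=-1))          # lengths ~ presence probabilities
```

Predictions that agree with the consensus output get a larger routing coefficient on the next iteration, which is exactly the “all parts transformed the same way” test described above.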