The 5 Computer Vision Techniques of the Future

Niyel Hassan
15 min read · Feb 12, 2021


A few weeks ago, I worked on a project using the Raspberry Pi and the Coral USB Accelerator. This was a computer vision project called the Embedded Teachable Machine. In short, it is a device that lets you train a computer vision system to recognize objects simply by pressing buttons to input training data. You can check out that project over here. In case you do not know what the Raspberry Pi is, it is a small single-board computer that runs a Linux-based operating system and is often programmed in Python. The Coral USB Accelerator is an Edge TPU that enables high-speed machine learning inferencing. After the project was done, I wanted to learn more about computer vision. While researching, I learned that there are five main types of computer vision. This article will explore those types and how they work.

What is computer vision? Computer vision is a field of AI that uses deep learning models to understand images and videos.

What you will see in this article:

  1. How a CNN can be used for image classification
  2. Different types of R-CNNs and how they help in object detection
  3. Traditional and untraditional methods for object tracking
  4. How a U-NET does semantic segmentation
  5. How Mask R-CNN, a variation of the original, can be used for instance segmentation

Image Classification

Now that we know what computer vision is, we can dive into a few different computer vision types. The first and most straightforward type of computer vision is image classification. Image classification is exactly what it sounds like. Classifying what an image is. If you give the computer a picture of a dog, you would want it to classify the picture as a dog.

Image classification is done by a CNN most of the time. CNN is short for a Convolutional Neural Network. If you do not know what a neural network is, I recommend checking out this video before you continue to read. A CNN works by assigning different levels of importance to certain features in an image and differentiating between images using those features.

Before going into how a CNN works, let’s understand how a computer sees an image. Computers do not view images the way humans do. Each pixel of an image is viewed as a number between 0 and 255 (or three numbers, one per color channel). Combined, these numbers create a matrix. Grayscale images can be represented by a 2D matrix, while RGB images require a 3D matrix.
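To make this concrete, here is a minimal sketch using NumPy and Pillow that loads an image and looks at it as a matrix of numbers. The file name "dog.jpg" is just a placeholder for this example.

```python
import numpy as np
from PIL import Image

# "dog.jpg" is a placeholder file name for this sketch
img = Image.open("dog.jpg")

gray = np.array(img.convert("L"))    # grayscale -> 2D matrix (height x width)
rgb = np.array(img.convert("RGB"))   # color -> 3D matrix (height x width x 3)

print(gray.shape)              # e.g. (480, 640)
print(rgb.shape)               # e.g. (480, 640, 3)
print(gray.min(), gray.max())  # pixel values fall between 0 and 255
```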

There are three main layers in a CNN: a convolutional layer, a pooling layer, and a fully connected layer. Each layer plays a different part in the CNN. A CNN can have many more than three layers, but these three types of layers are what it needs to classify an image correctly.

Convolutional Layer

The convolutional layer is the first layer in a CNN. The convolutional layer’s main objective is to extract important features such as edges, colors, and corners from the input image. You can think of the convolutional layer as an operation that is performed on a group of pixels in an image to make some features more prominent.

Here is an example of what happens during convolution. Imagine an input image with the dimensions of 5x5x1. A kernel, the operation we are applying to the pixels, slides over the image; the kernel is a parameter that is learned by training the CNN. The result is a smaller matrix that represents our image after the convolutional layer.

There are also many adjustable parameters when applying the kernel on the input image, such as the stride length and padding. These parameters do not need to be changed but can help when fine-tuning a CNN.

At the end of the convolutional layer, we have a matrix with smaller dimensions than the original image and with clearer, more prominent features. Convolution is an essential process in a CNN.
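As a rough illustration, here is a minimal sketch in PyTorch of applying a convolutional layer to a 5x5 grayscale image. The sizes are made up for this example.

```python
import torch
import torch.nn as nn

# A fake 5x5 grayscale "image": batch size 1, 1 channel, 5x5 pixels
image = torch.rand(1, 1, 5, 5)

# A convolutional layer with a single 3x3 kernel (the kernel weights are learned during training)
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

feature_map = conv(image)
print(feature_map.shape)  # torch.Size([1, 1, 3, 3]) -- smaller than the input, as described above
```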

Pooling Layer

The pooling layer is the second layer of a CNN. The pooling layer’s main objective is to reduce the image’s dimensions to decrease the computational power required to process the data. You can think of the pooling layer as a way to shrink the image without losing any important features.

Pooling is done by selecting a group of pixels and applying an operation to make that group of pixels into one pixel.

The most popular types of pooling are max pooling and average pooling.

Max pooling is when the maximum value is chosen from a group of pixels to represent the pixels. Average pooling is when the average of a group of pixels represents the group of pixels. Each pooling method has its advantages and disadvantages, but one is not better than the other.

After going through pooling, we have a significantly smaller matrix than our input image that still holds key features of the image.
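Continuing the PyTorch sketch from above (again with made-up sizes), max pooling and average pooling both cut the feature map's dimensions in half here:

```python
import torch
import torch.nn as nn

feature_map = torch.rand(1, 1, 4, 4)  # pretend output of a convolutional layer

max_pool = nn.MaxPool2d(kernel_size=2)  # keep the maximum of every 2x2 group of pixels
avg_pool = nn.AvgPool2d(kernel_size=2)  # or keep the average of every 2x2 group

print(max_pool(feature_map).shape)  # torch.Size([1, 1, 2, 2]) -- half the height and width
print(avg_pool(feature_map).shape)  # torch.Size([1, 1, 2, 2])
```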

Fully Connected Layer

So far, we have extracted important features and reduced the dimensions of the input image. The fully connected layer is where the classification takes place.

Before classifying the image, we have to flatten the image from a matrix into a one-column vector.

The flattened output is fed into the fully connected layers, and backpropagation is applied on every iteration of training. Over the course of its training, the CNN learns which features matter for classifying images.

After all the processes are completed and the CNN is trained, it can be used for image classification.
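Putting the three layer types together, here is a minimal sketch of a complete CNN classifier in PyTorch. The layer sizes and the ten output classes are arbitrary choices for this example.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3)      # convolutional layer: extract features
        self.pool = nn.MaxPool2d(2)                     # pooling layer: shrink the feature map
        self.fc = nn.Linear(8 * 13 * 13, num_classes)   # fully connected layer: classify

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = x.flatten(1)          # flatten the matrix into a one-column vector per image
        return self.fc(x)         # class scores; the highest score is the predicted class

model = TinyCNN()
fake_batch = torch.rand(4, 1, 28, 28)   # four fake 28x28 grayscale images
print(model(fake_batch).shape)          # torch.Size([4, 10])
```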

Object Detection

Object detection is another type of computer vision. Object detection is when a computer can recognize and locate one or more objects in an image. Region-based CNNs (R-CNNs) are used most of the time for object detection. An R-CNN is a CNN-based object detection method; CNNs are the base for many different computer vision techniques. YOLO and SSD can also be used for object detection.

R-CNN

There are two main steps in an R-CNN. The first step is selective search, and the second step is CNN-based classification.

Selective search creates region proposals from the input image. Region proposals are just smaller parts of the original image that could contain the objects we are searching for. Selective search creates about 2000 region proposals in one image.

After the region proposals are created, each proposal is fed to a CNN. You can think of each proposal as its own image. The CNN extracts features of each region proposal and tries to classify each region. If the object is present there, then a bounding box will be put around the item. If not, the CNN will move on to the next region.
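Here is a heavily simplified sketch of this second step, assuming the region proposals already exist and using an untrained torchvision ResNet as a stand-in for the trained classifier. The boxes and class count are invented.

```python
import torch
import torchvision

# Stand-in classifier; in a real R-CNN this CNN would be trained on the target classes
cnn = torchvision.models.resnet18(num_classes=3)  # 3 object classes, chosen arbitrarily
cnn.eval()

image = torch.rand(3, 480, 640)                          # a fake RGB input image
proposals = [(50, 40, 180, 200), (300, 100, 420, 260)]   # made-up (x1, y1, x2, y2) boxes

for (x1, y1, x2, y2) in proposals:
    crop = image[:, y1:y2, x1:x2]                        # treat the proposal as its own image
    crop = torch.nn.functional.interpolate(crop.unsqueeze(0), size=(224, 224))  # resize for the CNN
    scores = cnn(crop)
    print(scores.argmax(dim=1))                          # predicted class for this region
```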

As you can imagine, running each of the 2000 region proposals through a CNN can be a very time-consuming process. Fast R-CNN, another CNN-based object detection method, was created to speed this up.

Fast R-CNN

Fast R-CNN goes about object detection slightly differently from the original R-CNN method but still has the same basics.

The architecture for Fast R-CNN looks more complicated than it is. The selective search algorithm used in a classic R-CNN generates around 2000 region proposals for each image. Consider training an R-CNN with 1000 images. That would be 2 million images passed through a CNN. The Fast R-CNN solves this.

A Fast R-CNN starts by feeding the input image to the CNN once before creating region proposals. This process generates a convolutional feature map. We will use this feature map in the next step of the Fast R-CNN.

After we generate a feature map of the image, we generate the region proposals of the image. We run these region proposals through a pooling layer called the region of interest (RoI) pooling layer. This is a great resource to learn more about the RoI pooling layer. To simplify, the RoI pooling layer takes the region proposals and shrinks and resizes them so that they can be fed into a fully connected layer. We then use a softmax layer to predict the class of the proposed region, and a regression layer to predict the bounding box’s position.
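torchvision ships an RoI pooling operation, so a minimal sketch of this step could look like the following. The feature map and the proposal boxes are made up.

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.rand(1, 256, 50, 50)   # fake convolutional feature map for one image

# Made-up region proposals in (batch_index, x1, y1, x2, y2) format, in feature-map coordinates
proposals = torch.tensor([[0.0, 4.0, 4.0, 20.0, 30.0],
                          [0.0, 10.0, 8.0, 44.0, 40.0]])

pooled = roi_pool(feature_map, proposals, output_size=(7, 7))
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- every proposal is now the same fixed size
```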

Many changes took place from a classic R-CNN to a Fast R-CNN. These changes increased the training speed and testing speed per image of the R-CNN.

Faster R-CNN

Fast R-CNN is significantly faster than the original R-CNN, but there is still room for improvement. Both algorithms (R-CNN & Fast R-CNN) use selective search to find the region proposals. Selective search is a slow process that limits the overall speed.

This is where Faster R-CNN comes in. Similar to Fast R-CNN, Faster R-CNN creates a convolutional feature map with the help of a CNN. However, instead of using the selective search algorithm on the feature map to identify the region proposals, a Faster R-CNN uses a region proposal network (RPN). An RPN has the same function as the selective search algorithm, but the two work very differently. To sum it up, an RPN predicts the possibility of a region being background or foreground. This is a great detailed explanation of how the RPN and Faster R-CNN in general work. After the RPN creates region proposals, the proposals are reshaped with an RoI pooling layer, which is then used to classify the image within the proposed region and predict the bounding box’s values.
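You normally would not build a Faster R-CNN from scratch. For example, torchvision (0.13+) ships a pretrained one, so running detection can be sketched roughly like this, with random data standing in for a real image:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained Faster R-CNN (ResNet-50 backbone + RPN + RoI heads), trained on COCO
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)   # stand-in for a real RGB image, values in [0, 1]
with torch.no_grad():
    predictions = model([image])

# Each prediction holds bounding boxes, class labels, and confidence scores
print(predictions[0]["boxes"].shape)
print(predictions[0]["labels"][:5], predictions[0]["scores"][:5])
```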

With each generation of the R-CNN, it got faster. The test speed for Faster R-CNN is almost 250x the test speed for the original R-CNN! Speed and performance for object detection programs are getting better and better.

Object Tracking

Object tracking is not as prevalent a field as object detection, but it still has its own applications. Object tracking is the process of following a specific object of interest, or multiple objects, across a sequence of images or a video.

There are two traditional ways to do object tracking: Mean Shift and Optical Flow.

Mean Shift

Mean shift is a popular algorithm that is mainly used in clustering and other unsupervised learning problems; it is similar to the K-Means algorithm. For tracking, mean shift works frame by frame. The algorithm starts by detecting the object in the first frame, then moves to the next frame and searches for the object again. Unlike regular object detection, mean shift only searches within a small region of interest, called the neighborhood. A similarity function scans the neighborhood and looks for the pixels that are most similar to the object. By repeating this process, the mean shift algorithm can track the object across frames. We will get back to the mean shift algorithm, but let’s move on to the optical flow algorithm first.
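OpenCV has a built-in mean shift tracker. Here is a minimal sketch based on the standard OpenCV recipe; the video file name and the initial window around the object are placeholders.

```python
import cv2

cap = cv2.VideoCapture("video.mp4")   # placeholder video file
ok, frame = cap.read()

x, y, w, h = 300, 200, 100, 80        # made-up initial window around the object
track_window = (x, y, w, h)

# Build a color histogram of the object; this acts as the similarity model
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # Shift the window toward the pixels that look most like the object
    _, track_window = cv2.meanShift(back_proj, track_window, term_crit)
    print("object is now around", track_window)
```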

Optical Flow

Optical flow tracks objects using the object’s motion. Let’s say you have an image with a plane in it. If you have another image with the same plane in a slightly different position, you can estimate the motion vectors using the two frames. If the plane moves at a certain velocity, you will also be able to use these motion vectors to predict the object's trajectory in the next frame.
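OpenCV also ships dense optical flow. Here is a minimal sketch (again with a placeholder video file) that estimates the motion vectors between two consecutive frames:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("video.mp4")   # placeholder video file
_, frame1 = cap.read()
_, frame2 = cap.read()

prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# Dense optical flow: one (dx, dy) motion vector per pixel
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("average motion per pixel:", np.mean(magnitude))
```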

Challenges

Mean shift and optical flow face many challenges. Occlusion is the first challenge. Occlusion is when an unintended object obstructs the view of the object being tracked; it is a big problem for any object tracking algorithm. The second challenge is a moving view. When an object is first detected, specific features are used to identify it while tracking; if the view moves, those features may no longer help track the object. The last challenge is specific to optical flow. Tracking with optical flow rests on four important prerequisites: brightness consistency, spatial coherence, temporal persistence, and limited motion. These are usually safe assumptions, but there are always exceptions. If one of the prerequisites is not fulfilled, the optical flow algorithm’s performance drops significantly.

The mean shift and optical flow algorithms have good tracking performance but are computationally complex, prone to noise, and suffer from many challenges.

Kalman Filter

The core idea of a Kalman filter is to use the available sensor and motion data, along with previous predictions, to make the best guess of an object’s current position.

The Kalman filter can deal with challenges very well. If a camera can see an object, the Kalman filter relies on sensor data for tracking. If an object is partially occluded, then the Kalman filter uses both sensor and motion data. If an object is fully occluded, it relies heavily on motion data. This allows the Kalman filter to deal with occlusion very well.
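Here is a heavily simplified, one-dimensional constant-velocity Kalman filter in NumPy to show the predict-then-correct idea. The noise values and the measurements are invented; a `None` measurement stands for a frame where the object is occluded.

```python
import numpy as np

# State: [position, velocity]; we assume the object moves at roughly constant velocity
x = np.array([0.0, 1.0])                 # initial guess
P = np.eye(2)                            # how uncertain we are about that guess
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # motion model: position += velocity each frame
H = np.array([[1.0, 0.0]])               # we can only measure position (the "sensor")
Q = np.eye(2) * 0.01                     # motion noise (assumed)
R = np.array([[0.5]])                    # sensor noise (assumed)

measurements = [1.1, 2.0, 2.9, None, None, 6.2]  # None = object occluded, no detection

for z in measurements:
    # Predict: use the motion model to guess where the object is now
    x = F @ x
    P = F @ P @ F.T + Q
    # Correct: if we actually saw the object, blend the sensor data into the guess
    if z is not None:
        y = np.array([z]) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
    print(f"estimated position: {x[0]:.2f}")
```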

Simple Online and Realtime Tracking (SORT)

I know I am jumping around a lot, but here is where it all comes together. SORT stands for Simple Online and Realtime Tracking. SORT has three main steps: detection, estimation, and target association. Detection is the first step. Before tracking an object, it is important to have a good detection, because the quality of detections has a significant impact on tracking performance. Estimation is the second step. This is where the Kalman filter comes in; it follows each object and predicts its position. Target association is the third and last step. Target association is where each new detection and its bounding box get matched to an object that is already being tracked. This is done by the Hungarian Algorithm, which compares the detections to the predicted locations of the tracked objects in the latest frame.
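The assignment step can be sketched with SciPy’s implementation of the Hungarian algorithm. The IoU (overlap) values in the cost matrix below are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = tracked objects (predicted by the Kalman filter), columns = new detections.
# Each entry is how well a detection overlaps a predicted box (IoU); values are invented.
iou = np.array([[0.80, 0.10, 0.05],
                [0.15, 0.70, 0.20],
                [0.02, 0.25, 0.60]])

# The Hungarian algorithm minimizes cost, so use negative IoU as the cost
track_idx, det_idx = linear_sum_assignment(-iou)

for t, d in zip(track_idx, det_idx):
    print(f"tracked object {t} is matched to detection {d} (IoU {iou[t, d]:.2f})")
```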

SORT, Kalman Filter, Mean Shift, and Optical Flow have a lot of math behind them. You can read more about each of them here.

There are many different ways to perform object tracking. Each method has its own advantages and disadvantages. Research in the field is still being done to this day, and these methods are being improved every day.

Semantic Segmentation

Semantic segmentation is another type of computer vision. Semantic segmentation is the process of classifying each pixel of an image as belonging to a particular class. It doesn’t differentiate across different instances of the same object. For example, if there are two dogs in an image, semantic segmentation gives the same label to all the pixels of both dogs. We are going to explore how a U-NET does semantic segmentation.

The U-NET is a type of Fully Convolutional Network (FCN). It was built for medical imaging purposes, such as finding tumors, but it also excels at semantic segmentation in general.

U-NET Operations

Before going into the architecture of a U-NET, let's dive deeper into the operations used in it. The first operation performed is convolution.

We talked about this operation in the Image Classification section, but I will restate its function. Convolution uses kernels to extract features from an image.

The second operation performed is max pooling. Max pooling is used to make the data less computationally intensive by shrinking the image's dimensions without losing any important features. Max pooling may be referred to as down sampling in this section.

Max pooling is done to make the image smaller. The output in semantic segmentation is a high-resolution image in which every pixel is classified. This is hard to produce from a small image because, by down sampling, the model better understands “WHAT” is present in the image, but it loses the information about “WHERE” it is present. To get the correct output, the image has to be up sampled to recover the “WHERE” information. Many techniques are used to up sample an image, like bi-linear interpolation, cubic interpolation, and nearest-neighbor interpolation. Most semantic segmentation networks use transposed convolution.

Transposed convolution is the third operation in a U-NET. Transposed convolution (otherwise known as deconvolution or fractionally strided convolution) is a technique to up sample an image. I will not delve into how transposed convolution works, but Naoki Shibuya does a good job explaining it in his post Up-sampling with Transposed Convolution. On a high level, transposed convolution is the opposite of convolution: the input is a low-resolution image, and the output is a higher-resolution image.
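In PyTorch this is a single layer. Here is a minimal sketch, with invented sizes, showing it doubling a feature map's resolution:

```python
import torch
import torch.nn as nn

low_res = torch.rand(1, 16, 8, 8)   # a fake low-resolution feature map

# With kernel_size=2 and stride=2 the output is twice as wide and tall
up = nn.ConvTranspose2d(in_channels=16, out_channels=8, kernel_size=2, stride=2)

high_res = up(low_res)
print(high_res.shape)  # torch.Size([1, 8, 16, 16])
```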

U-NET Architecture

The name U-NET comes from its architecture, which is shaped like a U. We know the operations in a U-NET, so now let's explore how those operations play a role in its architecture. There are two main parts of the U-NET architecture. The first path is the encoder path. The encoder path is a stack of convolution and pooling layers that is used to find what is in the image and classify it. The second path is the decoder path. The decoder path uses transposed convolution to find the precise location of each classification. Once the whole process is done, we end up with an image in which every pixel is classified.
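Here is a heavily scaled-down, U-NET-style encoder-decoder in PyTorch, just one level deep, with made-up channel counts and two output classes. It only shows the shape of the idea, including the skip connection between the encoder and decoder paths.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Encoder path: convolution + pooling ("WHAT" is in the image)
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Decoder path: transposed convolution ("WHERE" it is)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)   # one score per class for every pixel

    def forward(self, x):
        skip = self.enc(x)                       # high-resolution features
        x = self.bottleneck(self.pool(skip))     # low-resolution, "WHAT" information
        x = self.up(x)                           # up sample back to the original resolution
        x = torch.cat([x, skip], dim=1)          # skip connection restores "WHERE" detail
        x = self.dec(x)
        return self.head(x)

model = TinyUNet()
image = torch.rand(1, 1, 64, 64)   # a fake grayscale input image
print(model(image).shape)          # torch.Size([1, 2, 64, 64]) -- a class score per pixel
```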

There are lots of different ways to perform semantic segmentation, and U-NET is just one of them. To learn more about semantic segmentation and some other techniques, click here.

Instance Segmentation

Instance segmentation is when each pixel in an image is classified, and each instance (object) of a class is classified differently.

What is the difference between semantic segmentation and instance segmentation? Both types of computer vision classify every pixel in an image. The difference is that semantic segmentation does not classify each instance of an object differently, while instance segmentation does. Let’s say you have an image with two dogs. Semantic segmentation would color both dogs the same color and not differentiate between them because they belong to the same class. Instance segmentation would color both dogs different colors and give each dog its own label even though they belong to the same class.

Now that we know the difference between instance segmentation and semantic segmentation, let’s explore how instance segmentation is actually done. Mask R-CNN is the method most often used for instance segmentation. Remember the Faster R-CNN we used for object detection? Mask R-CNN is a variation of that. Let’s recap how a Faster R-CNN works.

First, a convolutional feature map is created by passing the image through a CNN. Then the region proposal network (RPN) creates region proposals for the Faster R-CNN to classify. This leaves us with an output image with bounding boxes around the classified objects.

That is how Faster R-CNN works. Mask R-CNN adds an extra step right after the RoI pooling layer pools the image.

Mask R-CNN can be separated into two main steps: an RPN and a binary mask classifier. The RPN is part of Faster R-CNN. The binary mask classifier is the extra added step; this is the step that colors the pixels in the image. After an object in the image is detected, Mask R-CNN runs a segmentation on it, treating each detected object as its own image. This produces the binary mask. The output is an image with each object classified and each of its pixels colored. This is how instance segmentation is performed.
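As with Faster R-CNN, torchvision ships a pretrained Mask R-CNN, so a rough sketch of instance segmentation looks like this, with random data standing in for a real image:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pretrained Mask R-CNN (Faster R-CNN plus the binary mask branch), trained on COCO
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)   # stand-in for a real RGB image, values in [0, 1]
with torch.no_grad():
    predictions = model([image])

# On top of boxes, labels, and scores, every detection now gets a per-pixel mask
print(predictions[0]["boxes"].shape)
print(predictions[0]["masks"].shape)   # (num_detections, 1, 480, 640)
```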

TL;DR

  • There are five main computer vision techniques: image classification, object detection, object tracking, semantic segmentation, and instance segmentation
  • Image Classification: Classifying what an image is (Done with classical CNN)
  • Object Detection: Detecting one or more objects in an image (Done with R-CNN)
  • Object Tracking: Tracking one or more objects in images or a video (Done with SORT)
  • Semantic Segmentation: Classifying each pixel in an image without differentiating between instances (Done with FCNs like U-NET)
  • Instance Segmentation: Classifying each pixel in an image while differentiating between instances (Done with Mask R-CNN)

Conclusion

Each of these computer vision techniques is different and has various applications in the real world, from self-driving cars to pose estimation. These techniques are being improved on every day. Computer vision will play a significant role in the future, so it is important to start experimenting and researching today.

I am Niyel, a 13-year-old who is interested in emerging technologies like AI, Gene Editing, Blockchain, and many more. I am currently a student at TKS, a human accelerator program. I am curious and always interested in new things.
LinkedIn | Youtube | Website
