Convolutional Neural Networks

1. Feature Detection
2. Filters detect Features
3. Convolutional Neural Networks (CNNs)
4. Object Detection
- 4.1. Object Detection with R-CNNs
- 4.2. Faster R-CNN
5. Semantic Segmentation: Fully Convolutional Networks
6. Continuous Control

Lecture 3: Convolutional Neural Networks: In which we learn about CNNs, how filters detect features, and how deep stacking of these filters can be used to do many computer vision tasks.

We use vision to recognize what is there and where it is, and also to predict future motion and intent. So the task of Computer Vision is more than simply detecting objects.

1. Feature Detection

To a computer, an image is input as a matrix of numbers, and image classification is done by high-level feature detection. (0:08:45)

0:09:53 Feature detection could be done manually:

Use Domain Knowledge, then
Define features then
Detect features to classify

But it's not easy. Defining features may be easy, but detecting them is hard because of: 0:11:00

Viewpoint variation
Illumination
Scale variation
Deformation
Occlusion
Background clutter
Intra-class variation

Figure 1: Difficulty in Feature Detection

Instead, use a neural network for feature detection.

Fully Connected Layers:
- Lose spatial information because the image is flattened
- Require a high number of parameters
Convolutional Filter (0:16:50)
- Preserves spatial information
- Needs fewer parameters
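
To make the parameter comparison concrete, here is a rough count for one layer of each kind. The image size (100x100 grayscale), the 32 units / 32 filters and the 3x3 kernel are hypothetical numbers picked for illustration, not values from the lecture.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter for a convolutional layer."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters


# Hypothetical sizes: a flattened 100x100 grayscale image feeding 32 units,
# versus 32 filters of size 3x3 sliding over the same image.
fc = dense_params(100 * 100, 32)
conv = conv_params(3, 3, 1, 32)
print(fc, conv)  # 320032 320
```

The convolutional layer reuses the same small set of weights at every image position, which is where the saving comes from.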

2. Filters detect Features

0:20:00

Figure 2: Example of Filters detecting features for X shape

0:23:21 Filters are applied to the input through the convolution operation.
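
A minimal NumPy sketch of that operation, assuming no padding and a stride of 1. The 5x5 "X" image and the 3x3 diagonal filter are made-up stand-ins for the ones in the figure, and `conv2d` is a hypothetical helper name. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes cross-correlation.

```python
import numpy as np


def conv2d(image, kernel):
    """Slide the kernel over the image; at each position multiply
    element-wise with the patch underneath and sum the result."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out


# Hypothetical 5x5 image: an "X" drawn with +1 on a -1 background.
x_image = -np.ones((5, 5))
for k in range(5):
    x_image[k, k] = 1
    x_image[k, 4 - k] = 1

# Hypothetical 3x3 filter for a top-left to bottom-right diagonal stroke.
diag_filter = -np.ones((3, 3))
for k in range(3):
    diag_filter[k, k] = 1

response = conv2d(x_image, diag_filter)
print(response)
```

The response is largest (9, the number of filter entries) where the patch matches the filter exactly, here at the top-left and bottom-right corners of the output, which is the sense in which a filter "detects" a feature.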

3. Convolutional Neural Networks (CNNs)

Figure 3: CNN: Spatial Arrangement of Output Volume

In a convolutional neural network, we apply (convolve) filters to the input image.

Each filter we apply outputs a 2D result.
So, after applying multiple filters to an input image, we get an output volume. The height and width of the output may be smaller than the input's (downsampled), the same, or larger.
The output volume can be convolved further to extract higher-level features from lower-level features.
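
The size of the output volume can be worked out with the usual convolution arithmetic. The sketch below assumes square inputs and filters; `conv_output_size` and `output_volume_shape` are hypothetical helper names, and the layer sizes are illustrative only. It covers the "smaller" and "same" cases; a larger output needs an upsampling operation, which is not modelled here.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size (height or width) after one convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def output_volume_shape(input_size, filter_size, n_filters, stride=1, padding=0):
    """Each filter contributes one 2D map, so depth equals the filter count."""
    side = conv_output_size(input_size, filter_size, stride, padding)
    return (side, side, n_filters)


# Hypothetical 32x32 input and 10 filters.
print(output_volume_shape(32, 5, 10))                       # (28, 28, 10) smaller
print(output_volume_shape(32, 5, 10, padding=2))            # (32, 32, 10) same
print(output_volume_shape(32, 3, 10, stride=2, padding=1))  # (16, 16, 10) downsampled
```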

For classification:

We finally feed the feature volume to a Fully connected network
then to softmax for classification
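
As a reminder of what the final softmax step does, here is a small sketch. The three class scores are made-up numbers standing in for the output of the fully connected network.

```python
import numpy as np


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)


# Hypothetical scores for three classes.
probs = softmax(np.array([2.0, 1.0, 0.1]))
predicted_class = int(np.argmax(probs))
print(probs, predicted_class)
```

The class with the highest score keeps the highest probability; softmax only rescales the scores so they can be read as a distribution over classes.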

In CNNs, the input is usually passed through:

Convolution layer (a convolution plus a bias)
Non-linearity (e.g. ReLU)
Pooling (e.g. Max-Pooling) (0:32:53)
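
A sketch of the last two steps, applied to a feature map that a convolution layer is assumed to have already produced. The 4x4 values are invented, and the 2x2 pooling window is an assumption.

```python
import numpy as np


def relu(x):
    """Non-linearity: negative responses are set to zero."""
    return np.maximum(x, 0)


def max_pool2d(x, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    h, w = x.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that don't fill a block
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


# Hypothetical 4x4 feature map (output of convolution + bias).
feature_map = np.array([[ 1., -2.,  3.,  0.],
                        [-1.,  5., -3.,  2.],
                        [ 0., -4.,  1., -1.],
                        [ 2.,  1., -6.,  4.]])

pooled = max_pool2d(relu(feature_map))
print(pooled)
# [[5. 3.]
#  [2. 4.]]
```

Pooling halves the height and width here, which is the downsampling referred to in these notes.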

Usually, as we pool (downsample), we increase the number of features. But recently (e.g. in Patches Are All You Need?), isotropic models have been shown to give good results. Isotropic models use the same height and width throughout the network (i.e. they don't downsample).

We can keep the first part of the network (the feature learning) and swap out the second part depending on the task: (0:36:10)

Classification
Object Detection
Segmentation
Probabilistic Control

4. Object Detection

0:38:51 Object detection is the problem of finding a bounding box where each detected object is. There may be multiple objects of different types and various sizes, so this is a difficult problem.

A naive method would be to:

Sample lots of different sizes and positions of boxes
Crop the image to each box, and send the crop to a classification network
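
A sketch of the box-sampling step shows why this approach gets expensive. `sliding_boxes` is a hypothetical helper, and the image size, box sizes and stride are invented for illustration; every box it returns would need its own pass through the classification network.

```python
def sliding_boxes(img_h, img_w, box_sizes, stride):
    """Enumerate (top, left, height, width) boxes on a regular grid."""
    boxes = []
    for box_h, box_w in box_sizes:
        for top in range(0, img_h - box_h + 1, stride):
            for left in range(0, img_w - box_w + 1, stride):
                boxes.append((top, left, box_h, box_w))
    return boxes


# Hypothetical 64x64 image, two box sizes, coarse stride of 8 pixels.
boxes = sliding_boxes(64, 64, box_sizes=[(16, 16), (32, 32)], stride=8)
print(len(boxes))  # 74
```

Even this tiny setup yields 74 crops to classify; finer strides, more box sizes and aspect ratios, and realistic image sizes make the count grow quickly.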

4.1. Object Detection with R-CNNs

0:41:39 Find regions that we think contain objects, then use a CNN to classify them. That is, one model proposes regions, and each proposed region is then passed to an object classification network.

Figure 4: R-CNNs

Demerits:

Brittle
The region extraction step is detached from the classification network

4.2. Faster R-CNN

0:42:01

Figure 5: Faster R-CNN

Learns the Region Proposal Network along with feature extraction and classification
Feeds the image to feature extraction only once
Grab all the regions, process them independently
- Pass the region through feature detection head again (?)
- Pass to classification network
Faster than R-CNN

5. Semantic Segmentation: Fully Convolutional Networks

0:43:29

Semantic segmentation is like one classification per pixel. Fully Convolutional Networks tackle this problem by using downsampling operations in the first half and upsampling operations in the second half of the network.
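
A toy sketch of the "one classification per pixel" idea on the output side. The 2x2 grid of coarse class scores is invented, and nearest-neighbour repetition is swapped in here as a simple stand-in for the upsampling layers of a real network, which are typically learned.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Repeat each spatial cell factor x factor times (axes 0 and 1)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


# Hypothetical coarse score map: 2x2 spatial grid, 3 classes per cell.
coarse_scores = np.array([[[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]],
                          [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]]])

full_scores = upsample_nearest(coarse_scores, 2)  # shape (4, 4, 3)
label_map = full_scores.argmax(axis=-1)           # one class index per pixel
print(label_map)
# [[1 1 0 0]
#  [1 1 0 0]
#  [2 2 1 1]
#  [2 2 1 1]]
```

The output has the same height and width as the (assumed 4x4) input image, with a class label at every pixel instead of a single label for the whole image.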

Figure 6: Semantic Segmentation

6. Continuous Control

0:45:15

Navigation from vision is a continuous control task: a model decides the steering angle from an input image. Here the output is a continuous probability distribution, which makes it different from the classification and segmentation tasks.

Figure 7: Navigation from Vision

In the above model, continuous control is done as follows:

The top part of the model doesn't see the route, and outputs a probabilistic control output
The bottom part sees the route, and outputs a path to take given the route

The loss function here is interesting because in reality we won't take multiple paths but a single one at a given intersection. However, after seeing a bunch of intersections the model will learn the different paths that can be taken. (0:46:26)

\(L = - \log(P(\theta | I, M))\)
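
A numeric sketch of this negative log-likelihood. It reads \(\theta\) as the steering angle actually taken and \(I\), \(M\) as the image and route inputs, and it assumes the predicted distribution is a one-dimensional mixture of Gaussians; both the reading of the symbols and the mixture form are assumptions made for illustration, and the numbers are invented.

```python
import numpy as np


def mixture_nll(theta, weights, means, stds):
    """-log P(theta) under a 1-D Gaussian mixture (assumed output form)."""
    weights, means, stds = (np.asarray(a, dtype=float) for a in (weights, means, stds))
    densities = np.exp(-0.5 * ((theta - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return -np.log(np.sum(weights * densities))


# Hypothetical prediction at an intersection: left (-0.5) or right (+0.5)
# are equally plausible steering angles; going straight (0.0) is not.
weights, means, stds = [0.5, 0.5], [-0.5, 0.5], [0.1, 0.1]

loss_left = mixture_nll(-0.5, weights, means, stds)
loss_right = mixture_nll(0.5, weights, means, stds)
loss_straight = mixture_nll(0.0, weights, means, stds)
print(loss_left, loss_right, loss_straight)
```

Whichever single turn the driver took, the loss is low as long as the model put probability mass on it, so the model is not penalized for also keeping the other valid path in its distribution. Only an angle the model considered unlikely, such as driving straight here, gives a high loss.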