Convolutional Neural Networks
Lecture 3: Convolutional Neural Networks: In which we learn about CNNs, how filters detect features, and how deep stacking of these filters can be used to do many computer vision tasks.
We use vision to recognize what is there and where it is, and also to predict future motion and intent. So the task of Computer Vision is more than simply detecting things.
1. Feature Detection
For a computer, an image is input as a matrix of numbers, and image classification is done through high-level feature detection (0:08:45)
0:09:53 Feature detection could be done manually:
- Use Domain Knowledge, then
- Define features then
- Detect features to classify
But it's not easy. Defining features may be easy, but detecting them is hard because of: 0:11:00
- Viewpoint variation
- Illumination
- Scale variation
- Deformation
- Occlusion
- Background clutter
- Intra-class variation
Figure 1: Difficulty in Feature Detection
Instead, use a neural network for feature detection.
- Fully Connected Layers:
  - Lose spatial information because the image is flattened
  - Require a high number of parameters
- Convolutional Filters (0:16:50)
  - Preserve spatial information
  - Need fewer parameters
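To make the parameter comparison concrete, here is a rough count for a single layer. The image size (224x224x3), the 64 units/filters and the 3x3 kernel are illustrative assumptions, not numbers from the lecture.

```python
def dense_params(height, width, channels, units):
    """Weights + biases of a fully connected layer applied to a flattened image."""
    return height * width * channels * units + units


def conv_params(kernel, in_channels, filters):
    """Weights + biases of a convolution layer with a square kernel.

    The count does not depend on the image size, because the same
    filter weights are reused at every spatial position.
    """
    return kernel * kernel * in_channels * filters + filters


# Hypothetical setup: 224x224 RGB image, 64 units vs. 64 filters of size 3x3
print(dense_params(224, 224, 3, 64))  # 9633856
print(conv_params(3, 3, 64))          # 1792
```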
2. Filters detect Features
0:20:00
Figure 2: Example of Filters detecting features for X shape
0:23:21 Filters are applied to input through Convolution Operation.
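A minimal sketch of the convolution operation in plain Python: slide the filter over the image, multiply element-wise, and sum. As in most deep learning libraries, the filter is not flipped (strictly speaking this is cross-correlation). The +1/-1 pixel values and the 3x3 diagonal filter are illustrative assumptions, loosely in the spirit of the X example in Figure 2.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding, no kernel flip)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # Element-wise multiply the patch with the filter and sum
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# Assumed toy input: a 5x5 "X" drawn with +1 (stroke) and -1 (background)
x_image = [
    [ 1, -1, -1, -1,  1],
    [-1,  1, -1,  1, -1],
    [-1, -1,  1, -1, -1],
    [-1,  1, -1,  1, -1],
    [ 1, -1, -1, -1,  1],
]
# Assumed filter: a diagonal stroke running top-left to bottom-right
diagonal = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
]

feature_map = convolve2d(x_image, diagonal)
for row in feature_map:
    print(row)
```

The response is highest (9) where the patch matches the filter exactly, i.e. at the top-left and bottom-right of the X, which is what "the filter detects the feature" means.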
3. Convolutional Neural Networks (CNNs)
Figure 3: CNN: Spatial Arrangement of Output Volume
In a convolutional neural network, we apply (convolve) filters to the input image.
- Each filter we apply outputs a 2D result (a feature map)
- So, after applying multiple filters to an input image we get an Output Volume. The height and width of the output may be smaller (downsampled), the same, or larger than those of the input.
- The output volume can be convolved further to extract higher-level features from lower-level features.
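The spatial size of the output volume can be sketched with the standard formula for a strided, padded convolution. The function name and the example sizes below are assumptions for illustration; the depth of the output equals the number of filters.

```python
def conv_output_shape(height, width, kernel, filters, stride=1, padding=0):
    """Shape (H, W, depth) of the output volume of one convolution layer."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w, filters


# No padding: the output shrinks
print(conv_output_shape(32, 32, kernel=5, filters=10))             # (28, 28, 10)
# Padding of 2 with a 5x5 kernel keeps height and width the same
print(conv_output_shape(32, 32, kernel=5, filters=10, padding=2))  # (32, 32, 10)
# Stride 2 roughly halves height and width (downsampling)
print(conv_output_shape(32, 32, kernel=3, filters=10, stride=2, padding=1))  # (16, 16, 10)
```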
For classification:
- Finally, we flatten the feature volume and feed it to a fully connected network
- then to softmax for classification
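The softmax step turns the raw scores from the fully connected network into class probabilities. A minimal sketch, with made-up scores for three classes:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```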
In CNNs the input is usually passed through
- Convolution Layer (a convolution plus a bias)
- Non-linearity (e.g. ReLU)
- Pooling (e.g. Max-Pooling) (0:32:53)
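The last two steps can be sketched in a few lines. This assumes a single 2D feature map and non-overlapping 2x2 pooling windows; the input values are made up.

```python
def relu(feature_map):
    """Non-linearity: replace negative values with 0."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; leftover rows/columns are dropped."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, h - size + 1, size):
        row = []
        for c in range(0, w - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical output of a convolution layer
fm = [
    [ 1, -2,  3,  0],
    [-1,  5, -3,  2],
    [ 0, -4,  2, -1],
    [ 7, -6, -8,  4],
]
activated = relu(fm)
pooled = max_pool(activated)
print(pooled)  # [[5, 3], [7, 4]]
```

Pooling halves the height and width here (4x4 to 2x2) while keeping the strongest response in each window.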
Usually, as we pool (downsample), we increase the number of features. But recently isotropic models (e.g. in Patches Are All You Need?) have been shown to give good results. Isotropic models use the same height and width throughout the network (i.e. they don't downsample between layers).
We can keep the first part of the network (the feature learning) and swap out the second part depending upon task: (0:36:10)
- Classification
- Object Detection
- Segmentation
- Probabilistic Control
4. Object Detection
0:38:51 Object detection is the problem of finding a bounding box where each detected object is. There may be multiple objects of different types and sizes, so this is a difficult problem.
The naive method would be to:
- Sample lots of different sizes and positions of boxes
- Clip the image to the box, and send to classification network
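A minimal sketch of the naive method, assuming images are lists of rows and boxes are `(top, left, height, width)` tuples. The function names, box sizes and stride are hypothetical; each crop would then be sent to a classification network, which is not shown.

```python
def sliding_boxes(img_h, img_w, box_sizes, stride):
    """Enumerate candidate boxes as (top, left, height, width)."""
    boxes = []
    for bh, bw in box_sizes:
        for top in range(0, img_h - bh + 1, stride):
            for left in range(0, img_w - bw + 1, stride):
                boxes.append((top, left, bh, bw))
    return boxes


def crop(image, box):
    """Clip the image to the box."""
    top, left, bh, bw = box
    return [row[left:left + bw] for row in image[top:top + bh]]


# Toy 8x8 "image" whose pixel values encode their own position
image = [[r * 8 + c for c in range(8)] for r in range(8)]
boxes = sliding_boxes(8, 8, box_sizes=[(4, 4), (8, 8)], stride=2)
print(len(boxes))                  # 10
print(crop(image, (2, 4, 2, 2)))   # [[20, 21], [28, 29]]
```

Even this tiny example yields 10 crops; with realistic image sizes, more box shapes and a smaller stride, the number of crops to classify grows very quickly, which is why this approach is impractical.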
4.1. Object Detection with R-CNNs
0:41:39 Find regions that we think contain objects, then use a CNN to classify them. So, one model proposes regions, and each proposed region is then passed to an object classification network.
Figure 4: R-CNNs
Demerits:
- Brittle
- Region Extraction network is detached from classification network
4.2. Faster R-CNN
0:42:01
Figure 5: Faster R-CNN
- Learns the Region Proposal Network along with feature extraction and classification
- Feeds the image to feature extraction only once
- Grab all the regions, process them independently
- Pass the region through feature detection head again (?)
- Pass to classification network
- Faster than R-CNN
5. Semantic Segmentation: Fully Convolutional Networks
0:43:29
Semantic Segmentation is like making one classification per pixel. Fully Convolutional Networks tackle this problem by using downsampling operations in the first half of the network and upsampling operations in the second half.
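Two pieces of this idea can be sketched in plain Python: an upsampling step and the final per-pixel classification. Nearest-neighbour upsampling is used here only as a simple stand-in (real networks typically learn the upsampling), and the score values are made up.

```python
def upsample_nearest(feature_map, factor=2):
    """Enlarge a 2D map by repeating each value `factor` times in both directions."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def segment(class_scores):
    """class_scores[k][r][c] is the score of class k at pixel (r, c).

    Returns a label map with the highest-scoring class for each pixel.
    """
    num_classes = len(class_scores)
    h, w = len(class_scores[0]), len(class_scores[0][0])
    return [
        [max(range(num_classes), key=lambda k: class_scores[k][r][c])
         for c in range(w)]
        for r in range(h)
    ]


# Hypothetical per-pixel scores for 2 classes on a 2x2 image
scores = [
    [[0.9, 0.2], [0.1, 0.3]],  # class 0
    [[0.1, 0.8], [0.9, 0.7]],  # class 1
]
labels = segment(scores)
print(labels)                    # [[0, 1], [1, 1]]
print(upsample_nearest(labels))  # 4x4 label map
```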
Figure 6: Semantic Segmentation
6. Continuous Control
0:45:15
Navigation from vision is a continuous control task: a model decides the steering angle from an input image. Here the output is a continuous probability distribution, which makes it different from the classification and segmentation tasks.
Figure 7: Navigation from Vision
In the above model, continuous control is done as follows:
- The top part of the model doesn't see the route, and outputs a probabilistic control output
- The bottom part sees the route, and outputs a path to take given the route
The loss function here is interesting because in reality we won't take multiple paths but a single one at a given intersection. However, after seeing a bunch of intersections the model will learn the different paths that can be taken. (0:46:26)
\(L = - \log(P(\theta | I, M))\)
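A minimal sketch of this loss, reading \(\theta\) as the steering angle actually taken, \(I\) as the input image and \(M\) as the map or route (this reading is an interpretation; the notes do not define the symbols). The notes also do not say what form the predicted distribution takes, so a mixture of Gaussians is assumed here purely for illustration, with made-up mixture parameters standing in for the network's output.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def mixture_nll(theta, weights, means, stds):
    """L = -log P(theta), with P assumed to be a Gaussian mixture."""
    density = sum(w * gaussian_pdf(theta, m, s)
                  for w, m, s in zip(weights, means, stds))
    return -math.log(density)


# Hypothetical prediction at an intersection: two equally likely turns
weights, means, stds = [0.5, 0.5], [-0.5, 0.5], [0.1, 0.1]

loss_right_turn = mixture_nll(0.5, weights, means, stds)  # driver turned right
loss_straight = mixture_nll(0.0, weights, means, stds)    # driver went straight
print(loss_right_turn, loss_straight)
```

Only the single angle that was actually driven enters the loss, but a multi-modal prediction is not penalised for also placing probability on the other turn. Across many intersections this lets the model learn all the paths that can be taken.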