Receptive field size and object size in deep learning


Problem Description

I can calculate the receptive field sizes for VGGNet with a 500 x 500 input image.

The receptive field sizes are as follows.

Layer Name = conv1, Output size = 500, Stride =   1, RF size =   3
Layer Name = relu1_1, Output size = 500, Stride =   1, RF size =   3
Layer Name = conv1_2, Output size = 500, Stride =   1, RF size =   5
Layer Name = relu1_2, Output size = 500, Stride =   1, RF size =   5
Layer Name = pool1, Output size = 250, Stride =   2, RF size =   6
Layer Name = conv2_1, Output size = 250, Stride =   2, RF size =  10
Layer Name = relu2_1, Output size = 250, Stride =   2, RF size =  10
Layer Name = conv2_2, Output size = 250, Stride =   2, RF size =  14
Layer Name = relu2_2, Output size = 250, Stride =   2, RF size =  14
Layer Name = pool2, Output size = 125, Stride =   4, RF size =  16
Layer Name = conv3_1, Output size = 125, Stride =   4, RF size =  24
Layer Name = relu3_1, Output size = 125, Stride =   4, RF size =  24
Layer Name = conv3_2, Output size = 125, Stride =   4, RF size =  32
Layer Name = relu3_2, Output size = 125, Stride =   4, RF size =  32
Layer Name = conv3_3, Output size = 125, Stride =   4, RF size =  40
Layer Name = relu3_3, Output size = 125, Stride =   4, RF size =  40
Layer Name = pool3, Output size =  62, Stride =   8, RF size =  44
Layer Name = conv4_1, Output size =  62, Stride =   8, RF size =  60
Layer Name = relu4_1, Output size =  62, Stride =   8, RF size =  60
Layer Name = conv4_2, Output size =  62, Stride =   8, RF size =  76
Layer Name = relu4_2, Output size =  62, Stride =   8, RF size =  76
Layer Name = conv4_3, Output size =  62, Stride =   8, RF size =  92
Layer Name = relu4_3, Output size =  62, Stride =   8, RF size =  92
Layer Name = pool4, Output size =  31, Stride =  16, RF size = 100
Layer Name = conv5_1, Output size =  31, Stride =  16, RF size = 132
Layer Name = relu5_1, Output size =  31, Stride =  16, RF size = 132
Layer Name = conv5_2, Output size =  31, Stride =  16, RF size = 164
Layer Name = relu5_2, Output size =  31, Stride =  16, RF size = 164
Layer Name = conv5_3, Output size =  31, Stride =  16, RF size = 196
Layer Name = relu5_3, Output size =  31, Stride =  16, RF size = 196

I look only at layers up to conv5_3.

For example, suppose my object size is 150 x 150 and my image size is 500 x 500.

Can I say that the feature maps of the earlier layers, from conv1 to conv4_2, carry only partial features of my object, while those from conv5_2 to conv5_3 carry the features of almost the whole object?

Is my consideration true?

But at conv5_3 the output size is only 31 x 31, so I can't visualize how it represents the whole object in the image, even though every pixel of the conv5_3 layer represents a 196 x 196 region of the original 500 x 500 image.

Is my consideration true?

Solution

Theoretically...

Can I say that the feature maps of the earlier layers, from conv1 to conv4_2, carry only partial features of my object, while those from conv5_2 to conv5_3 carry the features of almost the whole object? Is my consideration true?

Yes! You even calculated the receptive field yourself (for a CNN, this is the set of pixels in the image that can theoretically affect the value of one cell of the feature map)!
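
For reference, here is a minimal sketch in Python of the standard recurrence behind that table (assuming the plain VGG16 stack: 3x3 convolutions with padding 1 and stride 1, and 2x2 max pooling with stride 2). Stacking a layer with kernel k and stride s on top of features with cumulative stride S and receptive field R gives R_new = R + (k - 1) * S and S_new = S * s:

layers = [  # (name, kernel, stride); ReLU layers (k = 1) change nothing
    ("conv1", 3, 1), ("relu1_1", 1, 1), ("conv1_2", 3, 1), ("relu1_2", 1, 1),
    ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("relu2_1", 1, 1), ("conv2_2", 3, 1), ("relu2_2", 1, 1),
    ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("relu3_1", 1, 1), ("conv3_2", 3, 1), ("relu3_2", 1, 1),
    ("conv3_3", 3, 1), ("relu3_3", 1, 1),
    ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("relu4_1", 1, 1), ("conv4_2", 3, 1), ("relu4_2", 1, 1),
    ("conv4_3", 3, 1), ("relu4_3", 1, 1),
    ("pool4", 2, 2),
    ("conv5_1", 3, 1), ("relu5_1", 1, 1), ("conv5_2", 3, 1), ("relu5_2", 1, 1),
    ("conv5_3", 3, 1), ("relu5_3", 1, 1),
]

size, stride, rf = 500, 1, 1
for name, k, s in layers:
    rf += (k - 1) * stride   # RF grows by (kernel - 1) times the cumulative stride
    stride *= s              # cumulative stride w.r.t. the input image
    if k == 2 and s == 2:    # pooling halves the map; the padded convs keep it
        size //= 2
    print(f"Layer Name = {name}, Output size = {size}, Stride = {stride}, RF size = {rf}")

Running it reproduces the table above, including the RF of 196 at conv5_3.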

But at conv5_3 the output size is only 31 x 31, so I can't visualize how it represents the whole object in the image, even though every pixel of the conv5_3 layer represents a 196 x 196 region of the original 500 x 500 image. Is my consideration true?

Yes! But don't forget that although the feature map size is only 31x31, the stride of your features is 16. So each cell of the conv5_3 feature map represents a 196x196 region of the image (keep in mind that if the "input window" does not fit inside the image, the rest of the "input window" is black, i.e. zero-padded), and neighbouring cells are 16 pixels apart. So that 31x31 feature map still fully captures the image (just with a huge stride).
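
To make that concrete, here is a small sketch that maps a conv5_3 cell back to its theoretical input window (the exact window centre depends on the padding convention, so treat the offsets as approximate):

def input_window(ix, iy, stride=16, rf=196):
    # Approximate centre of the cell's window, in input-image coordinates.
    cx = ix * stride + stride // 2
    cy = iy * stride + stride // 2
    half = rf // 2
    # Coordinates outside [0, 500) fall into the zero padding.
    return (cx - half, cy - half, cx + half, cy + half)

print(input_window(0, 0))    # (-90, -90, 106, 106): a corner cell, mostly padding
print(input_window(15, 15))  # (150, 150, 346, 346): a central cell, a full patch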


Effectively...

Okay, above we were talking about the theoretical receptive field, that is, the pixels in the image that have a probability larger than 0 of affecting one cell (or pixel) of the feature map (the 31x31 map, in this case). However, in practice, it heavily depends on the weights of your convolution kernels.

Take a look at this post about the effective receptive field (ERF) of CNNs (or, if you have plenty of time, go straight to the original paper, "Understanding the Effective Receptive Field in Deep Convolutional Neural Networks" by Luo et al., 2016).

In theory, stacking more layers increases the receptive field linearly; in practice, however, things aren't as simple as we thought: not all pixels in the receptive field contribute equally to the output unit's response.

What is actually even more interesting is that this receptive field is dynamic and changes during training. The impact of this on backpropagation is that the central pixels have a larger gradient magnitude than the border pixels.
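
You can probe this on your own network. Here is a rough sketch in the spirit of that gradient-based analysis (assuming PyTorch and torchvision's VGG16 implementation): backpropagate from a single central cell of conv5_3 and look at the gradient magnitude on the input.

import torch
from torchvision.models import vgg16

net = vgg16(weights=None).features[:29]  # indices 0..28 cover conv1_1 .. conv5_3
x = torch.randn(1, 3, 500, 500, requires_grad=True)
out = net(x)                                # shape (1, 512, 31, 31)
h, w = out.shape[-2:]
out[0, :, h // 2, w // 2].sum().backward()  # seed the gradient at one central cell
influence = x.grad.abs().sum(dim=1)[0]      # (500, 500) map of per-pixel influence
print((influence > 0).sum())                # nonzero support = theoretical RF
# With trained weights, most of the mass concentrates near the centre: the ERF.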

Take a look at the figures in the paper that visualize the ERF: the effective receptive field does not cover the whole patch at all! So don't be surprised if the ERF of conv5_3 is much smaller than 196x196.


Also...

Apart from the size of the receptive field, which basically says "this cell on the feature map compresses valuable data from this patch of the image", you also need those features to be expressive enough. So take a look at this post, or search "vgg visualization" on Google, to get some intuition about the expressiveness of the features themselves.
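
As a starting point, here is a minimal sketch (again assuming torchvision's VGG16; the slice index follows that implementation's layer ordering) for pulling out the relu5_3 feature map so you can plot individual channels:

import torch
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.DEFAULT).features[:30]  # up to relu5_3
model.eval()
x = torch.randn(1, 3, 500, 500)  # substitute a real, normalized image here
with torch.no_grad():
    fmap = model(x)
print(fmap.shape)                # torch.Size([1, 512, 31, 31])
# Each of the 512 channels is a 31x31 map; showing a few of them as images is
# the simplest form of "vgg visualization".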
