Yolo v3 model output clarification with Keras


Problem description

I'm using the YOLO v3 model with Keras, and this network gives me an output container with shapes like this:

[(1, 13, 13, 255), (1, 26, 26, 255), (1, 52, 52, 255)]
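
For context, a minimal sketch of how these shapes can be reproduced. It assumes the YOLOv3 weights have already been converted to a Keras `.h5` file (the file name here is hypothetical) and uses the standard 416 x 416 input size:

```python
import numpy as np
from tensorflow import keras

# Hypothetical path: assumes the YOLOv3 weights were converted to a Keras
# .h5 file (e.g. with a darknet-to-keras conversion script).
yolo_model = keras.models.load_model("yolo_v3.h5", compile=False)

dummy_image = np.zeros((1, 416, 416, 3), dtype=np.float32)  # one 416x416 RGB image
outputs = yolo_model.predict(dummy_image)

for out in outputs:
    print(out.shape)  # (1, 13, 13, 255), (1, 26, 26, 255), (1, 52, 52, 255)
```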

So I found this link.

Then I understood the value 255 in each of the 3 containers, and I also understood that there are 3 containers because there are 3 different image scales for bounding box creation.

But I did not understand why in the output vector there are 13 * 13 lists for the first scale, then 26 * 26 lists for the second, then 52 * 52 for the last.

I can't manage to find a good explanation of this, so I can't use the network. If someone knows where I can find some information about the output dimensions, I would be very grateful.

EDIT

Is it because, if I cut the image into 13 by 13 sections, I'm only able to detect 13 * 13 objects, considering that each section is the center of an object?

Recommended answer

YOLOv3 has 3 output layers. These output layers predict box coordinates at 3 different scales. YOLOv3 also operates by dividing the image into a grid of cells. Depending on which output layer you look at, the number of cells differs.

So the number of outputs is right: 3 lists (because of the three output layers). You must consider that YOLOv3 is fully convolutional, which means each output layer has shape width x height x filters. Look at the first shape, (1, 13, 13, 255). You already understand that 255 stands for the bounding box coordinates, class probabilities, and confidence, and that 1 stands for the batch size. Since the output comes from a conv2d layer, the part in question is the 13 x 13: it means that your input image is divided into a 13 x 13 grid, and for each cell of the grid, bounding box coordinates, class probabilities, etc. are predicted. The second output layer operates at a different scale, dividing your image into a 26 x 26 grid; the third divides it into a 52 x 52 grid, and again bounding boxes are predicted for every cell of the grid.
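
As a concrete illustration of how the 255 channels break down, here is a short sketch (my own, not from the answer) that reshapes one output tensor into per-cell, per-anchor predictions, assuming the standard COCO setup of 3 anchors per cell and 80 classes, so 3 * (4 + 1 + 80) = 255:

```python
import numpy as np

num_anchors = 3    # anchor boxes predicted per grid cell
num_classes = 80   # COCO classes: 3 * (4 + 1 + 80) = 255 channels

# Stand-in for the first output layer; a real prediction has the same shape.
out = np.random.rand(1, 13, 13, 255).astype(np.float32)

# Regroup the flat 255 channels into (batch, grid_h, grid_w, anchor, prediction).
out = out.reshape(1, 13, 13, num_anchors, 4 + 1 + num_classes)

box_xy      = out[..., 0:2]  # x, y offsets of the box center within the cell
box_wh      = out[..., 2:4]  # width/height terms, scaled by the anchor priors
objectness  = out[..., 4:5]  # confidence that this cell/anchor contains an object
class_probs = out[..., 5:]   # one score per class

print(box_xy.shape)  # (1, 13, 13, 3, 2): one box center per cell per anchor
```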

Why is this useful? From a practical point of view, imagine a picture with many little pigeons concentrated in one place. When you have only one 13 x 13 output layer, all these pigeons can fall into one grid cell, so you don't detect them one by one. But if you divide your image into a 52 x 52 grid, the cells are smaller and there is a higher chance that you detect them all. Detection of small objects was a complaint against YOLOv2, so this is the response.

From a more machine-learning point of view, this is an implementation of something called a feature pyramid. The concept was popularized by the RetinaNet architecture.

You process the input image, applying convolutions, max pooling, etc. up to some point, and use the resulting feature map as input to your first output layer (13 x 13 in the YOLOv3 case). Then you upscale the feature map that fed the 13 x 13 layer and concatenate it with a feature map of the corresponding size taken from an earlier part of the network. So the next output layer takes as input both upscaled features that were processed all the way through the network and features that were computed earlier, which leads to better accuracy. For YOLOv3 you then take these upscaled, concatenated features once more, upscale and concatenate them with earlier features again, and use the result as input for the third output layer.
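
To make the upscale-and-concatenate pattern concrete, here is a deliberately simplified Keras sketch of the idea. The layer sizes and filter counts are illustrative only; this is not the real YOLOv3/Darknet-53 backbone:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Toy backbone: each strided conv halves the spatial size (416 -> 208 -> ... -> 13).
inputs = keras.Input(shape=(416, 416, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)   # 208
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)        # 104
c52 = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)     # 52
c26 = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(c52)   # 26
c13 = layers.Conv2D(512, 3, strides=2, padding="same", activation="relu")(c26)   # 13

out13 = layers.Conv2D(255, 1)(c13)            # first output head: (13, 13, 255)

up26 = layers.UpSampling2D(2)(c13)            # upscale 13 -> 26
merged26 = layers.Concatenate()([up26, c26])  # reuse earlier, higher-resolution features
out26 = layers.Conv2D(255, 1)(merged26)       # second head: (26, 26, 255)

up52 = layers.UpSampling2D(2)(merged26)       # upscale 26 -> 52
merged52 = layers.Concatenate()([up52, c52])  # concatenate with even earlier features
out52 = layers.Conv2D(255, 1)(merged52)       # third head: (52, 52, 255)

model = keras.Model(inputs, [out13, out26, out52])
model.summary()
```

The key point is only the routing: each finer-grained head sees both deeply processed, upscaled features and higher-resolution features taken from earlier in the network.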
