Understanding and evaluating template matching methods


Question


OpenCV has the matchTemplate() function, which operates by sliding the template across the input image and generating an array output whose values indicate how well the template matches at each location.

Where can I learn more about how to interpret the six TemplateMatchModes?

I've read through and implemented code based on the tutorial, but other than understanding that one looks for minimum results for TM_SQDIFF for a match and maximums for the rest, I don't know how to interpret the different approaches, and the situations where one would choose one over another.

For example (taken from the tutorial)

import cv2 as cv
import numpy as np

# img_gray and template are grayscale images loaded earlier in the tutorial.
res = cv.matchTemplate(img_gray, template, cv.TM_CCOEFF_NORMED)
threshold = 0.8
loc = np.where(res >= threshold)

and

R(x, y) = \frac{\sum_{x',y'} \left( T'(x',y') \cdot I'(x+x', y+y') \right)}{\sqrt{\sum_{x',y'} T'(x',y')^{2} \cdot \sum_{x',y'} I'(x+x', y+y')^{2}}}

(taken from the doc page; not sure how to do equation formatting)

I would infer that TM_CCOEFF_NORMED would return values between 0 and 1, and that the 0.8 threshold is arbitrary, but that is just supposition.

Are there deeper dives into the equations online, measurements of performance against standard datasets, or academic papers about the different modes and when and why to use one over another?

Solution

All of the template matching modes can be classified roughly as a dense (meaning pixel-wise) similarity metric, or equivalently but inversely, a distance metric between images.

Generally, you will have two images and you want to compare them in some way. Off the bat, template matching doesn't directly help you match things that are scaled, rotated, or warped. Template matching is strictly concerned with measuring the similarity of two images exactly as they appear. However, the actual metrics used here are used everywhere in computer vision, including finding transformations between images...it's just that there are usually additional, more complex steps involved (like gradient descent to find the optimal transformation parameters).

There are many choices for distance metrics, and they generally have pros and cons depending on the application.


Sum of absolute differences (SAD)

For a start, the most basic distance metric is just the absolute difference between two values, i.e. d(x, y) = abs(x - y). For images, an easy way to extend this from single values is simply to sum all of these distances, pixel-wise, leading to the sum of absolute differences (SAD) metric; it is also known as the Manhattan or taxicab distance, and defines the L1 norm. Annoyingly, this isn't implemented as one of OpenCV's template matching modes, but it's still important in this discussion as a comparison to SSD.

In the template matching scenario, you slide a template along multiple places and simply find where the smallest difference occurs. It is equivalent to asking what the index of the closest value to 5 is in the array [1, 4, 9]. You take the absolute difference of each value in the array with 5, and index 1 has the smallest difference, so that's the location of the closest match. Of course, in template matching the value isn't 5 but an array, and the image is a larger array.
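
Since OpenCV doesn't expose SAD as a matching mode, here is a minimal sketch of it in NumPy (the function name sad and the toy arrays are mine, for illustration only):

import numpy as np

def sad(patch, template):
    # Sum of absolute differences: the L1 distance between equal-sized arrays.
    return np.sum(np.abs(patch.astype(np.float64) - template.astype(np.float64)))

# The "closest value to 5 in [1, 4, 9]" example from above:
arr = np.array([1, 4, 9])
print(np.abs(arr - 5))  # [4 1 4] -> index 1 has the smallest difference
print(sad(np.array([1, 2, 3]), np.array([4, 5, 6])))  # 9.0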

Sum of square differences (SSD): TM_SQDIFF

An interesting feature of the SAD metric is that it doesn't penalize really big differences any more than a bunch of really small differences. Let's say we want to compute d(a, b) and d(a, c) with the following vectors:

a = [1, 2, 3]
b = [4, 5, 6]
c = [1, 2, 12]

Taking the sums of absolute differences element-wise, we see

SAD(a, b) = 3 + 3 + 3 = 9 = 0 + 0 + 9 = SAD(a, c)

In some applications, maybe that doesn't matter. But in other applications, you might want these two distances to actually be quite different. Squaring the differences, instead of taking their absolute value, penalizes values that are further from what you expect---it makes the images more distant as the difference in value grows. It maps more to how someone might explain an estimate as being way off, even if in value it's not actually that distant. The sum of square differences (SSD) is equivalent to the squared Euclidean distance, the distance function for the L2 norm. With SSD, we see our two distances are now quite different:

SSD(a, b) = 3^2 + 3^2 + 3^2 = 27 != 81 = 0^2 + 0^2 + 9^2 = SSD(a, c)

You may see that the L1 norm is sometimes called a robust norm. This is specifically because a single point of error won't grow the distance more than the error itself. But of course with SSD, an outlier will make the distance much larger. So if your data is somewhat prone to a few values that are very distant, note that SSD is probably not a good similarity metric for you. A good example might be comparing images that may be overexposed. In some part of the image, you may just have white sky where the other is not white at all, and you'll get a massive distance between images from that.

Both SAD and SSD have a minimum distance of 0, when the two images compared are identical. They're both always non-negative since the absolute differences or square differences are always non-negative.
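
To make this concrete, here is a self-contained sketch (mine, using synthetic data): cut a patch out of an image and match it back with TM_SQDIFF. An exact match gives a distance of 0, and the best match is the minimum of the result, not the maximum:

import cv2 as cv
import numpy as np

np.random.seed(0)
img = np.random.randint(0, 256, (100, 100), dtype=np.uint8)
template = img[30:40, 50:60]  # a 10x10 patch cut straight out of the image

res = cv.matchTemplate(img, template, cv.TM_SQDIFF)
min_val, max_val, min_loc, max_loc = cv.minMaxLoc(res)
print(min_loc, min_val)  # (50, 30) 0.0 -- the minimum marks the match location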

Cross correlation (CC): TM_CCORR

SAD and SSD are both generally discrete metrics---so they're a natural consideration for sampled signals, like images. Cross correlation, however, applies to continuous (and therefore analog) signals as well, which is part of why it is so ubiquitous in signal processing. For signals broadly, trying to detect the presence of a template inside a signal is known as a matched filter, and you can basically think of it as the continuous analog of template matching.

Cross correlation just multiplies the two images together. You can imagine that if the two signals line up exactly, multiplying them together will simply square the template. If they're not lined up just-so, then the product will be smaller. So, the location where the product is maximized is where they line up the best. However, there is a problem with cross correlation in the case when you're using it as a similarity metric of signals you're not sure are related, and that is usually shown in the following example. Suppose you have three arrays:

a = [2, 600, 12]
b = [v, v, v]
c = [2v, 2v, 2v]

Broadly, there's no obvious correlation between a and b, nor between a and c. And generally, a shouldn't correlate any more to b than to c. But it's a product, and thus ccorr(a, c) = 2*ccorr(a, b). So that's not ideal for trying to find a template inside a larger image. And because we're dealing with discrete digital signals that have a defined maximum value (images), a bright white patch of the image will basically always have the maximum correlation. Because of this issue, TM_CCORR is not particularly useful as a template matching method.
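
A quick numeric sketch of that scaling problem (my own illustration, taking v = 10):

import numpy as np

def ccorr(f, g):
    # Raw cross-correlation at a single offset: the sum of element-wise products.
    return np.sum(f * g)

a = np.array([2, 600, 12])
b = np.full(3, 10)  # [v, v, v] with v = 10
c = np.full(3, 20)  # [2v, 2v, 2v]
print(ccorr(a, b), ccorr(a, c))  # 6140 12280 -- the brighter signal always wins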

Mean shifted cross correlation (Pearson correlation coefficient): TM_CCOEFF

One simple way to solve the problem of correlating with bright patches is to simply subtract off the mean before comparing the signals. That way, signals that are simply shifted have the same correlation as those that are unshifted. And this makes sense with our intuition---signals that vary together are correlated.
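
Continuing the same toy example (again my own sketch, not OpenCV's exact TM_CCOEFF formula, which operates on image patches), subtracting the means first removes the flat-brightness advantage entirely:

import numpy as np

def ccoeff(f, g):
    # Cross-correlation after subtracting each signal's mean.
    return np.sum((f - f.mean()) * (g - g.mean()))

a = np.array([2.0, 600.0, 12.0])
b = np.full(3, 10.0)
c = np.full(3, 20.0)
print(ccoeff(a, b), ccoeff(a, c))  # 0.0 0.0 -- constant patches no longer correlate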

Normalization: TM_SQDIFF_NORMED, TM_CCORR_NORMED, TM_CCOEFF_NORMED

All of the methods in OpenCV are normalized the same. The point of normalization is not to give a confidence/probability, but to give a metric that you can compare against templates of different sizes or with values at different scales. For example, let's say we want to find if an object is in an image, and we have two different templates of this object at different sizes. We could just normalize by the number of pixels, which would work to compare templates of different sizes. However, say my templates are actually quite different in intensities, like one has much higher variance of the pixel values than the other. Typically, what you'd do in this case is divide by the standard deviation (the square root of the sum of squared differences from the mean). OpenCV does do this with the TM_CCOEFF_NORMED method, since the sum of squared differences from the mean is the variance, but the other methods aren't mean shifted, so their scaling is just a measure of the sum of the image values. Either way, the result is similar: you want to scale by something that relates to the intensity of the image patches used.
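
As a rough sketch of what normalization buys you (reusing the synthetic image idea from earlier; the exact values will vary), the raw SSD scores grow with template area while the normed scores stay on a comparable scale:

import cv2 as cv
import numpy as np

np.random.seed(0)
img = np.random.randint(0, 256, (100, 100), dtype=np.uint8)
small = img[30:40, 50:60]  # 10x10 template
large = img[20:50, 40:70]  # 30x30 template

for tpl in (small, large):
    raw = cv.matchTemplate(img, tpl, cv.TM_SQDIFF)
    normed = cv.matchTemplate(img, tpl, cv.TM_SQDIFF_NORMED)
    # The raw maximum grows with template area; the normed one does not.
    print(raw.max(), normed.max())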

Other metrics

There are other useful metrics that OpenCV does not provide. Matlab provides SAD, as well as the maximum absolute difference metric (MaxAD), which is also known as the uniform distance metric and gives the L∞ norm. Basically, you take the max absolute difference instead of the sum of them. Other metrics that are used are typically seen in optimization settings, for example the enhanced correlation coefficient which was first proposed for stereo matching, and then later expanded for alignment in general. That method is used in OpenCV, but not for template matching; you'll find the ECC metric in computeECC() and findTransformECC().
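
MaxAD is trivial to write yourself if you need it (a sketch; the name maxad is mine):

import numpy as np

def maxad(patch, template):
    # Maximum absolute difference: the L-infinity (uniform/Chebyshev) distance.
    return np.max(np.abs(patch.astype(np.float64) - template.astype(np.float64)))

print(maxad(np.array([1, 2, 3]), np.array([1, 2, 12])))  # 9.0 -- worst single element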


Which method to use?

Most often, you will see normed and un-normed SSD (TM_SQDIFF_NORMED, TM_SQDIFF), and zero-normalized cross-correlation / ZNCC (TM_CCOEFF_NORMED) used. Sometimes you may see TM_CCORR_NORMED, but less often. According to some lecture notes I found online (some nice examples and intuition there on this topic!), Trucco and Verri's CV book states that generally SSD works better than correlation, but I don't have T&V's book to see why they suggest that; presumably the comparison is on real-world photographs. But despite that, SAD and SSD are definitely useful, especially on digital images.

I don't know of any definitive examples of one or the other being inherently better in most cases or something---I think it really depends on your imagery and template. Generally I'd say: if you're looking for exact or very close to exact matches, use SSD. It is fast, and it definitely maps to what you're trying to minimize (the difference between the template and image patch). There's no need to normalize in that case, it is just added overhead. If you have similar requirements but need multiple templates to be comparable, then normalize the SSD. If you're looking for matches, but you're working with real-world photographs that may have exposure or contrast differences, the mean shifting and variance equalization from ZNCC will likely be the best.

As for picking the right threshold, the value from ZNCC or SSD is not a confidence or probability number at all. If you want to pick the right threshold, you can measure the parameter in any number of typical ways. You can calculate ROC curves or PR curves for different thresholds. You can use regression to find the optimal parameter. You'll need to label some data, but then at least you'll have measurements of how you're doing against some test set so that your choice is not arbitrary. As usual with a data-filled field, you'll need to make sure your data is as close to real world examples as possible, and that your test data covers your edge cases as well as your typical images.
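
For instance, a minimal sketch of a threshold sweep using scikit-learn's precision_recall_curve (the scores and labels below are placeholders standing in for measurements you'd collect from your own labeled test set):

import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder data: peak TM_CCOEFF_NORMED scores and ground-truth match labels.
scores = np.array([0.95, 0.91, 0.83, 0.74, 0.60, 0.41, 0.33])
labels = np.array([1, 1, 1, 0, 1, 0, 0])

precision, recall, thresholds = precision_recall_curve(labels, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")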
