How do I speed up this BIT_COUNT query for hamming distance?

Question

I have a PHP script that checks the Hamming distance between two still photos taken from a security camera.

The table is in MySQL with 2.4M rows, and consists of a Key and four INT(10) columns. The INT(10) columns have been indexed individually, together, and together with the Key, but I have no significant evidence that any combination was faster than the others. I can try again if you suggest doing so.

The hashes are calculated by reducing the image to 8x16 pixels, which gives 128 bits, and each quarter of those bits (32 bits) is stored in its own column: pHash0, pHash1, pHash2 and pHash3.
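To make the column layout concrete, here is a minimal sketch in Python (not the PHP from the question) of how a 128-bit hash could be split into four 32-bit chunks and compared chunk by chunk. The function names and the most-significant-first chunk order are assumptions for illustration only.

```python
# Hypothetical helpers; names and chunk order are illustrative,
# not taken from the original script.

def split_hash(h128):
    """Split a 128-bit integer into four 32-bit chunks, most significant first."""
    return [(h128 >> shift) & 0xFFFFFFFF for shift in (96, 64, 32, 0)]

def hamming_chunks(a_chunks, b_chunks):
    """Sum the popcount of XOR per chunk, like BIT_COUNT(pHashN ^ ?) per column."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a_chunks, b_chunks))

a = 0x0123456789ABCDEF0123456789ABCDEF
b = a ^ (1 << 100) ^ 0b101  # flip one bit in the top chunk, two in the bottom

distance = hamming_chunks(split_hash(a), split_hash(b))
print(distance)
```

Summing the per-chunk counts gives the same result as counting the set bits of the XOR of the whole 128-bit values, which is why the four columns can be compared independently.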

I have written it two ways. The first uses nested derived tables. Theoretically, each derived table should have less data to check than its predecessor. The query is a prepared statement, and the ? fields are the pHash[0-3] values of the file I'm checking against.

Select
    `Key`,
    Bit_Count(T3.pHash3 ^ ?) + T3.BC2 As BC3
  From
    (Select
      *,
      Bit_Count(T2.pHash2 ^ ?) + T2.BC1 As BC2
    From
      (Select
        *,
        Bit_Count(T1.pHash1 ^ ?) + T1.BC0 As BC1
      From
        (Select
          `Key`,
          pHash0,
          pHash1,
          pHash2,
          pHash3,
          Bit_Count(pHash0 ^ ?) As BC0
        From
          files
        Where
          Not pHash0 Is Null And
          Bit_Count(pHash0 ^ ?) < 4) As T1
      Where
        Bit_Count(T1.pHash1 ^ ?) + T1.BC0 < 4) As T2
    Where
      Bit_Count(T2.pHash2 ^ ?) + T2.BC1 < 4) As T3
  Where
    Bit_Count(T3.pHash3 ^ ?) + T3.BC2 < 4

The second approach was a bit more direct. It just did all of the work at once.

Select
    `Key`
  From
    files
  Where
    Not pHash0 is null AND
    Bit_Count(pHash0 ^ ?) + Bit_Count(pHash1 ^ ?) +
    Bit_Count(pHash2 ^ ?) + Bit_Count(pHash3 ^ ?) < 4

The first query is faster on large recordsets, while the second is faster on smaller recordsets, but neither gets any faster than about 1-1/3 seconds per compare on 2.4M records.

Do you see a way of tweaking this process to make it faster? Any suggestions can be quickly tried, such as changing datatypes or indexes.

The setup is Win7x64, MySQL/5.6.6 and InnoDB, nginx/1.99, php-cgi/7.0.0 with zend enabled. The script is called from a webpage, and has buffering turned off for immediate feedback.

It might work better if I changed the four 32-bit integers to one binary(16), which would reduce the compares from four to one, but I'd also have to convert my four parameters to a single 128-bit string, which PHP won't do. If there were a fast way to combine them, it might squeeze a bit more time off.
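For what it's worth, combining four 32-bit values into one 16-byte string is a plain packing operation. The sketch below shows the idea in Python rather than PHP; the function names are hypothetical, and big-endian order is an assumption so that pHash0 ends up as the first four bytes.

```python
import struct

def pack_hash(chunks):
    """Pack four unsigned 32-bit integers into 16 big-endian bytes."""
    return struct.pack(">4I", *chunks)

def unpack_hash(blob):
    """Reverse of pack_hash: 16 bytes back into four unsigned 32-bit integers."""
    return list(struct.unpack(">4I", blob))

chunks = [0x01234567, 0x89ABCDEF, 0x00000000, 0xFFFFFFFF]
blob = pack_hash(chunks)
print(len(blob), blob.hex())
```

One caveat, as far as I know: before MySQL 8.0 the bit operators and BIT_COUNT work on 64-bit integers only, so a binary(16) column could not be XORed directly on 5.6, and whether a single column would help there is doubtful.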

EDIT: The accepted answer has increased the speed by about 500%. A quick synopsis of our hypothesis: for any match, the bit count of pHash "A" will always be within the bit count of pHash "B" +/- the Hamming distance.

Special thanks to @duskwuff for tenacity and patience. Cheers @duskwuff!

EDIT This was my most recent query:

Select
  files.`Key`, 
  Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
  Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3) as BC
  From
    files FORCE INDEX (bitcount)
  Where
    bitCount Between ? And ? 
  AND Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
  Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3) <= ?
  ORDER BY Bit_Count(? ^ pHash0) + Bit_Count(? ^ pHash1) +
  Bit_Count(? ^ pHash2) + Bit_Count(? ^ pHash3)

Each group of four "?" represents the four 32-bit hashes of the file being checked, the two "?" in the BETWEEN clause represent the pre-calculated bit count of that file minus and plus the desired Hamming distance, and the "?" after <= represents that Hamming distance. The ORDER BY clause is needed only to bring the closest matches to the top, so that adding a LIMIT 1 clause returns the best match. There is a B-TREE index on the bitcount field.
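Since the prepared statement repeats the hashes in the SELECT list, the WHERE clause and the ORDER BY, the placeholder order is easy to get wrong. The following is a rough sketch in Python (not the original PHP) of how the 15 parameters might be assembled; `build_params` is a hypothetical helper, and clamping the range to 0..128 is my own assumption.

```python
def popcount(x):
    """Number of set bits, the equivalent of MySQL's BIT_COUNT."""
    return bin(x).count("1")

def build_params(hashes, max_distance):
    """Return the parameters in the order the query's placeholders appear."""
    total = sum(popcount(h) for h in hashes)
    low = max(total - max_distance, 0)      # lower bound for bitCount
    high = min(total + max_distance, 128)   # upper bound for bitCount
    # SELECT hashes, BETWEEN low AND high, WHERE hashes, <= distance, ORDER BY hashes
    return [*hashes, low, high, *hashes, max_distance, *hashes]

params = build_params([0xFFFFFFFF, 0x00000000, 0x0000FFFF, 0x00000001], 3)
print(params[4:6])
```

With a driver that supports named parameters, each value would only need to be passed once.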

The bit counts of the 2.4 million files fall into a bell curve, with only 3 or 4 files at the extremes and about 70,000 at the center. Given a file with a bit count of 64 (the worst case), looking for files within a Hamming distance of 3 means comparing 20% of the files (490,000 in my case), whereas looking for a Hamming distance of 0 compares only 2.8% of the records (70,000, of course).

Recommended answer

Observe that BIT_COUNT(a ^ b) is bounded below by the absolute difference between BIT_COUNT(a) and BIT_COUNT(b). (That is, it is always at least equal to that difference, and may be greater.) If you precalculate the total bit count for each row, you can use it to rule out rows whose total bit count is too far away from your target's. Even better, you can create an index on that column, and that index will be used.
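The bound follows from the identity BIT_COUNT(a ^ b) = BIT_COUNT(a) + BIT_COUNT(b) - 2 * BIT_COUNT(a & b), because BIT_COUNT(a & b) can be no larger than the smaller of the two counts. A quick sanity check in Python, with illustrative helper names, might look like this:

```python
import random

def popcount(x):
    """Number of set bits, the equivalent of MySQL's BIT_COUNT."""
    return bin(x).count("1")

def bound_holds(a, b):
    """True if |popcount(a) - popcount(b)| <= popcount(a ^ b)."""
    return abs(popcount(a) - popcount(b)) <= popcount(a ^ b)

# Exhaustive check over every pair of 8-bit values
exhaustive_ok = all(bound_holds(a, b) for a in range(256) for b in range(256))

# Randomized check on 128-bit values, the size of the hashes in the question
rng = random.Random(0)
random_ok = all(
    bound_holds(rng.getrandbits(128), rng.getrandbits(128)) for _ in range(10_000)
)

print(exhaustive_ok, random_ok)
```

In practice this means any row whose total bit count differs from the target's by more than the allowed distance can be skipped without computing the XOR at all.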

What I have in mind is something along the lines of:

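-- Precomputed total number of set bits across all four hash columns
ALTER TABLE files ADD COLUMN totalbits INTEGER;
CREATE INDEX totalbits_index ON files (totalbits);

UPDATE files SET totalbits = BIT_COUNT(pHash0) + BIT_COUNT(pHash1)
                           + BIT_COUNT(pHash2) + BIT_COUNT(pHash3);

-- Range filter on totalbits first, then the exact Hamming-distance test
SELECT `Key` FROM files WHERE (totalbits BETWEEN … AND …) AND …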

Note that, with this in place, there's no need to split your hashes into four chunks. Combining them back into a single column would make things easier.
