快速汉明距离得分 [英] Fast Hamming distance scoring

查看:131
本文介绍了快速汉明距离得分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有一个具有N个固定长度字符串的数据库. 有一个相同长度的查询字符串. 问题是要从数据库中提取距k的汉明距离最短的前k个字符串.

There is a database with N fixed length strings. There is a query string of the same length. The problem is to fetch first k strings from the database that have the smallest Hamming distance to q.

N小(大约400),弦长,长度固定.数据库不会更改,因此我们可以预先计算索引.查询差异很大,无法选择缓存和/或预先计算.每秒有很多.即使k-1个结果匹配为0,我们也总是需要k个结果(按汉明距离排序并取第一个k,因此对位置敏感的哈希和类似方法将不起作用). kd-tree和类似的空间分区可能比线性搜索的效果更差(字符串可能很长). BK-tree是目前的最佳选择,但它仍然比所需的速度慢和复杂.

N is small (about 400), strings are long, fixed in length. Database doesn't change, so we can pre-compute indexes. Queries vary strongly, caching and/or pre-computation is not an option. There are lots of them per second. We need always k results, even if k-1 results have match 0 (sorting on Hamming distance and take first k, so locality sensitive hashing and similar approaches won't do). kd-tree and similar space partitioning will probably perform worser than linear search (strings can be very long). BK-tree is currently best choice, but it is still slow and complicated than it needs to be.

感觉好像有一种算法,它将建立一个索引,该索引将以非常少的步骤丢弃大部分条目,从而使k< = t<< N个条目可计算实际汉明距离.

It feels like there is an algorithm, which will build an index, which will discard most of the entries in very few steps, leaving k <= t << N entries to compute real Hamming distance.

人们建议根据Levenstein距离进行模糊字符串匹配-谢谢,但是问题要简单得多.基于广义距离度量的方法(如BK树)是很好的方法,但也许可以利用上述事实(小数据库/固定大小的长字符串,简单的汉明距离)

People suggesting fuzzy string matching based on Levenstein distance - thanks, but the problem is much simpler. Generalized distance metric based approaches (like BK-trees) are good, but maybe there something utilizing the facts described above (small DB/long fixed size strings, simple Hamming distance)

链接,关键字,论文,想法? =)

Links, keywords, papers, ideas? =)

推荐答案

这似乎是一项任务,其中 Vantage Point(VP树)可能会起作用...由于汉明距离应满足三角形不等式定理,因此您应该能够应用它...对于识别k最近的值也很有用.我已经在图像索引数据库设置中看到它了……您可以查看本文作为我正在谈论的内容的一个示例(尽管在不同的领域).

This seems like a task where a Vantage Point (VP tree) might work... since the hamming distance should satisfy the triangle inequality theorem, you should be able to apply it... its also good for identifying k-closest. I've seen it in image indexing database setups... you might check section 5 of this paper as an example of what I'm talking about (albeit in a different field).

这篇关于快速汉明距离得分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆