Package for fast determination of similarity between two bit sequences


Problem description

I need to compare a query bit sequence against a database of up to a million bit sequences. All bit sequences are 100 bits long, and I need the lookup to be as fast as possible. Are there any packages for quickly determining the similarity between two bit sequences?

Edit: the bit sequences are position-sensitive.

I have seen a possible algorithm on Bit Twiddling Hacks, but a ready-made package would be better.

Solution

If you want to look up the best-matching patterns (say, the 50 closest), and we can assume that the input data set is fairly static (or can be updated dynamically), you can repeat the initial phase of the previous comment:

• For every bit pattern, count the set bits.
• Store the bit patterns in a multimap keyed by that bit count (std::multimap if you use the STL; Java probably has something similar).
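
The preprocessing step can be sketched as follows. This is only an illustration, not a ready-made package: it assumes each 100-bit pattern is held as a Python integer, uses a dict of lists in place of the multimap, and the names popcount and build_index are made up for the example.

```python
from collections import defaultdict


def popcount(pattern: int) -> int:
    """Return the number of set bits in a non-negative integer."""
    return bin(pattern).count("1")


def build_index(patterns):
    """Group patterns by bit count; the dict of lists stands in for the multimap."""
    index = defaultdict(list)
    for p in patterns:
        index[popcount(p)].append(p)
    return index


# Tiny 4-bit example; real data would be 100-bit integers.
index = build_index([0b0000, 0b0011, 0b0101, 0b0111, 0b1111])
```

Because the index depends only on the database, it can be built once and reused for every query, which fits the assumption that the data set is fairly static.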

Then, use the following algorithm:

• Make two collections: one for storing the found patterns, and one for storing possibly good patterns (this second collection should probably be a map from distance to patterns).
• Take your own pattern and count its bits; call this N.
• Look in the multimap at index N. All these patterns have the same bit count, but are not necessarily identical.
• Compare all the patterns at index N. If a pattern is equal to yours, store it in the first collection. If it is not equal, store it in the second collection/map, using the distance between the two patterns as the key.
• Look in the multimap at index N-1. All these patterns will have a distance of 1 or more, because their bit counts already differ by 1.
• Compare all the patterns at index N-1. If they have a distance of 1, store them in the first collection. If they have a larger distance, store them in the second collection/map, using the distance as the key.
• Repeat for index N+1.
• Now look in the second collection/map and see if anything is stored with distance 1. If so, remove those patterns from the second collection/map and store them in the first collection.

Repeat this for distance 2 (indexes N-2 and N+2), distance 3, ... until you have enough patterns.
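
The search loop above could look roughly like this in Python. It is a sketch under assumptions the answer does not spell out: "distance" is taken to be the Hamming distance (number of differing bit positions), patterns are Python integers, and nearest, hamming, build_index and the width parameter are hypothetical names. Exact matches are routed through the second collection at distance 0 instead of being handled separately, which gives the same result.

```python
from collections import defaultdict


def popcount(x: int) -> int:
    return bin(x).count("1")


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return popcount(a ^ b)


def build_index(patterns):
    index = defaultdict(list)  # bit count -> patterns (stand-in for the multimap)
    for p in patterns:
        index[popcount(p)].append(p)
    return index


def nearest(index, query, wanted, width=100):
    """Return up to `wanted` patterns, closest first."""
    n = popcount(query)
    found = []                   # first collection: confirmed results
    pending = defaultdict(list)  # second collection: distance -> patterns
    for d in range(width + 1):
        # Patterns in buckets n-d and n+d are at distance d or more.
        for key in sorted({n - d, n + d}):
            for p in index.get(key, ()):
                pending[hamming(query, p)].append(p)
        # Every pattern at distance d has been seen by now, so these are final.
        found.extend(pending.pop(d, ()))
        if len(found) >= wanted:
            break
    return found[:wanted]


patterns = [0b0000, 0b0011, 0b0101, 0b0111, 0b1111]
index = build_index(patterns)
closest = nearest(index, 0b0001, 3, width=4)
print(closest)  # [0, 3, 5]
```

The early exit is safe because the difference in bit counts is a lower bound on the Hamming distance: a pattern in a bucket that has not been visited yet cannot be closer than the distance currently being confirmed.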

If the number of required patterns is not too large, and the average distance is also not too large, then the number of real comparisons between patterns is probably only a few percent of the total.

Unfortunately, since the bit counts of the patterns will follow a roughly Gaussian distribution, there will still be quite a few patterns to check. I didn't do a mathematical check on it, but in practice, if you don't want too many patterns out of the million, and the average distance is not too large, you should be able to find the set of closest patterns by checking only a few percent of the total bit patterns.
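
To get a feel for how many patterns fall into the visited buckets, one can estimate the fraction directly. The sketch below assumes the database patterns are uniformly random 100-bit values, which the question does not state, so the bit counts follow a binomial distribution; fraction_scanned is a made-up helper name.

```python
from math import comb


def fraction_scanned(n: int, radius: int, width: int = 100) -> float:
    """Fraction of uniformly random width-bit patterns whose bit count
    lies within `radius` of n, i.e. the share of the database compared."""
    lo = max(0, n - radius)
    hi = min(width, n + radius)
    return sum(comb(width, k) for k in range(lo, hi + 1)) / 2 ** width


# Compare a query with a typical bit count (50) to one with an unusual bit count (30).
for n in (50, 30):
    print(n, [fraction_scanned(n, r) for r in range(4)])
```

Under that assumption the buckets around 50 set bits are the fullest, so the saving should be largest for queries whose bit count is far from 50, or for real data that is more clustered than random data.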

Please keep me updated on your results.
