Is there an efficient algorithm for fuzzy deduplication of string lists?


Question

For example, I have a long list of strings, each about 30-50 characters, and I want to remove strings that are similar to some other string in the list (leaving only one occurrence from each family of duplicates).

I looked at various string similarity algorithms, for example Levenshtein distance and the method presented in this article. They do work, but it's painfully slow - the best algorithm I came up with exhibits O(n^2) complexity and takes ~1.5s to process a list of 3000 strings.

Is there some fast way to deduplicate those lists?

Answer

This problem occurs frequently when matching DNA strings (or re-assembling fragments). The first approach would be to split up the strings into k-mers: substrings of, say, 4 adjacent letters. So

abcdefgh

would become:

abcd + bcde + cdef + defg + efgh
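
For illustration, a minimal Python sketch of that k-mer splitting step (k = 4 as in the example; the function name kmers is just illustrative):

    def kmers(s, k=4):
        # All overlapping substrings of length k.
        return [s[i:i + k] for i in range(len(s) - k + 1)]

    print(kmers("abcdefgh"))  # ['abcd', 'bcde', 'cdef', 'defg', 'efgh']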

For the complete dictionary, these substrings can be entered into a hashtable, each carrying as a payload a list of the original strings (their numbers) that contain them (and possibly the offset where they can be found).
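
A minimal sketch of how such an index might be built in Python (the name build_index and its exact shape are assumptions, not something from the answer):

    from collections import defaultdict

    def build_index(strings, k=4):
        # Map each k-mer to a list of (string number, offset) pairs where it occurs.
        index = defaultdict(list)
        for num, s in enumerate(strings):
            for offset in range(len(s) - k + 1):
                index[s[offset:offset + k]].append((num, offset))
        return index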

To search, treat the string under test the same way as the dictionary strings, and look its fragments up in the hashtable. Now an exact hit will result in all five fragments being found, with the correct offsets. A partial hit will yield fewer than five fragments, but still with the correct offsets.

Of course a lot of false-negative hits will result from the search, but by combining (logical AND) the inverted-index lists, and by only selecting the hits at about the right index, things get unique pretty fast.
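
One way this lookup-and-combine step could look in Python, assuming the build_index sketch above; the min_hits threshold and the ±1 offset tolerance are illustrative parameters to tune, not part of the original answer:

    from collections import defaultdict

    def find_similar(query, index, k=4, min_hits=3):
        # Count, per candidate string, how many of the query's k-mers
        # occur in that string at about the right offset.
        hits = defaultdict(int)
        for offset in range(len(query) - k + 1):
            for num, found_at in index.get(query[offset:offset + k], ()):
                if abs(found_at - offset) <= 1:   # "at about the right index"
                    hits[num] += 1
        # Keep only candidates with enough agreeing fragments.
        return [num for num, count in hits.items() if count >= min_hits]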

For the problem size in the OP's question the running time would probably be a few (tens of) milliseconds.
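
Tying it back to the deduplication question, a possible end-to-end sketch (again assuming the find_similar helper above): strings are processed in order, and a string is kept only when no already-kept string is similar to it, so each family of near-duplicates contributes one representative. Each string is indexed once and each lookup only touches the posting lists of its own k-mers, which is what keeps the whole pass fast for a list of a few thousand short strings.

    def dedupe(strings, k=4, min_hits=3):
        index = defaultdict(list)   # incremental inverted index over kept strings
        kept = []
        for s in strings:
            if not find_similar(s, index, k, min_hits):
                # No earlier kept string is similar: keep s and index its k-mers.
                for offset in range(len(s) - k + 1):
                    index[s[offset:offset + k]].append((len(kept), offset))
                kept.append(s)
        return kept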

BTW: as a side effect of this method, substitutions will behave almost the same as indels. In the example they would spoil one (at the ends) to four (in the middle) k-mer matches. For larger strings this is not a problem; for small strings (like in the example) it is, and you could use smaller fragments.

Update: I just read the link, and it appears they use 2-mers, too (and throw some statistics at it).

