使用LSH进行近似字符串匹配 [英] Approximate String Matching using LSH

查看:614
本文介绍了使用LSH进行近似字符串匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用局部性敏感散列大致匹配字符串。我有许多可能包含拼写错误的大于10M的字符串。对于每个字符串,我想与所有其他字符串进行比较,并根据某个阈值选择具有编辑距离的字符串。

也就是说,天真的解决方案需要O(n ^ 2)比较。为了避免这个问题,我想使用局部敏感散列。然后接近类似的字符串将导致相同的桶,我只需要在桶搜索内进行。所以它是O(n * C),其中C是桶的大小。



然而,我不明白如何表示字符串。如果它是我在矢量空间中表示的文本。我的主要问题是,如果这是易于使用LSH,然后适当的矢量表示字符串。

我能够使用一个已经实现的库来完成此任务吗?或者这取决于我的问题,所以我必须自己实现它?有没有这样的python包?

解决方案

我在这个主题上找到的最好的学术资源是海量数据挖掘的第3章,它给出了一个关于局部敏感哈希和minhashing 。

非常简单地说,想法是采用几个字符串,对这些字符串进行矢量化,然后将滑动窗口传递给结果矢量。如果两个向量在同一个窗口位置具有相同的值,则将它们标记为候选进行更细粒度的相似性分析。



Python数据库库中有一个很好的实现 pip install datasketch )。下面是一个例子,说明你可以捕获模糊字符串的相似性:

  from datasketch import MinHash,MinHashLSH 
from nltk import ngrams

data = ['minhash是估计数据集之间相似性的概率数据结构',
'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
'权重控制最小化误报'之间的相对重要性,
'wfights控制最小化fflse posftive',
]

之间的rflative ifportance之间的相对重要性创建为Jaccard阈值0.5优化的MinHashLSH索引,
#接受具有128个排列函数的MinHash对象
lsh = MinHashLSH(threshold = 0.5,num_perm = 128)

#创建MinHash对象
minhashes = {}
for c,i in枚举(数据):
minhash = MinHash(num_perm = 128)
for ngrams(i,3):
minhash.update( .join(d).encode('utf-8'))
lsh.insert( (len(minhashes.keys())):
minhashes [c] = minhash

(len(minhashes.keys())):
result = lsh.query(minhashes [i ])
print具有Jaccard相似性的候选人> 0.5 for input,i,:,result

返回:

 具有Jaccard相似度的候选人>输入0的值为0.5:[0,1] 
具有Jaccard相似性的候选人>输入1的值为0.5: 0,1]
输入2的Jaccard相似度> 0.5的候选者:[2,3]
输入3的Jaccard相似度> 0.5的候选者:[2,3]


I would like to approximately match Strings using Locality sensitive hashing. I have many Strings>10M that may contain typos. For every String I would like to make a comparison with all the other strings and select those with an edit distance according to some threshold.

That is, the naive solution requires O(n^2) comparisons. In order to avoid that issue I was thinking of using Locality Sensitive Hashing. Then near similar strings would result to the same buckets and I need to do only inside bucket search. So it is O(n*C) where C is the bucket size.

However, I do not understand how to represent the strings. If it was text I would represented in vector space. My main question is if this is tractable using LSH and then an appropriate vector representation of the string.

Am I able to use an already implemented library for this task? or it depends on my problem so I must implement it myself? Is there any python package that does this?

解决方案

The best academic resource I've found on the subject is Chapter 3 of Mining of Massive Datasets, which gives an awesome overview of locality sensitive hashing and minhashing.

Very briefly, the idea is to take several strings, vectorize those strings, then pass a sliding window over the resulting vectors. If two vectors have the same value in the same window position, mark them as candidates for more fine-grained similarity analysis.

There's a great implementation in the Python datasketch library (pip install datasketch). Here's an example that shows you can catch fuzzy string similarity:

from datasketch import MinHash, MinHashLSH
from nltk import ngrams

data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
  'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
  'weights controls the relative importance between minizing false positive',
  'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]

# Create an MinHashLSH index optimized for Jaccard threshold 0.5,
# that accepts MinHash objects with 128 permutations functions
lsh = MinHashLSH(threshold=0.5, num_perm=128)

# Create MinHash objects
minhashes = {}
for c, i in enumerate(data):
  minhash = MinHash(num_perm=128)
  for d in ngrams(i, 3):
    minhash.update("".join(d).encode('utf-8'))
  lsh.insert(c, minhash)
  minhashes[c] = minhash

for i in xrange(len(minhashes.keys())):
  result = lsh.query(minhashes[i])
  print "Candidates with Jaccard similarity > 0.5 for input", i, ":", result

This returns:

Candidates with Jaccard similarity > 0.5 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.5 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.5 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.5 for input 3 : [2, 3]

这篇关于使用LSH进行近似字符串匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆