How to fuzzily search for a dictionary word?


Problem description

I have read a lot of threads here discussing edit-distance-based fuzzy searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y):

f(x, y) = 1, if characters x and y are similar
        = 0, otherwise

(These "similarities" can be specified by the programmer)

such that, say,

f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1

However,

f('a', 'z') = 0
etc.
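
To make the relation concrete, here is a minimal Python sketch. The pair set below is just the hypothetical examples above (in practice the programmer supplies it), and I assume every character counts as similar to itself:

# Programmer-specified similar pairs (hypothetical examples from above).
SIMILAR = {('t', 'l'), ('a', 'o'), ('f', 't')}

def f(x, y):
    """Return 1 if characters x and y are similar, 0 otherwise."""
    if x == y:                                   # assume identity is similar
        return 1
    return 1 if (x, y) in SIMILAR or (y, x) in SIMILAR else 0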

Now if we have a query 'cofatyst', the algorithm should report the following matches:

('cot', 0)
('cat', 0)
('catalyst', 0)

where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and when a character has relatively few "similar" characters, its performance degrades exponentially as the number of similar characters per character grows. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not depend blindly on edit distance alone).

One thing to note is that in the wild, the dictionary is going to be really large.
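
For reference, the matching semantics above can be pinned down with a brute-force sketch (using the hypothetical f from earlier): slide each dictionary word across the query and accept it at index i if every aligned character pair is similar. This reproduces the expected matches for 'cofatyst', but its cost grows with total dictionary size times query length, which is precisely what a better algorithm would need to beat.

def fuzzy_search(dictionary, query):
    """Yield (word, start_index) for every dictionary word that matches
    the query character-by-character under f at some position."""
    for word in dictionary:
        for i in range(len(query) - len(word) + 1):
            if all(f(word[j], query[i + j]) for j in range(len(word))):
                yield (word, i)

# sorted(fuzzy_search({'cat', 'cot', 'catalyst'}, 'cofatyst'))
# -> [('cat', 0), ('catalyst', 0), ('cot', 0)]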

Recommended answer

I might try to use cosine similarity, using the position of each character as a feature and mapping the product between features with a match function based on your character relations.

Not very specific advice, I know, but I hope it helps you.

Edit: Expanded answer

With cosine similarity, you compute how similar two vectors are. In your case, the normalisation might not make sense. So what I would do is something very simple (I might be oversimplifying the problem): first, view the CxC matrix as a dependency matrix holding the probability that two characters are related (e.g., P('t' | 'l') = 1). This also allows partial dependencies, so you can differentiate between perfect and partial matches. After this, compute, for each position, the probability that the letters from the two words are not the same (using the complement of P(t_i, t_j)), and then aggregate the results with a sum.

It will count the number of terms that are different for a specific pair of words, and it allows you to define partial dependencies. Furthermore, the implementation is very simple and should scale well. That is why I am not sure whether I have misunderstood your question.
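
As a rough illustration of that idea (my reading of the answer, not a confirmed implementation): build a pair-to-probability map P from the character relation, then score two aligned strings by summing, per position, the probability that the characters are unrelated. A score of 0 is a perfect match; fractional entries in P would give the partial dependencies mentioned above.

# Relatedness probabilities; 0/1 here, but fractional values are allowed.
P = {pair: 1.0 for (x, y) in SIMILAR for pair in ((x, y), (y, x))}

def dissimilarity(word, window, P):
    """Sum of (1 - P[a, b]) over aligned positions; 0 means a perfect match."""
    return sum(1 - P.get((a, b), 1.0 if a == b else 0.0)
               for a, b in zip(word, window))

# dissimilarity('cat', 'cof', P) -> 0.0   (all three positions related)
# dissimilarity('cat', 'zzz', P) -> 3.0   (no position related)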
