有意义的Javascript模糊搜索 [英] Javascript fuzzy search that makes sense

查看:152
本文介绍了有意义的Javascript模糊搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一个模糊搜索JavaScript库来过滤数组。我尝试过使用 fuzzyset.js fuse.js ,但结果很糟糕(你可以尝试在链接页面上进行演示)。

I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages).

在做了一些阅读之后Levenshtein距离,它让我感觉不像用户在打字时所寻找的东西。对于那些不知道的人,系统会计算需要多少次插入删除替换才能使两个字符串匹配。

After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, the system calculates how many insertions, deletions, and substitutions are needed to make two strings match.

在Levenshtein-Demerau模型中修复的一个明显缺陷是 blub boob 被认为是同等的类似于灯泡(每个需要两次替换)。但很明显,灯泡 blub 更相似,而不是 boob ,我刚才提到的模型通过允许 transpositions

One obvious flaw, which is fixed in the Levenshtein-Demerau model, is that both blub and boob are considered equally similar to bulb (each requiring two substitutions). It is clear, however, that bulb is more similar to blub than boob is, and the model I just mentioned recognizes that by allowing for transpositions.

我想在文本完成的上下文中使用它,所以如果我有一个数组 ['国际','splint','tinder'] ,我的查询是 int ,我认为 international 的排名应该高于夹板,即使前者得分(更高=更差)为10而后者为3。

I want to use this in the context of text completion, so if I have an array ['international', 'splint', 'tinder'], and my query is int, I think international ought to rank more highly than splint, even though the former has a score (higher=worse) of 10 versus the latter's 3.

所以我正在寻找(并且将会如果它不存在则创建),是一个执行以下操作的库:

So what I'm looking for (and will create if it doesn't exist), is a library that does the following:


  • 权衡不同的文本操作

  • 每个操作的权重取决于它们出现在单词中的位置(早期操作比后期操作更昂贵)

  • 返回按相关性排序的结果列表

有没有人来过oss什么都这样?我意识到StackOverflow不是要求软件推荐的地方,但上面隐含的(不再是!)是:我正在考虑这个方法吗?

Has anyone come across anything like this? I realize that StackOverflow isn't the place to be asking for software recommendations, but implicit (not anymore!) in the above is: am I thinking about this the right way?

我找到了好文章(pdf)。一些注释和摘录:

I found a good paper (pdf) on the subject. Some notes and excerpts:


仿射编辑距离函数为插入或删除序列分配相对较低的成本

Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions

Monger-Elkan距离函数(Monge& Elkan 1996),它是Smith-Waterman距离函数的一个仿射变体(Durban et al.1998),具有特定的成本参数

the Monger-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters

Smith-Waterman距离(维基百科),Smith-Waterman算法不是查看总序列,而是比较所有可能长度的段,并优化相似性度量。这是n-gram方法。

For the Smith-Waterman distance (wikipedia), "Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure." It's the n-gram approach.


一个大致相似的指标,不是基于编辑距离模型,是
Jaro metric(Jaro 1995; 1989; Winkler
1999)。在记录链接文献中,使用此方法的变体获得了良好的结果,该方法基于两个字符串之间的共同字符的数量和顺序。

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings.

A Winkler(1999)的变体也使用最长公共前缀的长度P

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix

(似乎主要用于短字符串)

(seem to be intended primarily for short strings)

对于文本完成目的,Monger-Elkan和Jaro-Winkler方法似乎最有意义。 Winkler对Jaro指标的补充有效地加重了单词的开头。而Monger-Elkan的仿射方面意味着完成一个单词的必要性(这只是一系列的补充)不会太过不喜欢它。

For text completion purposes, the Monger-Elkan and Jaro-Winkler approaches seem to make the most sense. Winkler's addition to the Jaro metric effectively weights the beginnings of words more heavily. And the affine aspect of Monger-Elkan means that the necessity to complete a word (which is simply a sequence of additions) won't disfavor it too heavily.

结论:


TFIDF
排名在几个中表现最佳基于令牌的距离
指标,Monge和Elkan提出的调整的仿射间隙编辑距离指标在几个
字符串编辑距离指标中表现最佳。一个令人惊讶的好距离
metric是一个快速的启发式方案,由Jaro提出,后来由Winkler扩展。
这几乎和Monge-Elkan方案一样好,但是
要快一个数量级。
组合TFIDF方法和
Jaro-Winkler的一种简单方法是将
TFIDF中使用的确切令牌匹配替换为基于Jaro-
Winkler方案的近似令牌匹配。这种组合平均比Jaro-Winkler或TFIDF略好,并且偶尔表现得更好。它与本文考虑的几个最佳指标
的学习组合的表现也很接近。

the TFIDF ranking performed best among several token-based distance metrics, and a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro and later extended by Winkler. This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. One simple way of combining the TFIDF method and the Jaro-Winkler is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro- Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.


推荐答案

好问题!但我的想法是,不是试图修改Levenshtein-Demerau,你可能最好尝试不同的算法或组合/加权两种算法的结果。

Good question! But my thought is that, rather than trying to modify Levenshtein-Demerau, you might be better to try a different algorithm or combine/ weight the results from two algorithms.

It让我觉得精确或接近开始前缀的匹配是Levenshtein-Demerau没有给予特别重视的东西 - 但是你明显的用户期望会。

It strikes me that exact or close matches to the "starting prefix" are something Levenshtein-Demerau gives no particular weight to -- but your apparent user expectations would.

我搜索了比Levenshtein更好,除此之外,发现了这个:

I searched for "better than Levenshtein" and, among other things, found this:

http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/

这个提到了一些字符串距离措施。三个看起来与您的要求特别相关的是:

This mentions a number of "string distance" measures. Three which looked particularly relevant to your requirement, would be:


  1. 最长公共子串距离最小值两个字符串中必须删除的符号数,直到生成的子字符串相同。

  1. Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.

q-gram距离:绝对值之和两个字符串的N-gram向量之间的差异。

q-gram distance: Sum of absolute differences between N-gram vectors of both strings.

Jaccard距离: 1分摊共享N-gram的商数和所有观察到的商数N-gram。

Jaccard distance: 1 minues the quotient of shared N-grams and all observed N-grams.

也许你可以使用Levenshtein的这些指标的加权组合(或最小值) - 常见的子字符串,普通的N-gram或Jaccard都非常喜欢相似的字符串 - 或者只是尝试使用Jaccard?

Maybe you could use a weighted combination (or minimum) of these metrics, with Levenshtein -- common substring, common N-gram or Jaccard will all strongly prefer similar strings -- or perhaps try just using Jaccard?

取决于大小在您的列表/数据库中,这些算法可能会非常昂贵。对于我实现的模糊搜索,我使用可配置数量的N-gram作为数据库中的检索键然后运行昂贵的字符串距离度量以按优先顺序对它们进行排序。

Depending on the size of your list/ database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" from the DB then ran the expensive string-distance measure to sort them in preference order.

我在SQL中写了一些关于模糊字符串搜索的注释。请参阅:

I wrote some notes on Fuzzy String Search in SQL. See:

  • http://literatejava.com/sql/fuzzy-string-search-sql/

这篇关于有意义的Javascript模糊搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆