有意义的Javascript模糊搜索 [英] Javascript fuzzy search that makes sense

查看：152 发布时间：2019/1/22 21:26:21 javascript regex pattern-matching string-matching fuzzy-search

本文介绍了有意义的Javascript模糊搜索的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在寻找一个模糊搜索JavaScript库来过滤数组。我尝试过使用 fuzzyset.js 和 fuse.js ，但结果很糟糕（你可以尝试在链接页面上进行演示）。

I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages).

在做了一些阅读之后Levenshtein距离，它让我感觉不像用户在打字时所寻找的东西。对于那些不知道的人，系统会计算需要多少次插入，删除和替换才能使两个字符串匹配。

After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, the system calculates how many insertions, deletions, and substitutions are needed to make two strings match.

在Levenshtein-Demerau模型中修复的一个明显缺陷是 blub 和 boob 被认为是同等的类似于灯泡（每个需要两次替换）。但很明显，灯泡与 blub 更相似，而不是 boob ，我刚才提到的模型通过允许 transpositions 。

One obvious flaw, which is fixed in the Levenshtein-Demerau model, is that both blub and boob are considered equally similar to bulb (each requiring two substitutions). It is clear, however, that bulb is more similar to blub than boob is, and the model I just mentioned recognizes that by allowing for transpositions.

我想在文本完成的上下文中使用它，所以如果我有一个数组 ['国际'，'splint'，'tinder'] ，我的查询是 int ，我认为 international 的排名应该高于夹板，即使前者得分（更高=更差）为10而后者为3。

I want to use this in the context of text completion, so if I have an array ['international', 'splint', 'tinder'], and my query is int, I think international ought to rank more highly than splint, even though the former has a score (higher=worse) of 10 versus the latter's 3.

所以我正在寻找（并且将会如果它不存在则创建），是一个执行以下操作的库：

So what I'm looking for (and will create if it doesn't exist), is a library that does the following:

权衡不同的文本操作

每个操作的权重取决于它们出现在单词中的位置（早期操作比后期操作更昂贵）

返回按相关性排序的结果列表

有没有人来过oss什么都这样？我意识到StackOverflow不是要求软件推荐的地方，但上面隐含的（不再是！）是：我正在考虑这个方法吗？

Has anyone come across anything like this? I realize that StackOverflow isn't the place to be asking for software recommendations, but implicit (not anymore!) in the above is: am I thinking about this the right way?

我找到了好文章（pdf）。一些注释和摘录：

I found a good paper (pdf) on the subject. Some notes and excerpts:

仿射编辑距离函数为插入或删除序列分配相对较低的成本

Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions

Monger-Elkan距离函数（Monge& Elkan 1996），它是Smith-Waterman距离函数的一个仿射变体（Durban et al.1998），具有特定的成本参数

the Monger-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters

Smith-Waterman距离（维基百科），Smith-Waterman算法不是查看总序列，而是比较所有可能长度的段，并优化相似性度量。这是n-gram方法。

For the Smith-Waterman distance (wikipedia), "Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure." It's the n-gram approach.

一个大致相似的指标，不是基于编辑距离模型，是
Jaro metric（Jaro 1995; 1989; Winkler
1999）。在记录链接文献中，使用此方法的变体获得了良好的结果，该方法基于两个字符串之间的共同字符的数量和顺序。

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings.

A Winkler（1999）的变体也使用最长公共前缀的长度P

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix

（似乎主要用于短字符串）

(seem to be intended primarily for short strings)

对于文本完成目的，Monger-Elkan和Jaro-Winkler方法似乎最有意义。 Winkler对Jaro指标的补充有效地加重了单词的开头。而Monger-Elkan的仿射方面意味着完成一个单词的必要性（这只是一系列的补充）不会太过不喜欢它。

For text completion purposes, the Monger-Elkan and Jaro-Winkler approaches seem to make the most sense. Winkler's addition to the Jaro metric effectively weights the beginnings of words more heavily. And the affine aspect of Monger-Elkan means that the necessity to complete a word (which is simply a sequence of additions) won't disfavor it too heavily.

结论：

TFIDF
排名在几个中表现最佳基于令牌的距离
指标，Monge和Elkan提出的调整的仿射间隙编辑距离指标在几个
字符串编辑距离指标中表现最佳。一个令人惊讶的好距离
metric是一个快速的启发式方案，由Jaro提出，后来由Winkler扩展。
这几乎和Monge-Elkan方案一样好，但是
要快一个数量级。
组合TFIDF方法和
Jaro-Winkler的一种简单方法是将
TFIDF中使用的确切令牌匹配替换为基于Jaro-
Winkler方案的近似令牌匹配。这种组合平均比Jaro-Winkler或TFIDF略好，并且偶尔表现得更好。它与本文考虑的几个最佳指标
的学习组合的表现也很接近。

the TFIDF ranking performed best among several token-based distance metrics, and a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro and later extended by Winkler. This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. One simple way of combining the TFIDF method and the Jaro-Winkler is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro- Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.

有意义的Javascript模糊搜索 [英] Javascript fuzzy search that makes sense

问题描述

推荐答案

相关文章

前端开发最新文章

热门教程

热门工具

登录关闭

有意义的Javascript模糊搜索 [英] Javascript fuzzy search that makes sense

问题描述

推荐答案

相关文章

前端开发最新文章

热门教程

热门工具

登录 关闭

登录关闭