How to efficiently find similar strings in a list of unique strings in JavaScript?
Problem description
Background: I have a list of 13,000 records of human names. Some of them are duplicates, and I want to find the similar ones so that I can deduplicate them manually.
For an array like this:
["jeff","Jeff","mandy","king","queen"]
What would be an efficient way to get:
[["jeff","Jeff"]]
Explanation: ["jeff","Jeff"] is returned because their Levenshtein distance is 1 (the threshold should be configurable, e.g. 3).
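/*
 Working but slow solution: every name is compared with every later name.
 isInLevenshteinRange(a, b, max) is assumed to be defined elsewhere and to
 return true when the Levenshtein distance between a and b is at most max.
*/
function extractSimilarNames(uniqueNames) {
  const similarNamesGroup = [];
  const grouped = new Set(); // indices that already belong to a group
  for (let i = 0; i < uniqueNames.length; i++) {
    if (grouped.has(i)) continue;
    const currentName = uniqueNames[i];
    const suspiciousNames = [];
    // compare with the rest of the array
    for (let j = i + 1; j < uniqueNames.length; j++) {
      if (grouped.has(j)) continue;
      const matchingName = uniqueNames[j];
      if (isInLevenshteinRange(currentName, matchingName, 1)) {
        suspiciousNames.push(matchingName);
        grouped.add(j); // skip this name later instead of mutating the array
      }
    }
    if (suspiciousNames.length > 0) {
      // the original never stored the group, so the result was always empty
      similarNamesGroup.push([currentName, ...suspiciousNames]);
    }
  }
  return similarNamesGroup;
}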
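The helper isInLevenshteinRange is not shown in the question. A minimal sketch of what it might look like, assuming it means "Levenshtein distance is at most maxDistance" (the body below is an illustration, not the asker's implementation):

```javascript
// Hypothetical helper: true when the Levenshtein distance between a and b
// is at most maxDistance. Not the asker's actual implementation.
function isInLevenshteinRange(a, b, maxDistance) {
  // The distance is at least the length difference, so bail out early.
  if (Math.abs(a.length - b.length) > maxDistance) return false;

  // Two-row dynamic programming over the edit-distance table.
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    let rowMinimum = i;
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,                    // deletion
        current[j - 1] + 1,                 // insertion
        previous[j - 1] + substitutionCost  // substitution or match
      );
      rowMinimum = Math.min(rowMinimum, current[j]);
    }
    // The final distance can never be smaller than the minimum of any row,
    // so once a whole row exceeds the limit we can stop.
    if (rowMinimum > maxDistance) return false;
    previous = current;
  }
  return previous[b.length] <= maxDistance;
}
```

Stopping early keeps each individual check cheap for small thresholds, but it does not change the number of pairs that have to be checked.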
I want to find similar names via Levenshtein distance, not only names that differ in upper/lower case.
I have already found one of the fastest Levenshtein implementations, but it still takes 35 minutes to get the result for a list of 13,000 items.
Recommended answer
Your problem is not the speed of the Levenshtein distance implementation. Your problem is that you have to compare each word with every other word. This means you make on the order of 13,000² comparisons (about 84 million distinct pairs) and calculate the Levenshtein distance each time.
So my approach would be to try to reduce the number of comparisons.
Here are some ideas:
- Words are only similar if their lengths differ by less than 20% (just my estimate).
  → We can group by length and only compare words with other words of length ±20%.
- Words are only similar if they share a lot of letters.
  → We can create a list of e.g. 3-grams (all lower case) that refer to the words they are part of.
  → Only compare (e.g. with Levenshtein distance) a word with other words that have several 3-grams in common with it.
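The two ideas above can be combined into one candidate filter. The following is a minimal sketch, not a tuned implementation: the names trigrams and findSimilarPairs, the isClose callback, and the default thresholds minSharedGrams and maxLengthRatio are all assumptions made for illustration.

```javascript
// Lowercased character 3-grams of a word; words shorter than 3 characters
// yield the whole word as their only gram.
function trigrams(word) {
  const lower = word.toLowerCase();
  if (lower.length < 3) return new Set([lower]);
  const grams = new Set();
  for (let i = 0; i + 3 <= lower.length; i++) grams.add(lower.slice(i, i + 3));
  return grams;
}

// Returns [i, j] index pairs (i < j) that pass both cheap filters and then
// the expensive check isClose(a, b). The default thresholds are illustrative.
function findSimilarPairs(names, isClose, minSharedGrams = 1, maxLengthRatio = 0.2) {
  // Inverted index: 3-gram -> indices of the names that contain it.
  const index = new Map();
  names.forEach((name, i) => {
    for (const gram of trigrams(name)) {
      if (!index.has(gram)) index.set(gram, []);
      index.get(gram).push(i);
    }
  });

  const pairs = [];
  names.forEach((name, i) => {
    // Count shared 3-grams with every later name that shares at least one.
    const shared = new Map();
    for (const gram of trigrams(name)) {
      for (const j of index.get(gram)) {
        if (j > i) shared.set(j, (shared.get(j) || 0) + 1);
      }
    }
    for (const [j, count] of shared) {
      const other = names[j];
      const longer = Math.max(name.length, other.length);
      const lengthGap = Math.abs(name.length - other.length);
      // Cheap filters first; the expensive distance check runs last.
      if (count >= minSharedGrams && lengthGap <= maxLengthRatio * longer && isClose(name, other)) {
        pairs.push([i, j]);
      }
    }
  });
  return pairs;
}
```

It could be called as findSimilarPairs(names, (a, b) => isInLevenshteinRange(a, b, 1)), passing the distance check from the question as the callback. Both filters are heuristics and can drop true matches, especially for short names: "jeff" and "jelf" have a Levenshtein distance of 1 but share no 3-gram, and "jeff" and "jef" differ in length by more than 20%. Using 2-grams or looser thresholds trades speed for recall.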