How to efficiently find similar strings in a unique string in JavaScript?

Question

Background: I have a list of 13,000 human names. Some of them are duplicates, and I want to find the similar ones so that I can deduplicate them manually.

For an array like this:

["jeff","Jeff","mandy","king","queen"] 

What would be an efficient way to get:

[["jeff","Jeff"]]

Explanation: ["jeff","Jeff"] are grouped because their Levenshtein distance is 1 (the threshold is variable and could be, say, 3).

/*
Working but slow solution: every name is compared with every later name.
isInLevenshteinRange(a, b, max) is a helper defined elsewhere; it is assumed
to return true when the Levenshtein distance between a and b is at most max.
*/
function extractSimilarNames(uniqueNames) {
  const similarNamesGroup = [];
  // indexes of names that already belong to a group
  const grouped = new Set();

  for (let i = 0; i < uniqueNames.length; i++) {
    if (grouped.has(i)) continue;
    const currentName = uniqueNames[i];
    const suspiciousNames = [];

    // compare with the rest of the array
    for (let j = i + 1; j < uniqueNames.length; j++) {
      if (grouped.has(j)) continue;
      const matchingName = uniqueNames[j];
      if (isInLevenshteinRange(currentName, matchingName, 1)) {
        suspiciousNames.push(matchingName);
        grouped.add(j);
      }
    }
    if (suspiciousNames.length > 0) {
      // collect the group instead of discarding it
      similarNamesGroup.push([currentName, ...suspiciousNames]);
    }
  }
  return similarNamesGroup;
}
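
The isInLevenshteinRange helper is not shown in the question. As a minimal sketch of what it might look like, assuming it should return true when the distance is at most the given limit, the version below uses the standard two-row dynamic programming table and stops early once the limit can no longer be met. The body is an illustration, not the implementation the question refers to.

```javascript
// Hypothetical stand-in for the isInLevenshteinRange helper used above.
// Returns true when the Levenshtein distance between a and b is <= maxDistance.
function isInLevenshteinRange(a, b, maxDistance) {
  // The length difference is a lower bound on the edit distance.
  if (Math.abs(a.length - b.length) > maxDistance) return false;

  // Two-row dynamic programming over UTF-16 code units.
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    let rowMinimum = i;
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + substitutionCost // substitution
      );
      rowMinimum = Math.min(rowMinimum, current[j]);
    }
    // Row minima never decrease, so the final distance cannot come back under the limit.
    if (rowMinimum > maxDistance) return false;
    previous = current;
  }
  return previous[b.length] <= maxDistance;
}
```

The early exit keeps each call cheap for dissimilar names, but it does not reduce the number of pairs that have to be checked.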

I want to find similarity via Levenshtein distance, not only lowercase/uppercase similarity.

I have already found one of the fastest Levenshtein implementations, but it still takes 35 minutes to get the result for a list of 13,000 items.

Recommended answer

Your problem is not the speed of the Levenshtein distance implementation. Your problem is that you have to compare each word with every other word. This means you make on the order of 13000² comparisons (and calculate the Levenshtein distance each time).

So my approach would be to try to reduce the number of comparisons.

Here are some ideas:

  • words are only similar if their lengths differ by less than 20% (just my estimate)
    → we can group by length and only compare words with other words of length ±20%

  • words are only similar if they share a lot of letters
    → we can create a list of e.g. 3-grams (all lower case) that refer to the words they are part of
    → only compare (e.g. with Levenshtein distance) a word with other words that have several 3-grams in common with it
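
The two ideas above can be combined into one candidate-generation step. What follows is only a sketch: the function names, the `minSharedGrams` parameter, and the `maxLengthRatio` default of 0.2 (taken from the 20% estimate) are assumptions for illustration. It builds an inverted index from 3-grams to names and returns only the pairs that share enough 3-grams and have similar lengths, so that the expensive Levenshtein check runs on far fewer pairs.

```javascript
// All names here are hypothetical; only the filtering ideas come from the answer.

// Lower-cased 3-grams of a word, e.g. "Jeff" -> {"jef", "eff"}.
function trigrams(word) {
  const lower = word.toLowerCase();
  const grams = new Set();
  for (let i = 0; i + 3 <= lower.length; i++) {
    grams.add(lower.slice(i, i + 3));
  }
  return grams;
}

// Returns index pairs [i, j] (i < j) that pass both cheap filters.
function findCandidatePairs(names, minSharedGrams = 1, maxLengthRatio = 0.2) {
  // Inverted index: 3-gram -> indexes of the names containing it.
  const index = new Map();
  names.forEach((name, i) => {
    for (const gram of trigrams(name)) {
      if (!index.has(gram)) index.set(gram, []);
      index.get(gram).push(i);
    }
  });

  const pairs = [];
  names.forEach((name, i) => {
    // Count shared 3-grams with every later name that shares at least one.
    const shared = new Map();
    for (const gram of trigrams(name)) {
      for (const j of index.get(gram)) {
        if (j > i) shared.set(j, (shared.get(j) || 0) + 1);
      }
    }
    for (const [j, count] of shared) {
      const longer = Math.max(name.length, names[j].length);
      const lengthGap = Math.abs(name.length - names[j].length);
      if (count >= minSharedGrams && lengthGap <= maxLengthRatio * longer) {
        pairs.push([i, j]);
      }
    }
  });
  return pairs;
}

const names = ["jeff", "Jeff", "mandy", "king", "queen"];
console.log(findCandidatePairs(names).map(([i, j]) => [names[i], names[j]]));
// prints [ [ 'jeff', 'Jeff' ] ]
```

Each surviving pair would then be confirmed with the Levenshtein check. Both filters are heuristics and can drop true matches: "jeff" and "jaff" share no 3-gram, names shorter than three characters produce none at all, and a 20% length limit rejects "jeff" versus "jef". The thresholds need tuning against the real data.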
