How can I create a threshold for similar strings using Levenshtein distance and account for typos?

Question

We recently ran into an interesting problem at work: we discovered duplicate user-submitted data in our database. We realized that, for most of this data, the Levenshtein distance was simply the difference in length between the two strings in question. That indicates that if we simply add characters from one string to the other, we end up with the same string, and for most cases this seems like the best way for us to identify duplicate items.

We also want to account for typos. So we started thinking about how often, on average, people make typos online per word, and about using that figure in the distance threshold. We could not find any such statistic.

Is there any way to account for typos when creating this sort of threshold for matching data?

Let me know if I can clarify!

Recommended answer

First off, Levenshtein distance is defined as the minimum number of edits required to transform string A into string B, where an edit is the insertion or deletion of a single character, or the replacement of one character with another. So it's very much the "difference between two strings", for a certain definition of distance. =)
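As a rough illustration of that definition, here is a minimal sketch of the standard dynamic-programming computation in Python. The function name levenshtein is chosen here for illustration and is not taken from the answer; unit cost for every edit is assumed.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b (unit cost per edit)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as the row
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the shorter string while the running time stays proportional to len(A)*len(B).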

It sounds like you're looking for a distance function F(A, B) that gives the distance between strings A and B, plus a threshold N such that strings less than N apart are candidates for being typos of each other. In addition to Levenshtein distance, you might also consider Needleman–Wunsch. It's basically the same thing, but it lets you supply a function for how close a given character is to another character. You could use that algorithm with a set of weights that reflect the positions of keys on a QWERTY keyboard to do a pretty good job of finding typos. This would have issues with international keyboards, though.
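The keyboard-weight idea can be sketched as a weighted edit distance. This is not a full Needleman–Wunsch alignment, only the same dynamic-programming recurrence with a pluggable substitution cost. Everything specific here is an assumption made for illustration: the names QWERTY_ROWS, substitution_cost and weighted_distance, the half-key row offset, and the 0.5 cost for neighbouring keys.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Assumed geometry: each row sits half a key to the right of the row above it.
KEY_POSITIONS = {
    key: (row, col + 0.5 * row)
    for row, keys in enumerate(QWERTY_ROWS)
    for col, key in enumerate(keys)
}


def substitution_cost(x: str, y: str) -> float:
    """0 for identical characters, 0.5 for neighbouring keys, 1 otherwise.
    The 0.5 weight is an arbitrary illustrative choice."""
    if x == y:
        return 0.0
    pos_x, pos_y = KEY_POSITIONS.get(x), KEY_POSITIONS.get(y)
    if pos_x is None or pos_y is None:
        return 1.0  # characters outside the layout table get full cost
    row_gap = abs(pos_x[0] - pos_y[0])
    col_gap = abs(pos_x[1] - pos_y[1])
    return 0.5 if row_gap <= 1 and col_gap <= 1 else 1.0


def weighted_distance(a: str, b: str, gap_cost: float = 1.0) -> float:
    """Edit distance where substitutions are priced by substitution_cost."""
    previous = [j * gap_cost for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        current = [i * gap_cost]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + gap_cost,     # delete ch_a
                current[j - 1] + gap_cost,  # insert ch_b
                previous[j - 1] + substitution_cost(ch_a, ch_b),
            ))
        previous = current
    return previous[-1]


print(weighted_distance("hello", "hrllo"))  # 'e' and 'r' are neighbours: 0.5
print(weighted_distance("hello", "hpllo"))  # 'e' and 'p' are far apart: 1.0
```

With weights like these, a slip onto an adjacent key counts as a smaller difference than an arbitrary substitution, so a threshold N can be set tighter without losing likely typos. The layout table would need replacing for non-QWERTY keyboards.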

If you have k strings and you want to find potential typos, the number of comparisons you need to make is O(k^2). In addition, each comparison is O(len(A)*len(B)). So if you have a million strings, you're going to find yourself in trouble if you do things naively. Here are a few suggestions on how to speed things up:

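  • Apologies if this is obvious, but Levenshtein distance is symmetric, so make sure you aren't computing both F(A, B) and F(B, A).
  • abs(len(A) - len(B)) is a lower bound on the distance between strings A and B, so you can skip comparing strings whose lengths differ by too much.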
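Both suggestions can be sketched together in Python. The function find_candidate_pairs and its max_distance parameter are hypothetical names; the sketch assumes unit-cost Levenshtein distance and treats a distance less than or equal to max_distance as a match.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def find_candidate_pairs(strings, max_distance):
    """Return (a, b, distance) for every pair within max_distance."""
    pairs = []
    # combinations() yields each unordered pair once, so F(A, B) is
    # never recomputed as F(B, A).
    for a, b in combinations(strings, 2):
        # The length difference is a lower bound on the distance, so
        # pairs that already exceed the threshold are skipped cheaply.
        if abs(len(a) - len(b)) > max_distance:
            continue
        distance = levenshtein(a, b)
        if distance <= max_distance:
            pairs.append((a, b, distance))
    return pairs


names = ["jon smith", "john smith", "jane doe", "jon smyth"]
for pair in find_candidate_pairs(names, max_distance=1):
    print(pair)
```

This is still O(k^2) pairs in the worst case. The length check only trims the constant factor, so for very large inputs it helps to sort or bucket the strings by length first, so that pairs with very different lengths are never generated at all.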

One issue you might run into is that "1st St." has a pretty high distance from "First Street", even though you probably want to consider those to be identical. The easiest way to handle this is probably to transform strings into a canonical form before doing the comparisons. So you might make all strings lowercase, use a dictionary that maps "1st" to "first", and so on. That dictionary might get pretty big, but I don't know a better way to deal with this issue.
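A minimal sketch of such a canonicalization step in Python follows. The mapping CANONICAL_TOKENS and the function canonicalize are hypothetical and deliberately tiny; a real mapping would have to be built from your own data.

```python
import re

# Hypothetical, deliberately small mapping. Note that abbreviations can
# be ambiguous ("st" may also mean "saint"), which this sketch ignores.
CANONICAL_TOKENS = {
    "1st": "first",
    "2nd": "second",
    "3rd": "third",
    "st": "street",
    "ave": "avenue",
}


def canonicalize(text: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(CANONICAL_TOKENS.get(token, token) for token in tokens)


print(canonicalize("1st St."))       # first street
print(canonicalize("First Street"))  # first street
```

After this step both spellings are identical, so their distance is 0 and any remaining distance between canonical forms is more likely to reflect a genuine typo.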

Since you tagged this question with php, I'm assuming you want to use PHP for this. PHP has a built-in levenshtein() function, but before PHP 8.0 both strings have to be 255 characters or less. If that's not long enough, you'll have to write your own. Alternatively, you could investigate using Python's difflib.
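For the difflib route, a short sketch might look like the following. Note that difflib does not compute Levenshtein distance: SequenceMatcher.ratio() returns a similarity between 0 and 1 based on matching blocks, so the threshold is a similarity cutoff and not an edit count. The helper is_similar and its default cutoff of 0.8 are assumptions for illustration.

```python
from difflib import SequenceMatcher, get_close_matches


def is_similar(a: str, b: str, cutoff: float = 0.8) -> bool:
    """True when difflib's similarity ratio reaches the cutoff.
    The 0.8 default is an arbitrary starting point, not a recommendation."""
    return SequenceMatcher(None, a, b).ratio() >= cutoff


# get_close_matches ranks candidates by the same ratio, best match first.
print(get_close_matches("appel", ["ape", "apple", "peach", "puppy"]))
# ['apple', 'ape']
```

Because the ratio is normalized by string length, one cutoff behaves more consistently across short and long strings than a fixed edit count does, which may suit a duplicate-detection threshold.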
