Getting the closest string match

Question

I need a way to compare multiple strings to a test string and return the string that most closely resembles it:

TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW

CHOICE A   : THE RED COW JUMPED OVER THE GREEN CHICKEN
CHOICE B   : THE RED COW JUMPED OVER THE RED COW
CHOICE C   : THE RED FOX JUMPED OVER THE BROWN COW

(If I did this correctly) The closest string to the "TEST STRING" should be "CHOICE C". What is the easiest way to do this?

I plan on implementing this in multiple languages, including VB.net, Lua, and JavaScript. At this point, pseudocode is acceptable. If you can provide an example for a specific language, that is appreciated too!

Answer

I was presented with this problem about a year ago when it came to looking up user-entered information about an oil rig in a database of miscellaneous information. The goal was to do some sort of fuzzy string search that could identify the database entry with the most common elements.

Part of the research involved implementing the Levenshtein distance algorithm, which determines how many changes must be made to a string or phrase to turn it into another string or phrase.

The implementation I came up with was relatively simple, and involved a weighted comparison of the length of the two phrases, the number of changes between each phrase, and whether each word could be found in the target entry.

The article is on a private site so I'll do my best to append the relevant contents here:

Fuzzy String Matching is the process of performing a human-like estimation of the similarity of two words or phrases. In many cases, it involves identifying words or phrases which are most similar to each other. This article describes an in-house solution to the fuzzy string matching problem and its usefulness in solving a variety of problems which can allow us to automate tasks which previously required tedious user involvement.

Introduction

The need to do fuzzy string matching originally came about while developing the Gulf of Mexico Validator tool. What existed was a database of known Gulf of Mexico oil rigs and platforms, and people buying insurance would give us some badly typed-out information about their assets, and we had to match it to the database of known platforms. When there was very little information given, the best we could do was to rely on an underwriter to "recognize" the one they were referring to and call up the proper information. This is where this automated solution comes in handy.

I spent a day researching methods of fuzzy string matching, and eventually stumbled upon the very useful Levenshtein distance algorithm on Wikipedia.

Implementation

After reading about the theory behind it, I implemented it and found ways to optimize it. This is what my code looks like in VBA:

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
    Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
    Dim i As Long, j As Long, cost As Long 'loop counters and cost of substitution for current letter
    Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
    L1 = Len(S1): L2 = Len(S2)
    ReDim D(0 To L1, 0 To L2)
    For i = 0 To L1: D(i, 0) = i: Next i
    For j = 0 To L2: D(0, j) = j: Next j

    For j = 1 To L2
        For i = 1 To L1
            cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
            cI = D(i - 1, j) + 1
            cD = D(i, j - 1) + 1
            cS = D(i - 1, j - 1) + cost
            If cI <= cD Then 'Insertion or Substitution
                If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
            Else 'Deletion or Substitution
                If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
            End If
        Next i
    Next j
    LevenshteinDistance = D(L1, L2)
End Function
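
For illustration, here is a minimal driver (a sketch only; the sub name is hypothetical, and the test data is taken from the question, not the original article) that prints the raw Levenshtein distance from the test string to each choice:

Sub DemoLevenshtein()
    'Print the Levenshtein distance from the test string to each choice
    Dim test As String, choices(1 To 3) As String, i As Long
    test = "THE BROWN FOX JUMPED OVER THE RED COW"
    choices(1) = "THE RED COW JUMPED OVER THE GREEN CHICKEN"
    choices(2) = "THE RED COW JUMPED OVER THE RED COW"
    choices(3) = "THE RED FOX JUMPED OVER THE BROWN COW"
    For i = 1 To 3
        Debug.Print "Choice " & i & ": " & LevenshteinDistance(test, choices(i))
    Next i
End Sub

The choice with the smallest distance is the closest match by this metric alone.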

Simple, speedy, and a very useful metric. Using this, I created two separate metrics for evaluating the similarity of two strings. One I call "valuePhrase" and one I call "valueWords". valuePhrase is just the Levenshtein distance between the two phrases, and valueWords splits the string into individual words, based on delimiters such as spaces, dashes, and anything else you'd like, and compares each word to each other word, summing up the shortest Levenshtein distance connecting any two words. Essentially, it measures whether the information in one 'phrase' is really contained in another, just as a word-wise permutation. I spent a few days as a side project coming up with the most efficient way possible of splitting a string based on delimiters.

valueWords, valuePhrase, and Split function:

Public Function valuePhrase#(ByRef S1$, ByRef S2$)
    valuePhrase = LevenshteinDistance(S1, S2)
End Function

Public Function valueWords#(ByRef S1$, ByRef S2$)
    'For every word in S1, find the smallest Levenshtein distance to any word
    'in S2 (words are split on space, underscore, and dash), and sum those minima
    Dim wordsS1$(), wordsS2$()
    wordsS1 = SplitMultiDelims(S1, " _-")
    wordsS2 = SplitMultiDelims(S2, " _-")
    Dim word1%, word2%, thisD#, wordbest#
    Dim wordsTotal#
    For word1 = LBound(wordsS1) To UBound(wordsS1)
        wordbest = Len(S2) 'cap the per-word cost at the length of S2
        For word2 = LBound(wordsS2) To UBound(wordsS2)
            thisD = LevenshteinDistance(wordsS1(word1), wordsS2(word2))
            If thisD < wordbest Then wordbest = thisD
            If thisD = 0 Then GoTo foundbest 'exact word match, no need to keep looking
        Next word2
foundbest:
        wordsTotal = wordsTotal + wordbest
    Next word1
    valueWords = wordsTotal
End Function

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' SplitMultiDelims
' This function splits Text into an array of substrings, each substring
' delimited by any character in DelimChars. Only a single character
' may be a delimiter between two substrings, but DelimChars may
' contain any number of delimiter characters. It returns a single element
' array containing all of text if DelimChars is empty, or a 1 or greater
' element array if the Text is successfully split into substrings.
' If IgnoreConsecutiveDelimiters is true, empty array elements will not occur.
' If Limit greater than 0, the function will only split Text into 'Limit'
' array elements or less. The last element will contain the rest of Text.
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function SplitMultiDelims(ByRef Text As String, ByRef DelimChars As String, _
        Optional ByVal IgnoreConsecutiveDelimiters As Boolean = False, _
        Optional ByVal Limit As Long = -1) As String()
    Dim ElemStart As Long, N As Long, M As Long, Elements As Long
    Dim lDelims As Long, lText As Long
    Dim Arr() As String

    lText = Len(Text)
    lDelims = Len(DelimChars)
    If lDelims = 0 Or lText = 0 Or Limit = 1 Then
        ReDim Arr(0 To 0)
        Arr(0) = Text
        SplitMultiDelims = Arr
        Exit Function
    End If
    ReDim Arr(0 To IIf(Limit = -1, lText - 1, Limit))

    Elements = 0: ElemStart = 1
    For N = 1 To lText
        If InStr(DelimChars, Mid(Text, N, 1)) Then
            Arr(Elements) = Mid(Text, ElemStart, N - ElemStart)
            If IgnoreConsecutiveDelimiters Then
                If Len(Arr(Elements)) > 0 Then Elements = Elements + 1
            Else
                Elements = Elements + 1
            End If
            ElemStart = N + 1
            If Elements + 1 = Limit Then Exit For
        End If
    Next N
    'Get the last token terminated by the end of the string into the array
    If ElemStart <= lText Then Arr(Elements) = Mid(Text, ElemStart)
    'Since the end of string counts as the terminating delimiter, if the last character
    'was also a delimiter, we treat the two as consecutive, and so ignore the last element
    If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) = 0 Then Elements = Elements - 1

    ReDim Preserve Arr(0 To Elements) 'Chop off unused array elements
    SplitMultiDelims = Arr
End Function
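
To see how the two metrics differ on the question's example, you can compare them side by side (again just a sketch; the sub name is hypothetical and the strings come from the question):

Sub DemoWordMetrics()
    Dim test As String, choiceC As String
    test = "THE BROWN FOX JUMPED OVER THE RED COW"
    choiceC = "THE RED FOX JUMPED OVER THE BROWN COW"
    'Whole-phrase Levenshtein distance
    Debug.Print "valuePhrase: " & valuePhrase(test, choiceC)
    'Sum of best per-word distances after splitting on space, underscore, and dash
    Debug.Print "valueWords:  " & valueWords(test, choiceC)
    'The splitter on its own
    Dim parts() As String, i As Long
    parts = SplitMultiDelims(test, " _-")
    For i = LBound(parts) To UBound(parts)
        Debug.Print parts(i)
    Next i
End Sub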

Measure of similarity

Using these two metrics, and a third which simply computes the distance between the two strings, I have a series of variables over which I can run an optimization algorithm to achieve the greatest number of matches. Fuzzy string matching is, itself, a fuzzy science, and so by creating linearly independent metrics for measuring string similarity, and having a known set of strings we wish to match to each other, we can find the parameters that, for our specific styles of strings, give the best fuzzy match results.

Initially, the goal of the metric was to have a low search value for an exact match, and increasing search values for increasingly permuted measures. In an impractical case, this was fairly easy to define using a set of well-defined permutations, and engineering the final formula so that they had increasing search-value results as desired.

As you can see, the last two metrics, which are fuzzy string matching metrics, already have a natural tendency to give low scores to strings that are meant to match (down the diagonal). This is very good.

Application

To allow the optimization of fuzzy matching, I weight each metric. As such, every application of fuzzy string matching can weight the parameters differently. The formula that defines the final score is a simple combination of the metrics and their weights:

value = Min(phraseWeight*phraseValue, wordsWeight*wordsValue)*minWeight
      + Max(phraseWeight*phraseValue, wordsWeight*wordsValue)*maxWeight
      + lengthWeight*lengthValue
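
Transcribed into VBA, the scoring function might look like the following. Note that lengthValue is not defined explicitly in the text; the absolute difference in string length used below is an assumption, as is the function name combinedValue:

'A sketch of the scoring formula above; weight parameters are passed in explicitly
Public Function combinedValue(ByRef S1 As String, ByRef S2 As String, _
        ByVal phraseWeight As Double, ByVal wordsWeight As Double, _
        ByVal minWeight As Double, ByVal maxWeight As Double, _
        ByVal lengthWeight As Double) As Double
    Dim phraseValue As Double, wordsValue As Double, lengthValue As Double
    Dim a As Double, b As Double
    phraseValue = valuePhrase(S1, S2)
    wordsValue = valueWords(S1, S2)
    lengthValue = Abs(Len(S1) - Len(S2)) 'assumed definition of the length metric
    a = phraseWeight * phraseValue
    b = wordsWeight * wordsValue
    combinedValue = IIf(a < b, a, b) * minWeight _
                  + IIf(a > b, a, b) * maxWeight _
                  + lengthWeight * lengthValue
End Function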

Using an optimization algorithm (a neural network is best here because it is a discrete, multi-dimensional problem), the goal is now to maximize the number of matches. I created a function that detects the number of correct matches of each set to each other, as can be seen in this final screenshot. A column or row gets a point if the lowest score is assigned to the string that was meant to be matched, and partial points are given if there is a tie for the lowest score and the correct match is among the tied strings. I then optimized it. You can see that a green cell is the column that best matches the current row, and a blue square around a cell marks the row that best matches the current column. The score in the bottom corner is roughly the number of successful matches, and this is what we tell our optimization problem to maximize.
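
The match-counting objective itself can be sketched roughly as follows (a simplified version that ignores ties; the function name and the convention that searches(i) is meant to match candidates(i) are assumptions, not part of the original article):

'Count how many search strings get their intended candidate as the lowest score
Public Function countCorrectMatches(ByRef searches() As String, ByRef candidates() As String, _
        ByVal phraseWeight As Double, ByVal wordsWeight As Double, _
        ByVal minWeight As Double, ByVal maxWeight As Double, _
        ByVal lengthWeight As Double) As Long
    Dim i As Long, j As Long, bestJ As Long, bestScore As Double, s As Double
    For i = LBound(searches) To UBound(searches)
        bestJ = LBound(candidates)
        bestScore = combinedValue(searches(i), candidates(bestJ), _
            phraseWeight, wordsWeight, minWeight, maxWeight, lengthWeight)
        For j = LBound(candidates) + 1 To UBound(candidates)
            s = combinedValue(searches(i), candidates(j), _
                phraseWeight, wordsWeight, minWeight, maxWeight, lengthWeight)
            If s < bestScore Then bestScore = s: bestJ = j
        Next j
        If bestJ = i Then countCorrectMatches = countCorrectMatches + 1
    Next i
End Function

Any optimizer, even a crude grid or random search over the five weights, can then try to maximize this count.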

The algorithm was a wonderful success, and the solution parameters say a lot about this type of problem. You'll notice the optimized score was 44, and the best possible score is 48. The 5 columns at the end are decoys, and do not have any match at all to the row values. The more decoys there are, the harder it will naturally be to find the best match.

In this particular matching case, the length of the strings is irrelevant, because we are expecting abbreviations that represent longer words, so the optimal weight for length is -0.3, which means we do not penalize strings which vary in length. We reduce the score in anticipation of these abbreviations, giving more room for partial word matches to supersede non-word matches that simply require fewer substitutions because the string is shorter.

The word weight is 1.0 while the phrase weight is only 0.5, which means that we penalize whole words missing from one string more heavily, and put more value on the entire phrase being intact. This is useful because a lot of these strings have one word in common (the peril), where what really matters is whether or not the combination (region and peril) is maintained.

Finally, the min weight is optimized at 10 and the max weight at 1. What this means is that if the better of the two scores (value phrase and value words) isn't very good, the match is greatly penalized, but we don't greatly penalize the worse of the two scores. Essentially, this puts emphasis on requiring either valueWords or valuePhrase to have a good score, but not both. A sort of "take what we can get" mentality.

It's really fascinating what the optimized values of these 5 weights say about the sort of fuzzy string matching taking place. For completely different practical cases of fuzzy string matching, these parameters are very different. I've used it for 3 separate applications so far.

While unused in the final optimization, a benchmarking sheet was established which matches columns to themselves for all perfect results down the diagonal, and lets the user change parameters to control the rate at which scores diverge from 0 and note innate similarities between search phrases (which could in theory be used to offset false positives in the results).

Further applications

This solution has the potential to be used anywhere the user wishes to have a computer system identify a string in a set of strings where there is no perfect match (like an approximate-match VLOOKUP for strings).
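
A minimal sketch of that idea, reusing the metrics above (the function name fuzzyLookup is hypothetical; the default weights are the ones quoted in the text and should be re-tuned for your own data):

'Return the candidate with the lowest combined fuzzy-match score
Public Function fuzzyLookup(ByVal lookupValue As String, ByRef candidates() As String, _
        Optional ByVal phraseWeight As Double = 0.5, Optional ByVal wordsWeight As Double = 1#, _
        Optional ByVal minWeight As Double = 10#, Optional ByVal maxWeight As Double = 1#, _
        Optional ByVal lengthWeight As Double = -0.3) As String
    Dim i As Long, s As Double, best As Double, haveBest As Boolean
    For i = LBound(candidates) To UBound(candidates)
        s = combinedValue(lookupValue, candidates(i), _
            phraseWeight, wordsWeight, minWeight, maxWeight, lengthWeight)
        If Not haveBest Or s < best Then
            best = s
            fuzzyLookup = candidates(i)
            haveBest = True
        End If
    Next i
End Function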

So what you should take from this is that you probably want to use a combination of high-level heuristics (finding words from one phrase in the other phrase, length of both phrases, etc.) along with an implementation of the Levenshtein distance algorithm. Because deciding which is the "best" match is a heuristic (fuzzy) determination, you'll have to come up with a set of weights for any metrics you devise to determine similarity.

With the appropriate set of heuristics and weights, you'll have your comparison program quickly making the decisions that you would have made.
