模糊字符串匹配优化(不检查某些单词)-Excel VBA函数 [英] Fuzzy string matching optimization (not checking certain words) - Excel VBA function

查看:231
本文介绍了模糊字符串匹配优化(不检查某些单词)-Excel VBA函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Excel中有一个函数可以计算两个字符串之间的Levenshtein距离(将一个字符串转换为另一个字符串所需的插入,删除和/或替换的数量).我将此作为我正在研究的项目的一部分,该项目涉及模糊字符串匹配".

I have a function in Excel that calculates the Levenshtein Distance between two strings (the number of insertions, deletions, and/or substitutions needed to transform one string into another). I am using this as part of a project I'm working on that involves "fuzzy string matching."

在下面,您将看到LevenshteinDistance函数和valuePhrase函数的代码.后者的存在是为了在我的电子表格中执行该功能.我是根据我在此线程中阅读的内容得出的.

Below you will see the code for the LevenshteinDistance function and a valuePhrase function. The latter exists for the purposes of executing the function in my spreadsheet. I have taken this from what I read in this thread.

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)`

Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
    Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
    Dim i As Long, j As Long, cost As Long 'loop counters and cost of 
        'substitution for current letter
    Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and 
        Substitution

    L1 = Len(S1): L2 = Len(S2)
    ReDim D(0 To L1, 0 To L2)
    For i = 0 To L1: D(i, 0) = i: Next i
    For j = 0 To L2: D(0, j) = j: Next j

    For j = 1 To L2
        For i = 1 To L1
            cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
            cI = D(i - 1, j) + 1
            cD = D(i, j - 1) + 1
            cS = D(i - 1, j - 1) + cost
            If cI <= cD Then 'Insertion or Substitution
                If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
            Else 'Deletion or Substitution
                If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
            End If
        Next i
    Next j
    LevenshteinDistance = D(L1, L2)

End Function

Public Function valuePhrase#(ByRef S1$, ByRef S2$)

    valuePhrase = LevenshteinDistance(S1, S2)

End Function

我正在一张工作表中的一个表中执行此valuePhrase函数,其中列标题和行标题是保险公司的名称.理想情况下,任何给定行中的最小数字(Levenshtein距离最短)应对应于表中保险公司名称与该行标题中该保险公司名称最接近的列标题.

I am executing this valuePhrase function in a table in one of my sheets where the column and row headers are names of insurance companies. Ideally, the smallest number in any given row (the shortest Levenshtein distance) should correspond to a column header with the name of the insurance company in the table that most closely matches the name of that insurance company in the row header.

我的问题是,当所讨论的字符串是保险公司的名称时,我正在尝试计算此值.考虑到这一点,上面的代码严格地计算了Levenshtein距离,并且并非专门针对这种情况而设计.举例说明,这可能是一个问题的简单示例,因为两个保险公司名称之间的Levenshtein距离如果两者都共用"insurance"和"company"这两个词,则它们之间的距离可能会很小(如您所期望的,这很常见),即使保险公司在其唯一词方面的名称完全不同.因此,我可能希望函数在比较两个字符串时忽略这些单词.

My problem is that I am trying to calculate this in a case where the strings in question are names of insurance companies. With that in mind, the code above strictly calculates the Levenshtein distance and is not tailored specifically to this case. To illustrate, a simple example of why this can be an issue is because the Levenshtein distance between two insurance company names can be quite small if they both share the words "insurance" and "company" (which, as you might expect, is common), even if the insurance companies have totally different names with respect to their unique words. So, I may want the function to ignore those words when comparing two strings.

我是VBA的新手.有没有办法可以在代码中实现此修复程序?作为第二个问题,比较保险公司的名称还会引起其他独特的问题吗?谢谢您的帮助!

I am new to VBA. Is there a way I can implement this fix in the code? As a secondary question, are there other unique issues that could arise from comparing the names of insurance companies? Thank you for the help!

推荐答案

您的整个问题都可以由如何在VBA中使用替换功能?"代替.通常,问题中的算法看起来很有趣,因此,我已经为您完成了这项工作.只需在函数的Array()中添加任何内容,它便会起作用(只需以小写形式写入数组中的值):

Your whole question can be replaced by "How do I use the replace function in VBA?". In general, the algorithm in the question looked interesting, thus I have done this for you. Simply add anything in the Array() of the function, it will work (Just write in lower case the values in the array):

Public Function removeSpecificWords(s As String) As String

 Dim arr     As Variant
 Dim cnt     As Long

 arr = Array("insurance", "company", "firma", "firm", "holding")
 removeSpecificWords = s

 For cnt = LBound(arr) To UBound(arr)
  removeSpecificWords = Replace(LCase(removeSpecificWords), LCase(arr(cnt)), vbNullString)
 Next cnt

End Function

Public Sub TestMe()

    Debug.Print removeSpecificWords("InsHolding")
    Debug.Print removeSpecificWords("InsuranceInsHoldingStar")

End Sub

在您的情况下:

    S1 = removeSpecificWords(S1)
    S2 = removeSpecificWords(S2)
    valuePhrase = LevenshteinDistance(S1, S2)

这篇关于模糊字符串匹配优化(不检查某些单词)-Excel VBA函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆