Finding text similarities between row values in Excel

Question

Let's say I have 9 rows of records, and every 3 rows share the same value. For instance:

Mike  
Mike  
Mike  
John  
John  
John  
Ryan  
Ryan  
Ryan

Is there a way to search these records for near-matches, such as spelling mistakes, additional characters, or missing characters? For example, the correct version is Mike, but somewhere down the list there might be a record with the value Mke, which is a spelling mistake. How can I find it and replace it with the correct value?

The above example is obviously simplified. I actually have about 1 million rows. Right now, to achieve the 'grouping' of the elements, I just sort them alphabetically.

Recommended answer

I was facing the exact same problem! After a few searches I found and modified the following VBA code, which adds a worksheet function named =Similarity(). This function returns a number from 0 to 1, according to how similar the two input cells are.

  • How I used it:

I sorted my column alphabetically and applied the formula. Then I created a Conditional Formatting rule to highlight the rows with a high similarity rate (i.e. at least 65%). Finally, I went through each highlighted occurrence and fixed my records manually.
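The workflow above could also be scripted rather than done with a formula plus conditional formatting. The following is only a minimal sketch under several assumptions that are not in the original answer: the data sits in column A of the active sheet with no header row, the column is already sorted, column B is free to be overwritten, no cell contains an error value, and the Similarity function given in this answer is already pasted into the workbook. The macro name FlagSimilarNeighbours and the choice to skip exact matches (score of 1) are hypothetical.

```vba
' Hypothetical helper: writes a score into column B for every row whose value
' is similar, but not identical, to the value in the row directly above it.
' Assumes sorted data in column A (no header) and that Similarity() exists.
Public Sub FlagSimilarNeighbours()
    Const THRESHOLD As Single = 0.65   ' same cut-off as the 65% rule above
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long
    Dim values As Variant
    Dim scores() As Variant
    Dim score As Single

    Set ws = ActiveSheet
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    If lastRow < 2 Then Exit Sub       ' nothing to compare

    ' Read the whole column once; per-cell access is slow on large sheets
    values = ws.Range("A1:A" & lastRow).Value
    ReDim scores(1 To lastRow, 1 To 1)

    For r = 2 To lastRow
        score = Similarity(CStr(values(r - 1, 1)), CStr(values(r, 1)))
        ' Keep only near-duplicates; identical neighbours need no fixing
        If score >= THRESHOLD And score < 1 Then scores(r, 1) = score
    Next r

    ' Write all results back in one operation (overwrites column B)
    ws.Range("B1:B" & lastRow).Value = scores
End Sub
```

After running it, filtering column B for non-blank cells would list the candidate rows to review by hand. On a sheet with around a million rows this may still take a while, since each comparison is a nested loop over both strings.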

  • Usage:

=Similarity(cell1, cell2)

Note: the similarity index goes from 0 to 1 (0% to 100%).

  • To use it, you must:

  1. Open the VBE (Alt+F11)
  2. Insert a Module
  3. Paste the following code into the Module window

Code:

Public Function Similarity(ByVal String1 As String, _
    ByVal String2 As String, _
    Optional ByRef RetMatch As String, _
    Optional min_match = 1) As Single

Dim b1() As Byte, b2() As Byte
Dim lngLen1 As Long, lngLen2 As Long
Dim lngResult As Long

If UCase(String1) = UCase(String2) Then
    Similarity = 1
Else
    lngLen1 = Len(String1)
    lngLen2 = Len(String2)
    If (lngLen1 = 0) Or (lngLen2 = 0) Then
        Similarity = 0
    Else
        b1() = StrConv(UCase(String1), vbFromUnicode)
        b2() = StrConv(UCase(String2), vbFromUnicode)
        lngResult = Similarity_sub(0, lngLen1 - 1, _
        0, lngLen2 - 1, _
        b1, b2, _
        String1, _
        RetMatch, _
        min_match)
        Erase b1
        Erase b2
        If lngLen1 >= lngLen2 Then
            Similarity = lngResult / lngLen1
        Else
            Similarity = lngResult / lngLen2
        End If
    End If
End If

End Function

Private Function Similarity_sub(ByVal start1 As Long, ByVal end1 As Long, _
                                ByVal start2 As Long, ByVal end2 As Long, _
                                ByRef b1() As Byte, ByRef b2() As Byte, _
                                ByVal FirstString As String, _
                                ByRef RetMatch As String, _
                                ByVal min_match As Long, _
                                Optional recur_level As Integer = 0) As Long
'* CALLED BY: Similarity *(RECURSIVE)

Dim lngCurr1 As Long, lngCurr2 As Long
Dim lngMatchAt1 As Long, lngMatchAt2 As Long
Dim I As Long
Dim lngLongestMatch As Long, lngLocalLongestMatch As Long
Dim strRetMatch1 As String, strRetMatch2 As String

If (start1 > end1) Or (start1 < 0) Or (end1 - start1 + 1 < min_match) _
Or (start2 > end2) Or (start2 < 0) Or (end2 - start2 + 1 < min_match) Then
    Exit Function '(exit if start/end is out of string, or length is too short)
End If

For lngCurr1 = start1 To end1
    For lngCurr2 = start2 To end2
        I = 0
        Do Until b1(lngCurr1 + I) <> b2(lngCurr2 + I)
            I = I + 1
            If I > lngLongestMatch Then
                lngMatchAt1 = lngCurr1
                lngMatchAt2 = lngCurr2
                lngLongestMatch = I
            End If
            If (lngCurr1 + I) > end1 Or (lngCurr2 + I) > end2 Then Exit Do
        Loop
    Next lngCurr2
Next lngCurr1

If lngLongestMatch < min_match Then Exit Function

lngLocalLongestMatch = lngLongestMatch
RetMatch = ""

lngLongestMatch = lngLongestMatch _
+ Similarity_sub(start1, lngMatchAt1 - 1, _
start2, lngMatchAt2 - 1, _
b1, b2, _
FirstString, _
strRetMatch1, _
min_match, _
recur_level + 1)
If strRetMatch1 <> "" Then
    RetMatch = RetMatch & strRetMatch1 & "*"
Else
    RetMatch = RetMatch & IIf(recur_level = 0 _
    And lngLocalLongestMatch > 0 _
    And (lngMatchAt1 > 1 Or lngMatchAt2 > 1) _
    , "*", "")
End If


RetMatch = RetMatch & Mid$(FirstString, lngMatchAt1 + 1, lngLocalLongestMatch)


lngLongestMatch = lngLongestMatch _
+ Similarity_sub(lngMatchAt1 + lngLocalLongestMatch, end1, _
lngMatchAt2 + lngLocalLongestMatch, end2, _
b1, b2, _
FirstString, _
strRetMatch2, _
min_match, _
recur_level + 1)

If strRetMatch2 <> "" Then
    RetMatch = RetMatch & "*" & strRetMatch2
Else
    RetMatch = RetMatch & IIf(recur_level = 0 _
    And lngLocalLongestMatch > 0 _
    And ((lngMatchAt1 + lngLocalLongestMatch < end1) _
    Or (lngMatchAt2 + lngLocalLongestMatch < end2)) _
    , "*", "")
End If

Similarity_sub = lngLongestMatch

End Function
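To check that the function was pasted correctly, a few known cases can be run from the Immediate window (Ctrl+G in the VBE). This is just a sketch: the Sub name SimilaritySanityCheck is hypothetical, and the expected values come from tracing the code above by hand, not from the original answer. The score is the total length of the matching substrings, found recursively, divided by the length of the longer string. For "Mike" versus "Mke", the matches are "KE" and then "M", which gives 3 / 4 = 0.75.

```vba
' Hypothetical sanity check for the Similarity function above.
' Execution stops on the failing line if any assertion does not hold.
Public Sub SimilaritySanityCheck()
    ' The comparison is case-insensitive, so this is an exact match
    Debug.Assert Similarity("Mike", "mike") = 1

    ' One missing character: "KE" + "M" match, 3 of 4 characters
    Debug.Assert Similarity("Mike", "Mke") = 0.75

    ' One extra character: "MIKE" matches, 4 of 5 characters
    Debug.Assert Abs(Similarity("Mike", "Mikee") - 0.8) < 0.0001

    ' No characters in common, or one side empty
    Debug.Assert Similarity("Mike", "John") = 0
    Debug.Assert Similarity("Mike", "") = 0

    Debug.Print "Similarity checks passed"
End Sub
```

With the 65% threshold mentioned earlier, both the "Mke" and "Mikee" variants would be highlighted next to "Mike", while "John" would not.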
