使用Levenshtein距离查找以逗号分隔的列表的近似重复 [英] Find near-duplicates of comma-separated lists using Levenshtein distance

查看:115
本文介绍了使用Levenshtein距离查找以逗号分隔的列表的近似重复的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题基于我的问题昨天。

This question based on the answer of my question yesterday.

为了解决我的问题,Jean-FrançoisCorbett 建议 a Levenshtein距离方法。然后我发现这个代码在某处得到Levenshtein距离百分比。

To solve my problem, Jean-François Corbett suggested a Levenshtein distance approach. Then I found this code somewhere to get Levenshtein distance percentage.

Public Function GetLevenshteinPercentMatch( _
    ByVal string1 As String, ByVal string2 As String, _
    Optional Normalised As Boolean = False) As Single
Dim iLen As Integer
    If Normalised = False Then
        string1 = UCase$(WorksheetFunction.Trim(string1))
        string2 = UCase$(WorksheetFunction.Trim(string2))
    End If
    iLen = WorksheetFunction.Max(Len(string1), Len(string2))
    GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(string1, string2)) / iLen
End Function

'********************************
'*** Compute Levenshtein Distance
'********************************

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim i As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

  ' Step 1
  N = Len(s)
  m = Len(t)
  If N = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = N
    Exit Function
  End If
  ReDim d(0 To N, 0 To m) As Integer

  ' Step 2
  For i = 0 To N
    d(i, 0) = i
  Next i

  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For i = 1 To N
    s_i = Mid$(s, i, 1)
    ' Step 4
    For j = 1 To m
      t_j = Mid$(t, j, 1)
      ' Step 5
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      ' Step 6
      d(i, j) = WorksheetFunction.Min( _
          d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)

    Next j
  Next i

  ' Step 7
  LevenshteinDistance = d(N, m)

End Function

我现在有一个代码可以在一列中找到完全重复的代码,

What I have now is a code that finds exact duplicates in one column,

 Dim duplicate(), i As Long
    Dim delrange As Range, cell As Long
    Dim shtIn As Worksheet, Shtout As Worksheet
    Dim numofrows1
    dim numofrows2
    dim j as long

    Set shtIn = ThisWorkbook.Sheets("process")
    Set Shtout = ThisWorkbook.Sheets("output")

    x = 2
    y = 1

    Set delrange = shtIn.Range("h1:h30000")  'set your range here

    ReDim duplicate(0)
    'search duplicates in 2nd column
    For cell = 1 To delrange.Cells.Count
         If Application.CountIf(delrange, delrange(cell)) > 1 Then
             ReDim Preserve duplicate(i)
             duplicate(i) = delrange(cell).Address
             i = i + 1
         End If
    Next

    'print duplicates
    For i = UBound(duplicate) To LBound(duplicate) Step -1
    Shtout.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value
    x = x + 1
Next i

numofrows2 = Shtout.Cells(Shtout.Rows.Count, 1).End(xlUp).Row - 1

If Shtout.Cells(2, 1).Value = "" Then
    MsgBox ("No Duplicates Found!")
Else
    MsgBox (numofrows1 & " " & "Potential Duplicates Found")
End If

End Sub

我认为如果我可以组合这两个代码会很好,但是Levenshtein的距离是比较2个字符串。所以它不能一起工作

I think that it will be nice if I can combine this two code, but Levenshtein distance is to compare 2 strings. So it can't work together.

我坚持在这里,因为我根本不知道,我阅读的每一个参考资料都告诉我们比较两个字符串。

I stuck here because I have no idea at all, every reference that I read all tell about comparing two string.

如果参数这个简单:检测到重复,如果Levenshtein距离百分比高于90%。

if the parameter this simple : detected as duplicate if the Levenshtein distance percentage is above 90%.

我必须在此代码中更改?

What I must change in this code?

推荐答案

我很高兴我的早期回答对你有用你不喜欢用单字符符号表示每个可能的属性...

I'm glad my earlier answer was useful to you. You didn't like having to represent each of your possible attributes by one-character symbols...

好的,当我尝试在你的评论中发信号时,可以调整Levenshtein Distance算法来查找字符串中的每个字符,而不是在数组的每个元素上进行比较,并根据此进行比较。事实上,这样做是非常简单的:

Ok, as I try to signal to you in the comments there, it is possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. In fact it's quite straightforward to make this change:

之前<步骤1 ,将逗号分隔的字符串转换为数组如下:

Before 'Step 1, convert your comma-separated strings into arrays like this:

Dim sSplit() As String
Dim tSplit() As String
sSplit = Split(s, ",")
tSplit = Split(t, ",")

然后替换这四行代码

N = Len(s)
m = Len(t)
s_i = Mid$(s, i, 1)
t_j = Mid$(t, j, 1)

与这些

N = UBound(sSplit) + 1
m = UBound(tSplit) + 1
s_i = sSplit(i - 1)
t_j = tSplit(j - 1)

+ 1 - 1 在那里,因为拆分返回一个基于零的数组。

The + 1 and - 1 are there because Split returns a zero-based array.

使用示例:

?LevenshteinDistance("valros,helmet,42","valros,helmet,42")
 0 
?LevenshteinDistance("valros,helmet,42","knight,helmet")
 2 
?LevenshteinDistance("helmet,iron,knight","plain,helmet")
 3 

code> 0 表示两个字符串相同。您不需要单独的代码来处理此问题。

Note that 0 means the two strings are identical. You don't need separate code to deal with this.

使用上述,您应该可以完成任务。

With the above you should be able to complete your task.

另外一个注意事项: Damerau-Levenshtein距离可能是比 Levenshtein distance <一>。不同的是,除了插入/删除/替换之外,D-M距离还考虑了两个相邻字符的转置。由你决定。

One more note: the Damerau–Levenshtein distance may be a more relevant algorithm for you than the Levenshtein distance. The difference is that in addition to insertion/deletion/substitution, the D-M distance also considers transposition of two adjacent characters. Up to you to decide.

这篇关于使用Levenshtein距离查找以逗号分隔的列表的近似重复的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆