文本相似度分析(Excel) [英] Text similarity analysis (Excel)

查看:1974
本文介绍了文本相似度分析(Excel)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个项目列表,我想确定它们与该列表中其他项目的相似度。

I have a list of items and I want to identify their similarity in relation to the other items in this list.

我想要的输出将是一些事情的:

My desired output would be something along the lines of:

相似性列中显示的百分比纯粹是说明性的。我认为,相似性的测试是符合以下条件的:

The percentage shown in the similarity column is purely illustrative. I'm thinking that a test for similarity would be something along the lines of:


并发字母数/总字母数在
匹配的项目中

number of concurrent letters / by the total number of letters in the matched item

但是很想获得那个意见。

But would be keen to get opinions on that one.

这是Excel中合理可行的东西吗?我是一个只包含字母数字值的小数据集(140kb)。

Is this something which is reasonably doable on Excel? I'ts a small data set (140kb) containing only alphanumeric values.

我也可以接受这种方法,因为我以前没有处理过这样的事情!

Am also open to alternative ways of approaching this, as I haven't tackled anything like this before!

诗我一直在学习Python几个月,所以使用Python的建议也会很好!

P.s. I've been learning Python for a few months now, so suggestions using Python would also be good!

推荐答案

这是一个解决方案使用VBA UDF:

Here is a solution using a VBA UDF:

编辑:添加了一个新的可选参数,名为 arg_lMinConsecutive 用于确定必须匹配的连续字符的最小数量。请注意以下公式中的额外参数 2 ,表示至少有2个连续字符必须匹配。

EDIT: Added a new optional argument named arg_lMinConsecutive which is used to determine the minimum number of consecutive characters that must match. Note the extra argument 2 in the below formulas which indicates that at least 2 consecutive characters must match.

Public Function FuzzyMatch(ByVal arg_sText As String, _
                           ByVal arg_vList As Variant, _
                           ByVal arg_lOutput As Long, _
                           Optional ByVal arg_lMinConsecutive As Long = 1, _
                           Optional ByVal arg_bMatchCase As Boolean = True, _
                           Optional ByVal arg_bExactCount As Boolean = True) _
                As Variant

    Dim dExactCounts As Object
    Dim aResults() As Variant
    Dim vList As Variant
    Dim vListItem As Variant
    Dim sLetter As String
    Dim dMaxMatch As Double
    Dim lMaxIndex As Long
    Dim lResultIndex As Long
    Dim lLastMatch As Long
    Dim i As Long
    Dim bMatch As Boolean

    If arg_lMinConsecutive <= 0 Then
        FuzzyMatch = CVErr(xlErrNum)
        Exit Function
    End If

    If arg_bExactCount = True Then Set dExactCounts = CreateObject("Scripting.Dictionary")

    If TypeName(arg_vList) = "Collection" Or TypeName(arg_vList) = "Range" Then
        ReDim aResults(1 To arg_vList.Count, 1 To 3)
        Set vList = arg_vList
    ElseIf IsArray(arg_vList) Then
        ReDim aResults(1 To UBound(arg_vList) - LBound(arg_vList) + 1, 1 To 3)
        vList = arg_vList
    Else
        ReDim vList(1 To 1)
        vList(1) = arg_vList
        ReDim aResults(1 To 1, 1 To 3)
    End If

    dMaxMatch = 0#
    lMaxIndex = 0
    lResultIndex = 0

    For Each vListItem In vList
        If vListItem <> arg_sText Then
            lLastMatch = -arg_lMinConsecutive
            lResultIndex = lResultIndex + 1
            aResults(lResultIndex, 3) = vListItem
            If arg_bExactCount Then dExactCounts.RemoveAll
            For i = 1 To Len(arg_sText) - arg_lMinConsecutive + 1
                bMatch = False
                sLetter = Mid(arg_sText, i, arg_lMinConsecutive)
                If Not arg_bMatchCase Then sLetter = LCase(sLetter)
                If arg_bExactCount Then dExactCounts(sLetter) = dExactCounts(sLetter) + 1

                Select Case Abs(arg_bMatchCase) + Abs(arg_bExactCount) * 2
                    Case 0
                        'MatchCase is false and ExactCount is false
                        If InStr(1, vListItem, sLetter, vbTextCompare) > 0 Then bMatch = True

                    Case 1
                        'MatchCase is true and ExactCount is false
                        If InStr(1, vListItem, sLetter) > 0 Then bMatch = True

                    Case 2
                        'MatchCase is false and ExactCount is true
                        If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString, Compare:=vbTextCompare)) >= dExactCounts(sLetter) Then bMatch = True

                    Case 3
                        'MatchCase is true and ExactCount is true
                        If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString)) >= dExactCounts(sLetter) Then bMatch = True

                End Select

                If bMatch Then
                    aResults(lResultIndex, 1) = aResults(lResultIndex, 1) + WorksheetFunction.Min(arg_lMinConsecutive, i - lLastMatch)
                    lLastMatch = i
                End If
            Next i
            If Len(vListItem) > 0 Then
                aResults(lResultIndex, 2) = aResults(lResultIndex, 1) / Len(vListItem)
                If aResults(lResultIndex, 2) > dMaxMatch Then
                    dMaxMatch = aResults(lResultIndex, 2)
                    lMaxIndex = lResultIndex
                End If
            Else
                aResults(lResultIndex, 2) = 0
            End If
        End If
    Next vListItem

    If dMaxMatch = 0# Then
        Select Case arg_lOutput
            Case 1:     FuzzyMatch = 0
            Case 2:     FuzzyMatch = vbNullString
            Case Else:  FuzzyMatch = CVErr(xlErrNum)
        End Select
    Else
        Select Case arg_lOutput
            Case 1:     FuzzyMatch = Application.Min(1, aResults(lMaxIndex, 2))
            Case 2:     FuzzyMatch = aResults(lMaxIndex, 3)
            Case Else:  FuzzyMatch = CVErr(xlErrNum)
        End Select
    End If

End Function

仅使用A列和B列中的原始数据,您可以使用UDF获得所需的r列出C和D列:

Using only the original data in columns A and B, you can use this UDF to get the desired results in columns C and D:

单元格C2和复制的是这个公式:

In cell C2 and copied down is this formula:

=FuzzyMatch($B2,$B$2:$B$6,COLUMN(A2),2)

在单元格D2中,复制下来是这个公式:

In cell D2 and copied down is this formula:

=IFERROR(INDEX(A:A,MATCH(FuzzyMatch($B2,$B$2:$B$6,COLUMN(B2),2),B:B,0)),"-")

请注意,他们都使用 FuzzyMatch UDF。

Note that they both use the FuzzyMatch UDF.

这篇关于文本相似度分析(Excel)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆