潜在重复检测,具有 3 个严重级别 [英] Potential Duplicates Detection, with 3 Severity Level

查看:11
本文介绍了潜在重复检测,具有 3 个严重级别的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想制作一个程序来检测具有 3 个严重级别的潜在重复项.让我们考虑一下我的数据只有两列,但有数千行.第二列中的数据仅用逗号分隔.数据示例:

I wanna make a program that detect a potential duplicates with 3 severity level. let consider my data is only in two column, but with thousands row. data in second column delimited only with comma. data example :

Number | Material
1      | helmet,valros,42
2      | helmet,iron,knight
3      | valros,helmet,42
4      | knight,helmet
5      | valros,helmet,42
6      | plain,helmet
7      | helmet, leather

我的 3 个级别是:

非常高: A,B,C vs A,B,C

very high : A,B,C vs A,B,C

高: A,B,C vs B,C,A

high : A,B,C vs B,C,A

一般: A,B,C vs A,B

so so : A,B,C vs A,B

目前我只能做第一关,我不知道如何做第二和第三关.

so far i just can make the first level, i don't know how to do the 2nd and the 3rd level.

我尝试过的.

Sub duplicates_separation()

    Dim duplicate(), i As Long
    Dim delrange As Range, cell As Long
    Dim shtIn As Worksheet, shtOut As Worksheet



     Set shtIn = ThisWorkbook.Sheets("input")
    Set shtOut = ThisWorkbook.Sheets("output")

    x = 2
    y = 1

    Set delrange = shtIn.Range("b1:b10000")  'set your range here

   ReDim duplicate(0)
'search duplicates in 2nd column
    For cell = 1 To delrange.Cells.Count
        If Application.CountIf(delrange, delrange(cell)) > 1 Then
            ReDim Preserve duplicate(i)
            duplicate(i) = delrange(cell).Address
            i = i + 1
        End If
    Next


        'print duplicates
    For i = UBound(duplicate) To LBound(duplicate) Step -1
    shtOut.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value


End Sub

程序检测到的重复项:

3      | valros,helmet,42
 5      | valros,helmet,42

我的预期:

Number | Material
1      | helmet,valros,42
3      | valros,helmet,42
5      | valros,helmet,42
4      | knight,helmet
2      | helmet,iron,knight        

我有一个检测重复 lv 2 的想法,但我认为它会很复杂,并且会使程序变慢.

i have an idea for detecting duplicates lv 2, but I think it will be so complicated and make the program slow.

  1. 使用文本到列"命令将第 2 列转换为列
  2. 从 A 到 Z 对列进行排序(按字母顺序)
  3. 连接列
  4. 在检测重复 lv 1 时像做一样计数

有没有办法检测第二个 &第三级重复?

is there a way to detect the 2nd & 3rd level duplicates?

更新

昨天我去朋友家咨询这个问题,但是他的解决方案是JAVA语言..>我不明白

Yesterday I went to friend's house to consult about this problem, but his solution is in JAVA languange.. >which i don't understand

public class ali {

    static void sPrint(String[] Printed) {
        for (int iC = 0; iC < Printed.length; iC++) {
            System.out.print(String.valueOf(Printed[iC]) + " | ");
        }
        System.out.println();
    }

    public static void main(String Args[]) {
        int defaultLength = 10;
        int indexID = 0;
        int indexDesc = 1;
        String[] DETECTORP1 = new String[defaultLength];
        String[] DETECTORP2 = new String[defaultLength];
        String[] DETECTORP3 = new String[defaultLength];
        String[] DETECTORP4 = new String[defaultLength];
        String[][] theString = new String[5][2];
        theString[0] = new String[]{"1", "A, B, C, D"};
        theString[1] = new String[]{"2", "A, B, C, D"};
        theString[2] = new String[]{"3", "A, B, C, D, E"};
        theString[3] = new String[]{"4", "A, B, D, C, E"};
        theString[4] = new String[]{"5", "A, B, D, C, E, F"};
        int P1 = 0;
        int P2 = 0;
        int P3 = 0;
        int P4 = 0;
        for (int iC = 0; iC < theString.length; iC++) {
            System.out.println(theString[iC][indexID] + " -> " + theString[iC][indexDesc]);
        }
        for (int iC = 0; iC < theString.length; iC++) {
            int LEX;
            String theReference[] = theString[iC][indexDesc].replace(",", ";;").split(";;");
            for (int iD = 0; iD < theString.length; iD++) {
                if (iC != iD) {
                    String theCompare[] = theString[iD][1].replace(",", ";;").split(";;");
                    if (theReference.length == theCompare.length) {
                        LEX=0;
                        int theLength = theReference.length;
                        for (int iE = 0; iE < theLength; iE++) {
                            if (theReference[iE].equals(theCompare[iE])) {
                                LEX += 1;
                            }
                        }
                        if (LEX == theLength) {
                            DETECTORP1[P1] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
                            P1 += 1;
                        } else {
                            LEX = 0;
                            for (int iF = 0; iF < theReference.length; iF++) {
                                for (int iG = 0; iG < theCompare.length; iG++) {
                                    if (theReference[iF].equals(theCompare[iG])) {
                                        LEX += 1;
                                        break;
                                    }
                                }
                            }
                            if (LEX == theReference.length) {
                                DETECTORP2[P2] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
                                P2 += 1;
                            }

                        }

                    } else {
                        LEX = 0;
                        if (theReference.length > theCompare.length) {
                            for (int iF = 0; iF < theReference.length; iF++) {
                                for (int iG = 0; iG < theCompare.length; iG++) {
                                    if (iG == iF) {
                                        if (theReference[iF].equals(theCompare[iF])) {
                                            LEX += 1;
                                            break;
                                        }
                                    }
                                }
                            }
                            if (LEX <= theReference.length && LEX >= theCompare.length) {
                                DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
                                P3 += 1;
                            }
                        } else {
                            LEX =0;
                            for (int iF = 0; iF < theCompare.length; iF++) {
                                for (int iG = 0; iG < theReference.length; iG++) {
                                    if (iG == iF) {
                                        if (theCompare[iF].equals(theReference[iF])) {
                                            LEX += 1;
                                        //    System.out.println(theReference[iG] + "==" + theCompare[iG]);
                                            break;
                                        }
                                    }
                                }
                            }
                            if (LEX <= theCompare.length && LEX >= theReference.length) {
                                DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
                                P3 += 1;
                            }
                        }

                    }
                }

            }

        }
        sPrint(DETECTORP1);
        sPrint(DETECTORP2);
        sPrint(DETECTORP3);
    }
}

如何在 VBA 中做到这一点?

how to do this in VBA?

推荐答案

真的,这取决于您要如何定义严重性级别".这是一种方法,不一定是最好的:使用 Levensthein 距离.

Really, it depends how you want to define "severity level". Here's one way to do it, not necessarily the best: Use the Levensthein distance.

用一个字符的属性符号来表示您的每件商品,例如

Represent each of your items by a one-character attribute symbol, e.g.

H    helmet
K    knight
I    iron
$    Leather
^    Valros
╔    Plain
¢    Whatever
etc.

然后将您的材质列表转换为包含代表这些属性的字符序列的字符串:

Then convert your Material lists into a string containing sequence of characters representing these attributes:

HIK = helmet,iron,knight
¢H  = plain,helmet

然后计算这两个字符串之间的 Levenshtein 距离.那将是您的严重程度".

Then compute the Levenshtein distance between those two strings. That will be your "severity level".

Debug.Print LevenshteinDistance("HIK","¢H")
'returns 3

维基百科中显示了 Levenshtein 距离的两种实现.确实你很幸运:StackOverflow 上的某个人将此移植到了 VBA.

Two implementations of the Levenshtein distance are shown in Wikipedia. And indeed you are in luck: someone on StackOverflow ported this to VBA.

在下面的评论部分,您说您不喜欢用一个字符的符号来表示每个可能的属性.很公平;我同意这有点傻.解决方法:事实上,可以调整 Levenshtein 距离算法,使其不查看字符串中的每个字符,而是查看数组的每个元素,并在此基础上进行比较.我将在 我的回答 中对您的后续问题展示如何进行此更改.

In the comments section below, you say you don't like having to represent each of your possible attributes by one-character symbols. Fair enough; I agree this is a bit silly. Workaround: It is, in fact, possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. I show how to make this change in my answer to your follow-up question.

这篇关于潜在重复检测,具有 3 个严重级别的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆