潜在重复检测，具有3级严重性级别 [英] Potential Duplicates Detection, with 3 Severity Level

查看：123 发布时间：2017/7/20 23:09:00 excel vba excel-vba duplicates

本文介绍了潜在重复检测，具有3级严重性级别的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想制作一个程序，以3严重级别检测潜在的重复。
让我考虑我的数据只有两列，但是有数千行。第二列中的
数据只用逗号分隔。数据示例：

I wanna make a program that detect a potential duplicates with 3 severity level. let consider my data is only in two column, but with thousands row. data in second column delimited only with comma. data example :

Number | Material
1      | helmet,valros,42
2      | helmet,iron,knight
3      | valros,helmet,42
4      | knight,helmet
5      | valros,helmet,42
6      | plain,helmet
7      | helmet, leather

，我的3级是：

非常高： A，B，C vs A，B，C

very high : A,B,C vs A,B,C

： A，B，C 对 B，C，A

high : A,B,C vs B,C,A

B，C vs A，B

到目前为止，我只能做第一级，我不知道如何做第二和第三级。

so far i just can make the first level, i don't know how to do the 2nd and the 3rd level.

我曾尝试过。

Sub duplicates_separation() Dim duplicate(), i As Long Dim delrange As Range, cell As Long Dim shtIn As Worksheet, shtOut As Worksheet Set shtIn = ThisWorkbook.Sheets("input") Set shtOut = ThisWorkbook.Sheets("output") x = 2 y = 1 Set delrange = shtIn.Range("b1:b10000") 'set your range here ReDim duplicate(0) 'search duplicates in 2nd column For cell = 1 To delrange.Cells.Count If Application.CountIf(delrange, delrange(cell)) > 1 Then ReDim Preserve duplicate(i) duplicate(i) = delrange(cell).Address i = i + 1 End If Next 'print duplicates For i = UBound(duplicate) To LBound(duplicate) Step -1 shtOut.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value End Sub

程序检测到的副本：

3 | valros,helmet,42 5 | valros,helmet,42

我的预期：

Number | Material 1 | helmet,valros,42 3 | valros,helmet,42 5 | valros,helmet,42 4 | knight,helmet 2 | helmet,iron,knight

我有一个想法检测重复lv 2，但我认为这将是

i have an idea for detecting duplicates lv 2, but I think it will be so complicated and make the program slow.

将列2转换为列为text to columns命令

连接列

如果在检测到重复项中已经完成了这个列表lv 1

turn column 2 to columns with "text to columns" command

sort column from A to Z (alphabetically)

concatenate the column

do countif like done in detecting duplicates lv 1

是否有一种方法来检测第二&第三级重复？

is there a way to detect the 2nd & 3rd level duplicates?

更新

昨天我去朋友的房子咨询这个问题，但他的解决方案是在JAVA languange ..>我不明白

Yesterday I went to friend's house to consult about this problem, but his solution is in JAVA languange.. >which i don't understand

public class ali { static void sPrint(String[] Printed) { for (int iC = 0; iC < Printed.length; iC++) { System.out.print(String.valueOf(Printed[iC]) + " | "); } System.out.println(); } public static void main(String Args[]) { int defaultLength = 10; int indexID = 0; int indexDesc = 1; String[] DETECTORP1 = new String[defaultLength]; String[] DETECTORP2 = new String[defaultLength]; String[] DETECTORP3 = new String[defaultLength]; String[] DETECTORP4 = new String[defaultLength]; String[][] theString = new String[5][2]; theString[0] = new String[]{"1", "A, B, C, D"}; theString[1] = new String[]{"2", "A, B, C, D"}; theString[2] = new String[]{"3", "A, B, C, D, E"}; theString[3] = new String[]{"4", "A, B, D, C, E"}; theString[4] = new String[]{"5", "A, B, D, C, E, F"}; int P1 = 0; int P2 = 0; int P3 = 0; int P4 = 0; for (int iC = 0; iC < theString.length; iC++) { System.out.println(theString[iC][indexID] + " -> " + theString[iC][indexDesc]); } for (int iC = 0; iC < theString.length; iC++) { int LEX; String theReference[] = theString[iC][indexDesc].replace(",", ";;").split(";;"); for (int iD = 0; iD < theString.length; iD++) { if (iC != iD) { String theCompare[] = theString[iD][1].replace(",", ";;").split(";;"); if (theReference.length == theCompare.length) { LEX=0; int theLength = theReference.length; for (int iE = 0; iE < theLength; iE++) { if (theReference[iE].equals(theCompare[iE])) { LEX += 1; } } if (LEX == theLength) { DETECTORP1[P1] = theString[iC][indexID] + " WITH " + theString[iD][indexID]; P1 += 1; } else { LEX = 0; for (int iF = 0; iF < theReference.length; iF++) { for (int iG = 0; iG < theCompare.length; iG++) { if (theReference[iF].equals(theCompare[iG])) { LEX += 1; break; } } } if (LEX == theReference.length) { DETECTORP2[P2] = theString[iC][indexID] + " WITH " + theString[iD][indexID]; P2 += 1; } } } else { LEX = 0; if (theReference.length > theCompare.length) { for (int iF = 0; iF < theReference.length; iF++) { for (int iG = 0; iG < theCompare.length; iG++) { if (iG == iF) { if (theReference[iF].equals(theCompare[iF])) { LEX += 1; break; } } } } if (LEX <= theReference.length && LEX >= theCompare.length) { DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID]; P3 += 1; } } else { LEX =0; for (int iF = 0; iF < theCompare.length; iF++) { for (int iG = 0; iG < theReference.length; iG++) { if (iG == iF) { if (theCompare[iF].equals(theReference[iF])) { LEX += 1; // System.out.println(theReference[iG] + "==" + theCompare[iG]); break; } } } } if (LEX <= theCompare.length && LEX >= theReference.length) { DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID]; P3 += 1; } } } } } } sPrint(DETECTORP1); sPrint(DETECTORP2); sPrint(DETECTORP3); } }

如何在VBA中执行此操作？

how to do this in VBA?

推荐答案

真的，这取决于你想如何定义严重性级别。这是一种方法，不一定是最好的：使用Levensthein距离。

Really, it depends how you want to define "severity level". Here's one way to do it, not necessarily the best: Use the Levensthein distance.

用单字符属性符号表示您的每个项目，例如

Represent each of your items by a one-character attribute symbol, e.g.

H helmet K knight I iron $ Leather ^ Valros ╔ Plain ¢ Whatever etc.

然后转换您的物料列表包含代表这些属性的字符序列：

Then convert your Material lists into a string containing sequence of characters representing these attributes:

HIK = helmet,iron,knight ¢H = plain,helmet

然后计算这两个字符串之间的Levenshtein距离。那将是你的严重程度。

Then compute the Levenshtein distance between those two strings. That will be your "severity level".

Debug.Print LevenshteinDistance("HIK","¢H") 'returns 3

Levenshtein距离的两个实现显示在维基百科。实际上你是运气好的：StackOverflow上有人将其移植到VBA 。

Two implementations of the Levenshtein distance are shown in Wikipedia. And indeed you are in luck: someone on StackOverflow ported this to VBA.

在下面的评论部分，你说你不喜欢用单字符符号表示每个可能的属性。很公平;我同意这有点傻。解决方法：事实上，实际上可以调整Levenshtein Distance算法来查找字符串中的每个字符，而不是在数组的每个元素上，并根据此进行比较。我将介绍如何在我的回答中对您的后续问题进行此更改。

In the comments section below, you say you don't like having to represent each of your possible attributes by one-character symbols. Fair enough; I agree this is a bit silly. Workaround: It is, in fact, possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. I show how to make this change in my answer to your follow-up question.

这篇关于潜在重复检测，具有3级严重性级别的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

潜在重复检测，具有3级严重性级别 [英] Potential Duplicates Detection, with 3 Severity Level

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

潜在重复检测，具有3级严重性级别 [英] Potential Duplicates Detection, with 3 Severity Level

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭