潜在重复检测,具有3级严重性级别 [英] Potential Duplicates Detection, with 3 Severity Level
问题描述
我想制作一个程序,以3严重级别检测潜在的重复。
让我考虑我的数据只有两列,但是有数千行。第二列中的
数据只用逗号分隔。数据示例:
I wanna make a program that detect a potential duplicates with 3 severity level. let consider my data is only in two column, but with thousands row. data in second column delimited only with comma. data example :
Number | Material
1 | helmet,valros,42
2 | helmet,iron,knight
3 | valros,helmet,42
4 | knight,helmet
5 | valros,helmet,42
6 | plain,helmet
7 | helmet, leather
,我的3级是:
非常高: A,B,C vs A,B,C
very high : A,B,C vs A,B,C
: A,B,C 对 B,C,A
high : A,B,C vs B,C,A
B,C vs A,B
到目前为止,我只能做第一级,我不知道如何做第二和第三级。
so far i just can make the first level, i don't know how to do the 2nd and the 3rd level.
我曾尝试过。
Sub duplicates_separation()
Dim duplicate(), i As Long
Dim delrange As Range, cell As Long
Dim shtIn As Worksheet, shtOut As Worksheet
Set shtIn = ThisWorkbook.Sheets("input")
Set shtOut = ThisWorkbook.Sheets("output")
x = 2
y = 1
Set delrange = shtIn.Range("b1:b10000") 'set your range here
ReDim duplicate(0)
'search duplicates in 2nd column
For cell = 1 To delrange.Cells.Count
If Application.CountIf(delrange, delrange(cell)) > 1 Then
ReDim Preserve duplicate(i)
duplicate(i) = delrange(cell).Address
i = i + 1
End If
Next
'print duplicates
For i = UBound(duplicate) To LBound(duplicate) Step -1
shtOut.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value
End Sub
程序检测到的副本:
3 | valros,helmet,42
5 | valros,helmet,42
我的预期:
Number | Material
1 | helmet,valros,42
3 | valros,helmet,42
5 | valros,helmet,42
4 | knight,helmet
2 | helmet,iron,knight
我有一个想法检测重复lv 2,但我认为这将是
i have an idea for detecting duplicates lv 2, but I think it will be so complicated and make the program slow.
- 将列2转换为列为text to columns命令
- 连接列
- 如果在检测到重复项中已经完成了这个列表lv 1
- turn column 2 to columns with "text to columns" command
- sort column from A to Z (alphabetically)
- concatenate the column
- do countif like done in detecting duplicates lv 1
是否有一种方法来检测第二&第三级重复?
is there a way to detect the 2nd & 3rd level duplicates?
更新
昨天我去朋友的房子咨询这个问题,但他的解决方案是在JAVA languange ..>我不明白
Yesterday I went to friend's house to consult about this problem, but his solution is in JAVA languange.. >which i don't understand
public class ali {
static void sPrint(String[] Printed) {
for (int iC = 0; iC < Printed.length; iC++) {
System.out.print(String.valueOf(Printed[iC]) + " | ");
}
System.out.println();
}
public static void main(String Args[]) {
int defaultLength = 10;
int indexID = 0;
int indexDesc = 1;
String[] DETECTORP1 = new String[defaultLength];
String[] DETECTORP2 = new String[defaultLength];
String[] DETECTORP3 = new String[defaultLength];
String[] DETECTORP4 = new String[defaultLength];
String[][] theString = new String[5][2];
theString[0] = new String[]{"1", "A, B, C, D"};
theString[1] = new String[]{"2", "A, B, C, D"};
theString[2] = new String[]{"3", "A, B, C, D, E"};
theString[3] = new String[]{"4", "A, B, D, C, E"};
theString[4] = new String[]{"5", "A, B, D, C, E, F"};
int P1 = 0;
int P2 = 0;
int P3 = 0;
int P4 = 0;
for (int iC = 0; iC < theString.length; iC++) {
System.out.println(theString[iC][indexID] + " -> " + theString[iC][indexDesc]);
}
for (int iC = 0; iC < theString.length; iC++) {
int LEX;
String theReference[] = theString[iC][indexDesc].replace(",", ";;").split(";;");
for (int iD = 0; iD < theString.length; iD++) {
if (iC != iD) {
String theCompare[] = theString[iD][1].replace(",", ";;").split(";;");
if (theReference.length == theCompare.length) {
LEX=0;
int theLength = theReference.length;
for (int iE = 0; iE < theLength; iE++) {
if (theReference[iE].equals(theCompare[iE])) {
LEX += 1;
}
}
if (LEX == theLength) {
DETECTORP1[P1] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P1 += 1;
} else {
LEX = 0;
for (int iF = 0; iF < theReference.length; iF++) {
for (int iG = 0; iG < theCompare.length; iG++) {
if (theReference[iF].equals(theCompare[iG])) {
LEX += 1;
break;
}
}
}
if (LEX == theReference.length) {
DETECTORP2[P2] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P2 += 1;
}
}
} else {
LEX = 0;
if (theReference.length > theCompare.length) {
for (int iF = 0; iF < theReference.length; iF++) {
for (int iG = 0; iG < theCompare.length; iG++) {
if (iG == iF) {
if (theReference[iF].equals(theCompare[iF])) {
LEX += 1;
break;
}
}
}
}
if (LEX <= theReference.length && LEX >= theCompare.length) {
DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P3 += 1;
}
} else {
LEX =0;
for (int iF = 0; iF < theCompare.length; iF++) {
for (int iG = 0; iG < theReference.length; iG++) {
if (iG == iF) {
if (theCompare[iF].equals(theReference[iF])) {
LEX += 1;
// System.out.println(theReference[iG] + "==" + theCompare[iG]);
break;
}
}
}
}
if (LEX <= theCompare.length && LEX >= theReference.length) {
DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P3 += 1;
}
}
}
}
}
}
sPrint(DETECTORP1);
sPrint(DETECTORP2);
sPrint(DETECTORP3);
}
}
如何在VBA中执行此操作?
how to do this in VBA?
推荐答案
真的,这取决于你想如何定义严重性级别。这是一种方法,不一定是最好的:使用Levensthein距离。
Really, it depends how you want to define "severity level". Here's one way to do it, not necessarily the best: Use the Levensthein distance.
用单字符属性符号表示您的每个项目,例如
Represent each of your items by a one-character attribute symbol, e.g.
H helmet
K knight
I iron
$ Leather
^ Valros
╔ Plain
¢ Whatever
etc.
然后转换您的物料列表包含代表这些属性的字符序列:
Then convert your Material lists into a string containing sequence of characters representing these attributes:
HIK = helmet,iron,knight
¢H = plain,helmet
然后计算这两个字符串之间的Levenshtein距离。那将是你的严重程度。
Then compute the Levenshtein distance between those two strings. That will be your "severity level".
Debug.Print LevenshteinDistance("HIK","¢H")
'returns 3
Levenshtein距离的两个实现显示在维基百科。实际上你是运气好的:StackOverflow上有人将其移植到VBA 。
Two implementations of the Levenshtein distance are shown in Wikipedia. And indeed you are in luck: someone on StackOverflow ported this to VBA.
在下面的评论部分,你说你不喜欢用单字符符号表示每个可能的属性。很公平;我同意这有点傻。解决方法:事实上,实际上可以调整Levenshtein Distance算法来查找字符串中的每个字符,而不是在数组的每个元素上,并根据此进行比较。我将介绍如何在我的回答中对您的后续问题进行此更改。
In the comments section below, you say you don't like having to represent each of your possible attributes by one-character symbols. Fair enough; I agree this is a bit silly. Workaround: It is, in fact, possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. I show how to make this change in my answer to your follow-up question.
这篇关于潜在重复检测,具有3级严重性级别的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!