潜在重复检测,具有 3 个严重级别 [英] Potential Duplicates Detection, with 3 Severity Level
问题描述
我想制作一个程序来检测具有 3 个严重级别的潜在重复项.让我们考虑一下我的数据只有两列,但有数千行.第二列中的数据仅用逗号分隔.数据示例:
I wanna make a program that detect a potential duplicates with 3 severity level. let consider my data is only in two column, but with thousands row. data in second column delimited only with comma. data example :
Number | Material
1 | helmet,valros,42
2 | helmet,iron,knight
3 | valros,helmet,42
4 | knight,helmet
5 | valros,helmet,42
6 | plain,helmet
7 | helmet, leather
我的 3 个级别是:
非常高: A,B,C vs A,B,C
very high : A,B,C vs A,B,C
高: A,B,C vs B,C,A
high : A,B,C vs B,C,A
一般: A,B,C vs A,B
so so : A,B,C vs A,B
目前我只能做第一关,我不知道如何做第二和第三关.
so far i just can make the first level, i don't know how to do the 2nd and the 3rd level.
我尝试过的.
Sub duplicates_separation()
Dim duplicate(), i As Long
Dim delrange As Range, cell As Long
Dim shtIn As Worksheet, shtOut As Worksheet
Set shtIn = ThisWorkbook.Sheets("input")
Set shtOut = ThisWorkbook.Sheets("output")
x = 2
y = 1
Set delrange = shtIn.Range("b1:b10000") 'set your range here
ReDim duplicate(0)
'search duplicates in 2nd column
For cell = 1 To delrange.Cells.Count
If Application.CountIf(delrange, delrange(cell)) > 1 Then
ReDim Preserve duplicate(i)
duplicate(i) = delrange(cell).Address
i = i + 1
End If
Next
'print duplicates
For i = UBound(duplicate) To LBound(duplicate) Step -1
shtOut.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value
End Sub
程序检测到的重复项:
3 | valros,helmet,42
5 | valros,helmet,42
我的预期:
Number | Material
1 | helmet,valros,42
3 | valros,helmet,42
5 | valros,helmet,42
4 | knight,helmet
2 | helmet,iron,knight
我有一个检测重复 lv 2 的想法,但我认为它会很复杂,并且会使程序变慢.
i have an idea for detecting duplicates lv 2, but I think it will be so complicated and make the program slow.
- 使用文本到列"命令将第 2 列转换为列
- 从 A 到 Z 对列进行排序(按字母顺序)
- 连接列
- 在检测重复 lv 1 时像做一样计数
有没有办法检测第二个 &第三级重复?
is there a way to detect the 2nd & 3rd level duplicates?
更新
昨天我去朋友家咨询这个问题,但是他的解决方案是JAVA语言..>我不明白
Yesterday I went to friend's house to consult about this problem, but his solution is in JAVA languange.. >which i don't understand
public class ali {
static void sPrint(String[] Printed) {
for (int iC = 0; iC < Printed.length; iC++) {
System.out.print(String.valueOf(Printed[iC]) + " | ");
}
System.out.println();
}
public static void main(String Args[]) {
int defaultLength = 10;
int indexID = 0;
int indexDesc = 1;
String[] DETECTORP1 = new String[defaultLength];
String[] DETECTORP2 = new String[defaultLength];
String[] DETECTORP3 = new String[defaultLength];
String[] DETECTORP4 = new String[defaultLength];
String[][] theString = new String[5][2];
theString[0] = new String[]{"1", "A, B, C, D"};
theString[1] = new String[]{"2", "A, B, C, D"};
theString[2] = new String[]{"3", "A, B, C, D, E"};
theString[3] = new String[]{"4", "A, B, D, C, E"};
theString[4] = new String[]{"5", "A, B, D, C, E, F"};
int P1 = 0;
int P2 = 0;
int P3 = 0;
int P4 = 0;
for (int iC = 0; iC < theString.length; iC++) {
System.out.println(theString[iC][indexID] + " -> " + theString[iC][indexDesc]);
}
for (int iC = 0; iC < theString.length; iC++) {
int LEX;
String theReference[] = theString[iC][indexDesc].replace(",", ";;").split(";;");
for (int iD = 0; iD < theString.length; iD++) {
if (iC != iD) {
String theCompare[] = theString[iD][1].replace(",", ";;").split(";;");
if (theReference.length == theCompare.length) {
LEX=0;
int theLength = theReference.length;
for (int iE = 0; iE < theLength; iE++) {
if (theReference[iE].equals(theCompare[iE])) {
LEX += 1;
}
}
if (LEX == theLength) {
DETECTORP1[P1] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P1 += 1;
} else {
LEX = 0;
for (int iF = 0; iF < theReference.length; iF++) {
for (int iG = 0; iG < theCompare.length; iG++) {
if (theReference[iF].equals(theCompare[iG])) {
LEX += 1;
break;
}
}
}
if (LEX == theReference.length) {
DETECTORP2[P2] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P2 += 1;
}
}
} else {
LEX = 0;
if (theReference.length > theCompare.length) {
for (int iF = 0; iF < theReference.length; iF++) {
for (int iG = 0; iG < theCompare.length; iG++) {
if (iG == iF) {
if (theReference[iF].equals(theCompare[iF])) {
LEX += 1;
break;
}
}
}
}
if (LEX <= theReference.length && LEX >= theCompare.length) {
DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P3 += 1;
}
} else {
LEX =0;
for (int iF = 0; iF < theCompare.length; iF++) {
for (int iG = 0; iG < theReference.length; iG++) {
if (iG == iF) {
if (theCompare[iF].equals(theReference[iF])) {
LEX += 1;
// System.out.println(theReference[iG] + "==" + theCompare[iG]);
break;
}
}
}
}
if (LEX <= theCompare.length && LEX >= theReference.length) {
DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P3 += 1;
}
}
}
}
}
}
sPrint(DETECTORP1);
sPrint(DETECTORP2);
sPrint(DETECTORP3);
}
}
如何在 VBA 中做到这一点?
how to do this in VBA?
推荐答案
真的,这取决于您要如何定义严重性级别".这是一种方法,不一定是最好的:使用 Levensthein 距离.
Really, it depends how you want to define "severity level". Here's one way to do it, not necessarily the best: Use the Levensthein distance.
用一个字符的属性符号来表示您的每件商品,例如
Represent each of your items by a one-character attribute symbol, e.g.
H helmet
K knight
I iron
$ Leather
^ Valros
╔ Plain
¢ Whatever
etc.
然后将您的材质列表转换为包含代表这些属性的字符序列的字符串:
Then convert your Material lists into a string containing sequence of characters representing these attributes:
HIK = helmet,iron,knight
¢H = plain,helmet
然后计算这两个字符串之间的 Levenshtein 距离.那将是您的严重程度".
Then compute the Levenshtein distance between those two strings. That will be your "severity level".
Debug.Print LevenshteinDistance("HIK","¢H")
'returns 3
维基百科中显示了 Levenshtein 距离的两种实现.确实你很幸运:StackOverflow 上的某个人将此移植到了 VBA.
Two implementations of the Levenshtein distance are shown in Wikipedia. And indeed you are in luck: someone on StackOverflow ported this to VBA.
在下面的评论部分,您说您不喜欢用一个字符的符号来表示每个可能的属性.很公平;我同意这有点傻.解决方法:事实上,可以调整 Levenshtein 距离算法,使其不查看字符串中的每个字符,而是查看数组的每个元素,并在此基础上进行比较.我将在 我的回答 中对您的后续问题展示如何进行此更改.
In the comments section below, you say you don't like having to represent each of your possible attributes by one-character symbols. Fair enough; I agree this is a bit silly. Workaround: It is, in fact, possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. I show how to make this change in my answer to your follow-up question.
这篇关于潜在重复检测,具有 3 个严重级别的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!