Removing "almost duplicates" using SAS or Excel
Question
I am working in SAS with a two-column data set. I want to remove not only the exact duplicates but also the "almost" duplicates. The data looks like this:
| Brand | Product |
|---|---|
| Coca Cola | Coca Cola Light |
| Coca Cola | Coca Cola Lgt |
| Coca Cola | Cocacolalight |
| Coca Cola | Coca Cola Vanila |
| Pepsi | Pepsi Zero |
| Pepsi | Pepsi Zro |
I do not know if it is actually possible, but after removing the "duplicates" I would like the file to look like this:
| Brand | Product |
|---|---|
| Coca Cola | Coca Cola Light |
| Coca Cola | Coca Cola Vanila |
| Pepsi | Pepsi Zero |
I don't mind whether the final table keeps, say, "Pepsi Zero" or "Pepsi Zro", as long as there are no "duplicate" values.
I was wondering whether there is a way to compare, for example, the first 4-5 letters and treat the values as duplicates if those letters match. But of course I am open to suggestions. If this can be done in Excel instead, I would be interested to hear that too.
Recommended answer
I'm going to start by quoting Jeff's answer directly:
SAS has at least a couple of functions for calculating the edit distance between two strings:
Compged, for generalized edit distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm
Complev, for Levenshtein distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm
There's also the spedis() function, which computes an asymmetric spelling distance between two words.
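To get a feel for how these three functions score a near-duplicate, a minimal sketch like the following could be run on one pair from the question. The variable names (a, b, lev, ged, spd) are made up for illustration, and the cut-off for "close enough" is something you would have to tune on your own data.

```sas
data _null_;
    length a b $20;
    a = 'Pepsi Zero';
    b = 'Pepsi Zro';
    lev = complev(a, b);   /* Levenshtein distance */
    ged = compged(a, b);   /* generalized edit distance (operations carry different costs) */
    spd = spedis(a, b);    /* asymmetric: spedis(a, b) can differ from spedis(b, a) */
    put a= b= lev= ged= spd=;   /* write the scores to the log */
run;
```

The three scores are on different scales, so a threshold chosen for one function does not carry over to the others.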
Now those are all great, but my personal favorite is the soundex() function, which lets you test whether two words "sound" the same. It's not going to be 100% correct, but in this case the results work all right.
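As a rough illustration, soundex() keeps the first letter, drops vowels along with H, W and Y, and maps the remaining consonants to digit classes, so abbreviations that only lose vowels tend to land on the same code. The sketch below simply prints the codes for the Coca Cola products from the question; the variable names product and code are made up for this example.

```sas
data _null_;
    length product $20 code $20;
    /* iterate over a hard-coded list of product names */
    do product = 'Coca Cola Light', 'Coca Cola Lgt',
                 'Cocacolalight', 'Coca Cola Vanila';
        code = soundex(product);
        put product= code=;   /* write each name and its code to the log */
    end;
run;
```

Products that print the same code are the ones soundex() would treat as duplicates.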
First, some data:
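/* Sample data: brand in name, product description in alt_name */
data have;
    attrib name length=$20 alt_name length=$20;
    /* dsd with dlm=',' splits on commas only, so embedded blanks stay inside each value */
    infile datalines dsd dlm=',' truncover;
    input name $ alt_name $;
datalines;
Coca Cola ,Coca Cola Light
Coca Cola ,Coca Cola Lgt
Coca Cola ,Cocacolalight
Coca Cola ,Coca Cola Vanila
Pepsi ,Pepsi Zero
Pepsi ,Pepsi Zro
;
run;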
Get every combination of words that we want to compare, and calculate the soundex() values for eyeballing:
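proc sql noprint;
    /* Self-join within each brand, keeping only pairs whose soundex codes agree. */
    /* Each row also pairs with itself, and every match appears in both orders.   */
    create table cartesian as
    select a.name,
           a.alt_name as alt_name1,
           b.alt_name as alt_name2,
           soundex(a.alt_name) as soundex_a,
           soundex(b.alt_name) as soundex_b
    from have a, have b
    where a.name = b.name
      and soundex(a.alt_name) eq soundex(b.alt_name)
    ;
quit;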
Now I'll leave it up to you as an exercise to dedupe the resulting list. But basically this will tell you which words match up. If you get false positives among the matches, just add them to an exception list and transform those particular values manually.
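One possible way to finish the exercise, offered only as a sketch and not as the answer's own method, is to attach the soundex code to each row of HAVE and keep one row per brand and code. The data set names keyed and want and the variable code are assumptions introduced here.

```sas
/* Attach the soundex code to every row of HAVE */
data keyed;
    set have;
    length code $20;
    code = soundex(alt_name);
run;

/* Keep one row per brand and soundex code; nodupkey drops the others */
proc sort data=keyed out=want(drop=code) nodupkey;
    by name code;
run;
```

Which spelling survives depends on the row order within each group, which matches the question's "no preference" requirement. Values on an exception list could have code overwritten by hand in the first step before the sort runs.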