Removing "almost duplicates" using SAS or Excel

Problem description

I am working in SAS and have a dataset with two columns. I want to remove not only the exact duplicates but also the "almost" duplicates. The data looks like this:

**Brand        Product**
Coca Cola    Coca Cola Light
Coca Cola    Coca Cola Lgt
Coca Cola    Cocacolalight
Coca Cola    Coca Cola Vanila
  Pepsi       Pepsi Zero
  Pepsi       Pepsi Zro

I do not know if this is actually possible, but after removing the "duplicates" I would like the file to look like this:

    **Brand        Product**
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Vanila
      Pepsi       Pepsi Zero

I don't have a preference if the final table will have e.g. "Pepsi Zero" or "Pepsi Zro" as long as there are no "duplicate" values.

I was wondering whether there is a way to compare, e.g., the first 4-5 letters and treat two values as duplicates if those letters match. But of course I am open to suggestions. If this can be done in Excel, I would be interested to hear that too.

Recommended answer

I'm going to start by straight-up quoting Jeff's answer:

SAS has at least a couple functions for calculating edit distance between two strings:

Compged, for general edit distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm

Complev, for Levenshtein distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm

There's also the spedis() function for comparing edit distances.
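
As a rough illustration of how these three functions are called, the sketch below compares one pair of values from the question inside a DATA step. The variable names (a, b, ged, lev, sped) are made up for this example, and no cutoff is suggested: what counts as "close enough" is something you would have to tune on your own data.

```sas
/* Minimal sketch: compare one pair of strings with the three functions.  */
/* Variable names are hypothetical; results are written to the SAS log.   */
data _null_;
  a = 'Pepsi Zero';
  b = 'Pepsi Zro';
  ged  = compged(a, b);  /* generalized edit distance                     */
  lev  = complev(a, b);  /* Levenshtein distance; one deleted letter here */
  sped = spedis(a, b);   /* spelling distance; not symmetric, so          */
                         /* spedis(a, b) can differ from spedis(b, a)     */
  put ged= lev= sped=;
run;
```

A smaller value means the two strings are closer. Note that the three scores are on different scales, so a threshold chosen for one function does not carry over to the others.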

Now those are all great, but my personal favorite is the soundex() function which will allow you to test if two words 'sound' the same. It's not going to be 100% correct but in this case the results work alright.
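
To get a feel for what soundex() returns before using it in a join, a small sketch like the following prints the code for each product name from the question. The variable names product and code are assumptions made for this example only.

```sas
/* Minimal sketch: print the soundex code of each product name to the log. */
data _null_;
  length product $20 code $20;
  do product = 'Coca Cola Light', 'Coca Cola Lgt', 'Cocacolalight',
               'Coca Cola Vanila', 'Pepsi Zero', 'Pepsi Zro';
    code = soundex(product);  /* phonetic code derived from the letters */
    put product= code=;
  end;
run;
```

Names that print the same code are the ones the soundex comparison will treat as matches, so this is a quick way to spot likely false positives up front.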

First, some data:

Data HAVE;
  attrib name length=$20 alt_name length=$20;
  infile datalines dsd dlm=',' truncover;
  input name $ alt_name $;
  datalines;
Coca Cola    ,Coca Cola Light
Coca Cola    ,Coca Cola Lgt
Coca Cola    ,Cocacolalight
Coca Cola    ,Coca Cola Vanila
Pepsi        ,Pepsi Zero
Pepsi        ,Pepsi Zro
;
Run;

Get every combination of words that we want to compare, and calculate the soundex()s for eyeballing:

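proc sql noprint;
  create table cartesian as
  select a.name,
         a.alt_name as alt_name1,
         b.alt_name as alt_name2,
         soundex(a.alt_name) as soundex_a,
         soundex(b.alt_name) as soundex_b
  /* self-join: pair every product with every product of the same brand */
  from have a, have b
  where a.name = b.name
    /* keep only pairs whose soundex codes agree (each row also matches itself) */
    and soundex(a.alt_name) eq soundex(b.alt_name)
  ;
quit;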

Now I'll leave it up to you as an exercise to dedupe the resulting list. But basically this will tell you which words match up. If you get false positives among the matches, just add them to an exception list to manually transform those particular values.
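
One possible way to do that dedupe step, sketched under the assumption that near-duplicates within a brand share a soundex code, is to store the code as a column and keep one row per brand and code. This is not part of the quoted answer; the dataset names coded and want and the variable sdx are hypothetical.

```sas
/* Minimal sketch: keep one product per brand and soundex code. */
/* Assumes the HAVE dataset created earlier in this answer.     */
data coded;
  set have;
  sdx = soundex(alt_name);  /* phonetic key used for deduplication */
run;

/* NODUPKEY keeps a single observation for each name/sdx combination */
proc sort data=coded out=want nodupkey;
  by name sdx;
run;
```

Which spelling survives in each group is not controlled here, which fits the question's stated lack of preference. The sdx column is left in WANT so the groupings can be checked by eye, and any false positives still need the exception list mentioned above.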
