Removing "almost duplicates" using SAS or Excel

Problem description

I am working in SAS and have a data set with two columns. I want to remove not only the exact duplicates but also the "almost" duplicates. The data looks like this:

    Brand        Product
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Lgt
    Coca Cola    Cocacolalight
    Coca Cola    Coca Cola Vanila
    Pepsi        Pepsi Zero
    Pepsi        Pepsi Zro

I do not know if it is actually possible, but after removing the "duplicates" I would like the file to look like this:

    Brand        Product
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Vanila
    Pepsi        Pepsi Zero

I have no preference whether the final table keeps, for example, "Pepsi Zero" or "Pepsi Zro", as long as there are no "duplicate" values.

I was wondering whether there is a way to compare, say, the first 4-5 letters and treat the values as duplicates if those letters are the same. But of course I am open to suggestions. If there is a way to do this in Excel, I would be interested to hear that too.

Recommended answer

I'm going to start by straight up quoting Jeff's answer:

SAS has at least a couple of functions for calculating the edit distance between two strings:

Compged, for generalized edit distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm

Complev, for Levenshtein distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm

There's also the spedis() function for comparing edit distances.
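As a rough illustration of how these three functions might be called, the sketch below computes each distance for a few product pairs taken from the question. The data set name `pairs` and the variable names `a`, `b`, `lev`, `ged` and `spd` are assumptions made for this example, and any cutoff for calling two names "almost duplicates" would have to be tuned on real data.

```sas
/* Hypothetical example: data set and variable names are illustrative only. */
data pairs;
  infile datalines dsd dlm=',' truncover;
  length a b $20;
  input a $ b $;
  /* strip() removes padding blanks before the comparison */
  lev = complev(strip(a), strip(b));  /* Levenshtein edit distance */
  ged = compged(strip(a), strip(b));  /* generalized edit distance (weighted costs) */
  spd = spedis(strip(a), strip(b));   /* spelling distance; argument order matters */
  datalines;
Coca Cola Light,Coca Cola Lgt
Coca Cola Light,Cocacolalight
Coca Cola Light,Coca Cola Vanila
Pepsi Zero,Pepsi Zro
;
run;

/* Print the pairs next to their distances for inspection. */
proc print data=pairs noobs;
run;
```

Smaller values mean closer strings. The three functions use different scales, so a threshold chosen for one does not carry over to the others.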

Now those are all great, but my personal favorite is the soundex() function, which lets you test whether two words 'sound' the same. It's not going to be 100% correct, but in this case the results work all right.

First, some data:

/* Sample data from the question: brand in NAME, product in ALT_NAME */
data have;
  attrib name length=$20 alt_name length=$20;
  infile datalines dsd dlm=',' truncover;
  input name $ alt_name $;
  datalines;
Coca Cola,Coca Cola Light
Coca Cola,Coca Cola Lgt
Coca Cola,Cocacolalight
Coca Cola,Coca Cola Vanila
Pepsi,Pepsi Zero
Pepsi,Pepsi Zro
;
run;

Pair up every combination of product names within a brand, keep the pairs whose soundex() codes match, and include the codes for eyeballing:

proc sql noprint;
  create table cartesian as
  select a.name,
         a.alt_name as alt_name1,
         b.alt_name as alt_name2,
         soundex(a.alt_name) as soundex_a,
         soundex(b.alt_name) as soundex_b
  /* self-join: each product is paired with every product, itself included */
  from have a, have b
  where a.name = b.name                             /* same brand only */
    and soundex(a.alt_name) eq soundex(b.alt_name)  /* keep phonetic matches */
  ;
quit;

Now I'll leave it up to you as an exercise to dedupe the resulting list. But basically this will tell you which words match up. If you get false positives among the matches, just add them to an exception list and transform those particular values manually.
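One possible way to approach that exercise, offered as a sketch rather than as the answer's intended method, is to skip the pairwise table and use the soundex() code itself as a grouping key, keeping one row per brand and code. It assumes the HAVE data set defined above; the names `keyed`, `want` and `sdx` are hypothetical.

```sas
/* Sketch only: KEYED, WANT and SDX are hypothetical names. */
data keyed;
  set have;
  length sdx $40;
  sdx = soundex(alt_name);  /* phonetic key for each product name */
run;

/* NODUPKEY keeps a single row for each NAME/SDX combination. */
proc sort data=keyed out=want nodupkey;
  by name sdx;
run;
```

Which spelling survives within a group depends on the order of the input rows, which fits the question's stated lack of preference. Any false positives would still need the exception list mentioned above.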
