R: Finding elements that match each other within a vector


Question

I have a list of addresses. They were entered by various users, so the same address is written in many different ways. For example:

"andheri at weh pump house", "andheri pump house","andheri pump house(mt)","weh andheri pump house","weh andheri pump house et","weh, nr. pump house" 

The vector above has 6 addresses, and almost all of them refer to the same place. I am trying to find the matches between these addresses so that I can group them together and recode them.

I have tried agrep and the stringdist package. With agrep, I am not sure whether I should use each address as a pattern and match it against the rest. With the stringdist package I did the following:

library(stringdist)
nsrpatt <- df$Address
x <- scan(what=character(), text = nsrpatt, sep=",")
x <- x[trimws(x)!= ""]
y <- ave(x, phonetic(x), FUN = function(.x) .x[1])

The above gave me the following error:

In phonetic(x) : soundex encountered 111 non-printable ASCII or non-ASCII
  characters. 

I am not sure whether I should remove those elements from the character vector or convert them to some other format.

With agrep I tried:

for (i in 1:length(nsrpattn)) {
  npat <- agrep(nsrpattn[i], df$address, max=1, v=T)
}

The character vector has around 25,000 elements, so this keeps running and stalls the machine.

How can I efficiently find the closest match for each of the addresses?

Recommended answer

You could run a small cluster analysis on your data.

x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house", 
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house", 
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house", 
       "andheri pump house(mt)")

First, you need a distance matrix.

# Levenshtein distance
e  <- adist(na.omit(tolower(x)))
rownames(e) <- na.omit(x)
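A possible refinement, not part of the original answer: raw Levenshtein distances grow with string length, so long addresses can look "further apart" than short ones. The sketch below, using a subset of the example strings, scales each distance by the length of the longer string so that every value falls between 0 and 1. The names len and e_norm are introduced here for illustration only.

```r
# Assumption: a subset of the example strings from this answer.
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house")

e   <- adist(tolower(x))               # raw Levenshtein distances
len <- nchar(x)                        # string lengths

# Divide each distance by the length of the longer string of the pair,
# giving a relative distance in [0, 1].
e_norm <- e / outer(len, len, pmax)
rownames(e_norm) <- x
round(e_norm, 2)
```

Whether the raw or the scaled matrix clusters better depends on the data, so it is worth plotting both trees before settling on one.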

Then, a cluster analysis can be run.

hc <- hclust(as.dist(e))  # find distance clusters

Derive the best cutpoint, e.g. graphically, and "cut the tree".

plot(hc)

# cut the tree at a specific height (h = 16), i.e. get cluster codes for similar strings
smly <- cutree(hc, h=16)
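If you already know roughly how many distinct addresses to expect, cutree() also accepts a number of clusters (k) instead of a height. The following is a minimal sketch on the same example data; the name by_k is only illustrative, and k = 2 is an assumption that fits this toy vector.

```r
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
       "andheri pump house(mt)")

e  <- adist(tolower(x))        # Levenshtein distance matrix
hc <- hclust(as.dist(e))       # hierarchical clustering (complete linkage by default)

by_k <- cutree(hc, k = 2)      # ask for two groups instead of a cut height
table(by_k)                    # cluster sizes
split(x, by_k)                 # members of each cluster, for a quick visual check
```

Looking at split(x, by_k) is a cheap way to confirm that a chosen k or h groups the strings the way you expect before recoding anything.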

Then you can build a key data frame, with which you can check whether the matches are right.

key <- data.frame(x=na.omit(x), 
                  smly=factor(smly, labels=c("Wall Street", "Andheri Pump House")),
                  row.names=NULL)  # key data frame
key
#                            x               smly
# 1                wall street        Wall Street
# 2                Wall-street        Wall Street
# 3                    Wall ST        Wall Street
# 4         andheri pump house Andheri Pump House
# 5        weh, nr. pump house Andheri Pump House
# 6                 Wallstreet        Wall Street
# 7     weh andheri pump house Andheri Pump House
# 8                Wall Street        Wall Street
# 9  weh andheri pump house et Andheri Pump House
# 10 andheri at weh pump house Andheri Pump House
# 11    andheri pump house(mt) Andheri Pump House
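The labels above are typed by hand, which does not scale to many clusters. One option, sketched here as an assumption rather than as part of the original answer, is to pick a representative string per cluster automatically, for example the member with the smallest total distance to the others in its cluster. The helper medoid and the names rep_idx and key_auto are hypothetical.

```r
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
       "andheri pump house(mt)")

e    <- adist(tolower(x))
smly <- cutree(hclust(as.dist(e)), h = 16)

# Hypothetical helper: index of the member that is closest, in total,
# to all other members of its own cluster.
medoid <- function(idx) idx[which.min(rowSums(e[idx, idx, drop = FALSE]))]

# One representative index per cluster, named by cluster code.
rep_idx <- vapply(split(seq_along(x), smly), medoid, integer(1))

# Key data frame with the representative string as the recoded value.
key_auto <- data.frame(x = x, smly = x[rep_idx[as.character(smly)]])
key_auto
```

The representative is one of the original spellings, so you may still want to tidy it up by hand, but only once per cluster.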

Finally, replace your vector like so:

x <- key$smly
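For a vector of around 25,000 addresses, the full distance matrix has one entry per pair of strings, so it helps to shrink the input first. A minimal sketch, assuming that case, punctuation and repeated spaces carry no meaning in your data: normalise the strings, cluster only the distinct ones, then map the cluster codes back to every row with match(). The toy vector and the names norm, u, cl_u and cl are made up for illustration.

```r
# Toy data (an assumption for this sketch): several spellings of two addresses.
x <- c("wall street", "Wall Street", "WALL  STREET", "Wall-street",
       "andheri pump house", "Andheri Pump House", "andheri pump house(mt)")

# Normalise: lower-case, turn punctuation into spaces, collapse repeated spaces.
norm <- trimws(gsub("\\s+", " ", gsub("[[:punct:]]", " ", tolower(x))))
u    <- unique(norm)                   # cluster only the distinct strings

hc   <- hclust(as.dist(adist(u)))
cl_u <- cutree(hc, k = 2)              # k = 2 is an assumption for this toy data
cl   <- cl_u[match(norm, u)]           # map cluster codes back to every original row

data.frame(x, norm, cl)
```

How much this helps depends on how many exact duplicates remain after normalisation; if the number of distinct strings is still large, the distance matrix may still be too big to hold in memory.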
