R模糊字符串匹配根据匹配的字符串返回特定列 [英] R fuzzy string match to return specific column based on matched string

查看:40
本文介绍了R模糊字符串匹配根据匹配的字符串返回特定列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个大型数据集,一个大约 50 万条记录,另一个大约 7 万条记录.这些数据集有地址.我想匹配较小数据集中的任何地址是否存在于大数据集中.正如您所想象的,地址可以以不同的方式和不同的情况/拼写等书写.除此之外,如果只写到建筑物级别,则可以复制此地址.所以不同的单位有相同的地址.我做了一些研究并找出了可以使用的包 stringdist.

I have two large datasets, one around half a million records and the other one around 70K. These datasets have address. I want to match if any of the address in the smaller data set are present in the large one. As you would imagine address can be written in different ways and in different cases / spellings etc. Apart from this address can be duplicated if written only till the building level. So different flats have the same address. I did some research and figured out the package stringdist that can be used.

我做了一些工作并设法根据距离获得最接近的匹配.但是我无法返回地址匹配的相应列.

I did some work and managed to get the closest match based on distance. However I am not able to return the corresponding columns for which the address match.

下面是一个示例虚拟数据以及我为解释情况而创建的代码

Below is a sample dummy data along with code that I have created to explain the situation

library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)

df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]

fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}

df1[!is.na(Address1)
    , address_match := 
      fn_match(Address1, df2$Address2,3)
    ]

这会返回基于 3 距离的封闭字符串匹配,但是我还想在 df1 中包含来自 df2 的Year"和unique_id"列.这将帮助我知道字符串与 df2 中的哪一行数据匹配.所以最后我想知道 df1 中的每一行 根据指定的距离从 df2 最接近的匹配是什么,并且对于匹配的行有特定的 "Year""unique_id" 来自 df2.

This returns me the closed string match based on distance of 3, however I wanted to also have columns of "Year" and "unique_id" from df2 in df1. This would help me to know with which row of data the string was matched from df2. So finally I want to know for each row in df1 what was the closet match from df2 based on the distance specified and have for the matching rows the specific "Year" and "unique_id" from df2.

我想这与合并(左连接)有关,但我不确定如何合并保留重复项并确保我的行数与 df1(小数据集)中的行数相同.

I guess there is something to do with merge (left join), but I am not sure how I can merge keeping the duplicates and ensuring that I have same number of rows as in df1 (small data set).

任何类型的解决方案都会有所帮助!

Any kind of solution would help!!

推荐答案

你已经完成了 90%...

You are 90% of the way there...

你说你想

知道字符串与df2中的哪一行数据相匹配

know with which row of data the string was matched from df2

您只需要了解您已有的代码即可.见 ?amatch:

You just need to understand the code you already have. See ?amatch:

amatch 返回 xtable 中最接近匹配的位置.当存在多个具有相同最小距离度量的匹配项时,返回第一个.

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

换句话说,amatch 为您提供 df2(这是您的 table)中最接近匹配的行的索引df1 中的每个地址(这是您的 x).您通过返回新地址来过早地包装此索引.

In other words, amatch gives you the index for the row in df2 (which is your table) that is the closest match of each address in df1 (which is your x). You are prematurely wrapping this index by returning the new address instead.

相反,检索索引本身以进行查找为左连接检索 unique_id(如果您确信它确实是唯一 id).

Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.

两种方法的说明:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist=n,useBytes = T) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) | nchar(Address1>0), # I would exclude only on NA_character_ but also empty string, perhaps string of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) | nchar(Address1>0),
    df2_pos := match_pos(Address1, df2$Address2,3) ] 
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]

这篇关于R模糊字符串匹配根据匹配的字符串返回特定列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆