R fuzzy string match to return specific column based on matched string

Problem description

I have two large datasets, one with around half a million records and the other with around 70K. Both contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, addresses can be written in different ways, with different cases, spellings, etc. In addition, an address can be duplicated if it is written only down to the building level, so different flats share the same address. I did some research and found the stringdist package, which can be used for this.

I did some work and managed to get the closest match based on distance. However, I am not able to return the corresponding columns for the matched address.

Below is some sample dummy data, along with the code I have written, to explain the situation:

library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)

df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]

fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}

df1[!is.na(Address1)
    , address_match := 
      fn_match(Address1, df2$Address2,3)
    ]

This returns the closest string match within a distance of 3. However, I also want the "Year2" and "unique_id" columns from df2 in df1, so that I know which row of df2 each string was matched with. In the end, for each row in df1, I want the closest match from df2 within the specified distance, together with the "Year2" and "unique_id" values of the matching df2 row.

I guess this has something to do with merge (a left join), but I am not sure how to merge while keeping the duplicates and ensuring that I end up with the same number of rows as in df1 (the small dataset).

Any kind of solution would help!

Recommended answer

You are 90% of the way there...

You say you want to

know with which row of data the string was matched from df2

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index of the row in df2 (which is your table) that is the closest match for each address in df1 (which is your x). You are consuming this index prematurely by using it to return the matched address instead.
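To make the difference between the position and the matched string concrete, here is a minimal sketch. The vectors `x` and `tbl` are made-up toy data, not the asker's datasets. Only `amatch` with the same `method` and `maxDist` arguments as in the question is used.

```r
library(stringdist)

# Hypothetical toy vectors (assumption: not taken from the real data)
x   <- c("45-B, GALI NO5, XYZ", "HECTIC, 99 STREET, PQR")
tbl <- c("abc, pqr, xyz", "45B, GALI NO 5, XYZ")

# amatch returns integer positions into `tbl`, NA where nothing is within maxDist
pos <- amatch(x, tbl, method = "dl", maxDist = 3)
print(pos)       # the row index you want to keep

# Indexing with the positions gives the matched strings,
# which is all that the question's fn_match() returned
print(tbl[pos])
```

Keeping `pos` rather than `tbl[pos]` is what allows any other column of the lookup table to be retrieved afterwards.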

Instead, retrieve either the index itself for a lookup, or the unique_id (if you are confident that it is truly a unique id) for a left join.

Illustration of both approaches:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) & nchar(Address1) > 0, # exclude not only NA_character_ but also the empty string, perhaps also strings of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2, 3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2, 3) ]
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]
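The question also asks whether duplicates survive and whether the result keeps the same number of rows as df1. The following sketch checks this for the index-lookup approach on a small scale. The tables `small` and `large` and their column names are hypothetical stand-ins for df1 and df2.

```r
library(data.table)
library(stringdist)

# Hypothetical toy tables (assumption: stand-ins for df1 and df2)
small <- data.table(addr = c("45-B, GALI NO5, XYZ",
                             "HECTIC, 99 STREET, PQR",
                             "45-B, GALI NO5, XYZ"))   # duplicate of row 1
large <- data.table(addr2 = c("abc, pqr, xyz", "45B, GALI NO 5, XYZ"),
                    year2 = 2001:2002)

# Position of the closest match in `large` for every row of `small`
small[, pos := amatch(addr, large$addr2, method = "dl", maxDist = 3)]

# Plain vector indexing: an NA position simply yields an NA value
small[, year2 := large$year2[pos]]

print(small)
stopifnot(nrow(small) == 3L)   # row count and row order are unchanged
```

Because the lookup assigns by reference into the existing rows, duplicated addresses each receive their own match and unmatched rows stay in place with NA. A merge, by contrast, sorts its result on the key by default, so the row order may differ from df1 even though the row count is preserved with all.x=TRUE.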
