Match two datasets across multiple ‘dirty’ columns in R


Question

I frequently need to match two datasets on multiple matching columns, for two reasons. First, each of these characteristics is ‘dirty’, meaning a single column does not consistently match even when it should (for a truly matching row). Second, the characteristics are not unique (e.g., male and female). Matching like this is useful for linking records across time (pre-test with post-test scores), across data modalities (observed characteristics and lab values), or across multiple datasets for the same research participants.

I need a heuristic that selects the best match. Then I can perform analyses of the two together, as described in this question. Note there are many matching columns, and many IDs, so they must both be specified as a list or vector. As an example, I have created two datasets below to match. In the example, DT1 row 1 (ID 1) is the best match for DT2 row 1 (ID 55), even though only the ‘match4’ column matches; this is because DT2 rows 2 and 3 are better matches for DT1 rows 2 and 3. Bonus: DT1 row 7 equally matches DT2 rows 7 and 8, but has a partial match to DT2 row 7, so ideally that would be selected.

Question: For DT1, select a "best guess" for the matching row from DT2, and use each row from DT2 only once. What is the best way to do this (in an efficient and "best practices" idiomatic way) in R?

My preliminary approach: I created a third data.table with a column of IDs from DT1, called DTmatch. All subsequent columns will be IDs from DT2. For the second column of DTmatch (named after the first ID of DT2), each value should represent the count of matching columns (in this example, 0 to 4). Next, find the highest match values in the matching table unique to each row and column. Lastly, create a final column that specifies the DT2 ID that matches the DT1 ID (column 1 in DTmatch).

library(data.table)
# In this example, the datasets are matched by row number, but the real data is not.
DT1 = data.table(
  ID = 1:7,
  match1 = c("b","b","b","a","a","c",NA),
  match2 = c(7, 8, 9, NA, NA, NA, NA),
  match3 = c(0, 0, 0, "j", 13:15),
  match4 = c(rep("m", 4), rep("f", 3)),
  value1 = 45:51,
  value2 = 100:106
)

DT2 = data.table(
  ID = 55:62,
  match1 = c("b","b",4,"a","a","c","j","j"),
  match2 = c(77, 8:14),
  match3 = c(9:14, 155, 16),
  match4 = c(rep("m", 4), NA, rep("f", 3)),
  value1 = 145:152,
  value2 = 101:108
)

# Fix numeric IDs
DT1[, ID := make.names(ID)]
DT2[, ID := make.names(ID)]

# Make new matching table
DTmatch <- DT1[, .(make.names(ID))]
setnames(DTmatch, old = "V1", new = "DT1ID")

# Start with one ID and one matching column
DT2ID <- DT2$ID[1]
DTmatch[, (DT2ID) := 0]
matchingCols <- c("match1")

# Code for first ID and match1, to be adapted for all IDs and all columns
DTmatch[, (DT2ID) := eval(parse(text=DT2ID)) + as.numeric(DT1[, (matchingCols), with=F] == DT2[ID==DT2ID, matchingCols, with=F][[1]])]

# First attempt at matching doesn't work due to NAs
for (thisID in DT2$ID) {
  DTmatch[, (thisID) := 0]
  for (matchingCol in matchingCols) {
#    if (!is.na(DT1[, matchingCol, with=F]) & !is.na(DT2[ID==thisID, matchingCol, with=F])) {
      DTmatch[, (thisID) := eval(parse(text=thisID)) + as.numeric(DT1[, (matchingCol), with=F] == DT2[ID==thisID, matchingCol, with=F][[1]])]
#    }
  }
}
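
The counting step described above fails because a comparison involving NA returns NA rather than FALSE. The following is a minimal base-R sketch of one way to make it NA-safe, not a definitive implementation. The toy data frames `A` and `B` and the helper `count_matches` are hypothetical names introduced for illustration; the sketch assumes an NA should simply count as "no match".

```r
# Toy data: a cut-down stand-in for the example tables (assumption: plain data frames).
A <- data.frame(ID = c("X1", "X2", "X7"),
                match1 = c("b", "b", NA),
                match4 = c("m", "m", "f"),
                stringsAsFactors = FALSE)
B <- data.frame(ID = c("X55", "X56", "X61"),
                match1 = c("b", "b", "j"),
                match4 = c("m", NA, "f"),
                stringsAsFactors = FALSE)
matching_cols <- c("match1", "match4")

# count_matches() is a hypothetical helper: for every (row of a, row of b) pair
# it counts the columns whose values are equal, treating NA as "no match".
count_matches <- function(a, b, cols) {
  score <- matrix(0L, nrow(a), nrow(b), dimnames = list(a$ID, b$ID))
  for (col in cols) {
    # Compare as character so numeric and character columns can be mixed.
    eq <- outer(as.character(a[[col]]), as.character(b[[col]]), "==")
    eq[is.na(eq)] <- FALSE  # a comparison with NA yields NA; count it as 0
    score <- score + eq
  }
  score
}

scores <- count_matches(A, B, matching_cols)
print(scores)
```

Rows of `scores` correspond to the first table and columns to the second, so the best candidate for a given row is the column with the highest count. Ties still need a separate rule.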

Solution

Perhaps this is an option to start with:

First, create a new column by pasting all values from the match columns together:

# Create a new column based on the matching columns
DT1[, col_join := do.call( paste, c(.SD, sep="")), .SDcols= match1:match4][]
DT2[, col_join := do.call( paste, c(.SD, sep="")), .SDcols= match1:match4][]
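
Two side effects of pasting with `sep=""` are worth checking before relying on the joined key. The base-R sketch below only illustrates how `paste` behaves. The `"|"` separator at the end is an assumed alternative, not part of the original answer.

```r
# Two different rows collapse to the same key when no separator is used.
key_a <- paste("b", 77, 9, "m", sep = "")
key_b <- paste("b", 7, 79, "m", sep = "")
print(identical(key_a, key_b))  # both keys are "b779m"

# A missing value is pasted as the literal text "NA".
key_na <- paste("c", NA, 14, "f", sep = "")
print(key_na)

# Assumed alternative: a separator keeps the column boundaries visible.
key_a_sep <- paste("b", 77, 9, "m", sep = "|")
key_b_sep <- paste("b", 7, 79, "m", sep = "|")
print(identical(key_a_sep, key_b_sep))
```

Whether a separator helps depends on the distance method used, since separator characters also take part in the string comparison.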

Then, using the fuzzyjoin package, you can perform a join based on string distance. Below, the maximum distance is set to 2, so if no matching string is found within a distance of 2, the joined columns will be <NA>.
You can (and should) experiment with the different stringdist methods and the maximum distance.

library(fuzzyjoin)
result <- stringdist_join( DT2, DT1, 
                           by = "col_join", 
                           max_dist = 2, 
                           mode = "left", 
                           distance_col = "string_distance" )

result[,c(1,8,9,16,17)][]
#     ID.x col_join.x ID.y col_join.y string_distance
#  1:   55      b779m    1       b70m               2
#  2:   56      b810m    1       b70m               2
#  3:   56      b810m    2       b80m               1
#  4:   56      b810m    3       b90m               2
#  5:   57      4911m   NA       <NA>              NA
#  6:   58     a1012m   NA       <NA>              NA
#  7:   59    a1113NA   NA       <NA>              NA
#  8:   60     c1214f    6     cNA14f               2
#  9:   61    j13155f   NA       <NA>              NA
# 10:   62     j1416f   NA       <NA>              NA
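
The join output can list several candidates for the same row (ID 56 appears three times, and ID 1 twice), whereas the question asks for each DT2 row to be used only once. One possible way to reduce the candidates to one-to-one pairs is a greedy pass from the smallest distance upward. This is a base-R sketch under that assumption, with a hypothetical helper `greedy_one_to_one`. A greedy pass is simple but is not guaranteed to find the globally best assignment.

```r
# Candidate pairs copied from the join output above (unmatched rows omitted).
pairs <- data.frame(
  DT2_ID = c(55, 56, 56, 56, 60),
  DT1_ID = c(1, 1, 2, 3, 6),
  string_distance = c(2, 2, 1, 2, 2)
)

# greedy_one_to_one() is a hypothetical helper: walk the pairs from smallest to
# largest distance and keep a pair only if neither ID has been used yet.
greedy_one_to_one <- function(p) {
  p <- p[order(p$string_distance), ]  # ties keep their original order
  keep <- logical(nrow(p))
  used_dt1 <- c()
  used_dt2 <- c()
  for (i in seq_len(nrow(p))) {
    if (!(p$DT1_ID[i] %in% used_dt1) && !(p$DT2_ID[i] %in% used_dt2)) {
      keep[i] <- TRUE
      used_dt1 <- c(used_dt1, p$DT1_ID[i])
      used_dt2 <- c(used_dt2, p$DT2_ID[i])
    }
  }
  p[keep, ]
}

matched <- greedy_one_to_one(pairs)
print(matched)
```

Because the closest pair (56 with 2) is fixed first, ID 55 is left to pair with ID 1, which is the behaviour the question describes for the first rows.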

As you can see, you will still have to figure out some things, such as what to do with NA values.
With fuzzy joining there is always (in my opinion) a lot of trial and error involved, and often you will have to accept that the 'perfect answer' is just not out there.
