R merge data frames, allow inexact ID matching (e.g. with additional characters, 1234 matches ab1234)
Question
I am trying to deal with some very messy data. I need to merge two large data frames, which contain different kinds of data, by sample ID. The problem is that one table's sample IDs come in many different formats, but most contain the required ID string somewhere in the ID, e.g. sample "1234" in one table has an ID of "ProjectB(1234)" in the other.
I have made a minimal reproducible example.
a<-data.frame(aID=c("1234","4567","6789","3645"),aInfo=c("blue","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
Using merge gets part of the way:
merge(a,b, by.x="aID", by.y="bID", all=TRUE)
       aID     aInfo       bInfo
1     1234      blue        <NA>
2     3645  cerulean        <NA>
3     4567     green       apple
4     6789 goldenrod        kiwi
5   (1234)      <NA>      banana
6    23645      <NA> pomegranate
7 63528973      <NA>      lychee
But the output I would like is basically:
        ID     aInfo       bInfo
1     1234      blue      banana
2     3645  cerulean pomegranate
3     4567     green       apple
4     6789 goldenrod        kiwi
5 63528973      <NA>      lychee
I just wondered if there is a way to incorporate grep into this, or another R-tastic method?
Thanks in advance.
Recommended answer
Doing a merge on a condition is a little tricky. I don't think you can do it with merge as it is written, so you end up having to write a custom function with by. It is pretty inefficient, but then, so is merge. If you have millions of rows, consider data.table. This is how you would do an "inner join", where only rows that match are returned.
# I slightly modified your data to test multiple matches
a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
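# For each group of rows in a sharing one aID, keep the rows of b whose bID
# approximately matches that aID (agrep), then cross-join the two pieces
f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
# Apply f to each aID group and stack the results
do.call(rbind,by(a,a$aID,f))
#         aID     aInfo    bID       bInfo
# 1234.1 1234      blue (1234)      banana
# 1234.2 1234     blue2 (1234)      banana
# 3645   3645  cerulean  23645 pomegranate
# 4567   4567     green   4567       apple
# 6789   6789 goldenrod   6789        kiwi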
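One thing worth keeping in mind about the code above: agrep does approximate matching, so with its default max.distance it can accept an ID that differs by one character. The short sketch below compares it with a literal substring test via grepl(fixed = TRUE). The vector ids is made up purely for illustration and is not part of the question's data.

```r
# Hypothetical IDs for illustration only
ids <- c("(1234)", "1334", "63528973")

# Approximate matching with agrep's default tolerance: for a 4-character
# pattern this allows one edit, so "1334" is accepted as well as "(1234)"
fuzzy <- agrep("1234", ids, value = TRUE)

# Literal substring matching: only IDs that contain "1234" exactly
exact <- ids[grepl("1234", ids, fixed = TRUE)]

print(fuzzy)
print(exact)
```

If near-miss IDs like this can occur in the real data, the literal test may be the safer choice. If typos in the IDs are expected, the tolerance of agrep may be exactly what is wanted.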
Doing a full join is a little trickier. This is one way, though it is still inefficient:
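f<-function(x,b) {
  # Rows of b whose first column approximately matches the ID of this group
  matches<-b[agrep(x[1,1],b[,1]),]
  if (nrow(matches)>0) merge(x,matches,all=TRUE)
  # Ugly... but how else to create a data.frame full of NAs?
  else merge(x,b[NA,][1,],all.x=TRUE)
}
d<-do.call(rbind,by(a,a$aID,f,b))
# Rows of b that did not match anything in a
left.over<-!(b$bID %in% d$bID)
# Group the leftover rows by their own bID values; passing the string 'bID'
# as the grouping only works when exactly one row is left over
rbind(d,do.call(rbind,by(b[left.over,],as.character(b$bID[left.over]),f,a))[names(d)])
#           aID     aInfo      bID       bInfo
# 1234.1   1234      blue   (1234)      banana
# 1234.2   1234     blue2   (1234)      banana
# 3645     3645  cerulean    23645 pomegranate
# 4567     4567     green     4567       apple
# 6789     6789 goldenrod     6789        kiwi
# 63528973 <NA>      <NA> 63528973      lychee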
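Since the question asked about using grep, here is a minimal alternative sketch rather than a replacement for the approach above. It assumes that a literal substring test is good enough and that each bID contains at most one aID (only the first hit is kept). It derives a clean key column for b and then lets an ordinary merge do the full join. The helper match_id is a hypothetical name introduced here for illustration.

```r
a <- data.frame(aID = c("1234", "4567", "6789", "3645"),
                aInfo = c("blue", "green", "goldenrod", "cerulean"),
                stringsAsFactors = FALSE)
b <- data.frame(bID = c("4567", "(1234)", "6789", "23645", "63528973"),
                bInfo = c("apple", "banana", "kiwi", "pomegranate", "lychee"),
                stringsAsFactors = FALSE)

# Hypothetical helper: return the first ID in ids that occurs as a literal
# substring of bid, or NA if none does
match_id <- function(bid, ids) {
  hit <- vapply(ids, function(id) grepl(id, bid, fixed = TRUE), logical(1))
  if (any(hit)) ids[hit][1] else NA_character_
}

# Give every row of b the aID it contains (NA when there is no match)
b$aID <- vapply(b$bID, match_id, character(1), ids = a$aID, USE.NAMES = FALSE)

# A plain full join on the derived key; unmatched b rows keep an NA aID
out <- merge(a, b, by = "aID", all = TRUE)
print(out)
```

The original bID column is kept alongside the derived key, so suspicious matches (a short ID that happens to appear inside a longer, unrelated one) can still be inspected afterwards.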