R merge data frames, allow inexact ID matching (e.g. with additional characters, 1234 matches ab1234)


Problem description

I am trying to deal with some very messy data. I need to merge two large data frames, which contain different kinds of data, by sample ID. The problem is that one table's sample IDs come in many different formats, but most contain the required ID string somewhere within them, e.g. sample "1234" in one table has an ID of "ProjectB(1234)" in the other.

I have made a minimal reproducible example.

a<-data.frame(aID=c("1234","4567","6789","3645"),aInfo=c("blue","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

Using merge gets part of the way:

merge(a,b, by.x="aID", by.y="bID", all=TRUE)

       aID     aInfo       bInfo
1     1234      blue        <NA>
2     3645  cerulean        <NA>
3     4567     green       apple
4     6789 goldenrod        kiwi
5   (1234)      <NA>      banana
6    23645      <NA> pomegranate
7 63528973      <NA>      lychee

but the output I would like is basically:

        ID     aInfo       bInfo
1     1234      blue      banana
2     3645  cerulean pomegranate
3     4567     green       apple
4     6789 goldenrod        kiwi
5 63528973      <NA>      lychee

I just wondered if there was a way to incorporate grep into this or another R-tastic method?

Thanks in advance

Solution

Doing a merge on a condition is a little tricky. I don't think you can do it with merge as it is written, so you end up having to write a custom function and apply it with by. It is pretty inefficient, but then, so is merge. If you have millions of rows, consider data.table. The matching below uses agrep, which does approximate (edit-distance) matching rather than exact substring matching. This is how you would do an "inner join", where only rows that match are returned.

# I slightly modified your data to test multiple matches    
a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
do.call(rbind,by(a,a$aID,f))

#         aID     aInfo    bID       bInfo
# 1234.1 1234      blue (1234)      banana
# 1234.2 1234     blue2 (1234)      banana
# 3645   3645  cerulean  23645 pomegranate
# 4567   4567     green   4567       apple
# 6789   6789 goldenrod   6789        kiwi
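Since the question asks about grep specifically, here is a minimal sketch of the same inner join using literal substring matching instead of agrep. It is not part of the original answer. It assumes the IDs are character strings and that "bID contains aID" is the rule you want; the names pairs, ai, bj and inner are made up for illustration.

```r
a <- data.frame(aID = c("1234", "1234", "4567", "6789", "3645"),
                aInfo = c("blue", "blue2", "green", "goldenrod", "cerulean"),
                stringsAsFactors = FALSE)
b <- data.frame(bID = c("4567", "(1234)", "6789", "23645", "63528973"),
                bInfo = c("apple", "banana", "kiwi", "pomegranate", "lychee"),
                stringsAsFactors = FALSE)

# For every row of a, find the rows of b whose ID contains aID as a literal
# substring (fixed = TRUE switches off regular-expression interpretation).
pairs <- do.call(rbind, lapply(seq_len(nrow(a)), function(i) {
  j <- grep(a$aID[i], b$bID, fixed = TRUE)
  if (length(j) == 0) return(NULL)   # no match: contributes nothing to an inner join
  data.frame(ai = i, bj = j)
}))

# Bind the matched rows side by side (hypothetical result name: inner)
inner <- cbind(a[pairs$ai, , drop = FALSE], b[pairs$bj, , drop = FALSE])
rownames(inner) <- NULL
inner
```

Substring matching is stricter than agrep in one direction (no edits are tolerated) but it still pairs "3645" with "23645", which is what the desired output in the question shows. Whether that is correct for real data depends on how the IDs were constructed.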

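Doing a full join is a little trickier. This is one way, which is still inefficient:

f<-function(x,b) {
  # Rows of b whose ID approximately matches the first ID in this group
  matches<-b[agrep(x[1,1],b[,1]),]
  if (nrow(matches)>0) merge(x,matches,all=TRUE)
  # Ugly... but how else to create a data.frame full of NAs?
  else merge(x,b[NA,][1,],all.x=TRUE)
}
# Left join: every group of a, with its matching rows of b or NAs
d<-do.call(rbind,by(a,a$aID,f,b))
# Rows of b that did not match anything in a
left.over<-!(b$bID %in% d$bID)
# Note: passing 'bID' as the grouping argument appears to work only because
# exactly one row of b is left over here
rbind(d,do.call(rbind,by(b[left.over,],'bID',f,a))[names(d)])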

#         aID     aInfo      bID       bInfo
# 1234.1 1234      blue   (1234)      banana
# 1234.2 1234     blue2   (1234)      banana
# 3645   3645  cerulean    23645 pomegranate
# 4567   4567     green     4567       apple
# 6789   6789 goldenrod     6789        kiwi
# bID    <NA>      <NA> 63528973      lychee
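As an alternative sketch of the full join (again not from the original answer), the pairing can be done on row indices, with unmatched rows from either side appended afterwards. The function name substring_full_join and its arguments are hypothetical, and it assumes literal substring matching is the intended rule.

```r
substring_full_join <- function(a, b, a_key, b_key) {
  a_ids <- as.character(a[[a_key]])
  b_ids <- as.character(b[[b_key]])

  # Collect index pairs (ai, bj) where b's ID contains a's ID as a literal substring
  ai <- integer(0)
  bj <- integer(0)
  for (i in seq_along(a_ids)) {
    j <- grep(a_ids[i], b_ids, fixed = TRUE)
    ai <- c(ai, rep(i, length(j)))
    bj <- c(bj, j)
  }

  # Unmatched rows on each side are paired with NA on the other side
  a_only <- setdiff(seq_along(a_ids), ai)
  b_only <- setdiff(seq_along(b_ids), bj)
  ai <- c(ai, a_only, rep(NA_integer_, length(b_only)))
  bj <- c(bj, rep(NA_integer_, length(a_only)), b_only)

  # Indexing a data frame by NA yields a row of NAs, which fills the gaps
  out <- cbind(a[ai, , drop = FALSE], b[bj, , drop = FALSE])
  rownames(out) <- NULL
  out
}

a <- data.frame(aID = c("1234", "1234", "4567", "6789", "3645"),
                aInfo = c("blue", "blue2", "green", "goldenrod", "cerulean"),
                stringsAsFactors = FALSE)
b <- data.frame(bID = c("4567", "(1234)", "6789", "23645", "63528973"),
                bInfo = c("apple", "banana", "kiwi", "pomegranate", "lychee"),
                stringsAsFactors = FALSE)

res <- substring_full_join(a, b, "aID", "bID")
res
```

This still loops over every row of a, so it is no faster in principle than the by approach. It only avoids building many small data frames and does not depend on how many rows are left over.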
