Intersection of multiple dataframes in R with respect to rows or samples


Problem description

I have many data frames, and some of the IDs or column names in them are the same. I want to merge all of them into a single data frame, but only for samples that are present in every data frame. In other words, I want the intersection of samples in a new data frame. For example, the first data frame (df1) looks like

       m1      m2     m3
P001   60.00   2.0     1
P002   14.30   2.077   1
P003   29.60   2.077   1.4
P004   10.30   2.077   1.3
P005   79.30   2.077   3.1
P006   79.30   2.077   3.1
P008    9.16   2.077   2.2

and the second data frame (df2) looks like

patid  n1      n2   n3
P001   12.00   2.0   1
P003   17.60   1.7   1
P005   22.30   2.7   1
P006   26.30   1.7   1

Similarly, the third data frame (df3) looks like

patid  k2      k3   k4
P001   8.00   2.0   1.7
P004   9.60   1.7   1.8
P005   7.30   2.7   2.1
P008   6.30   1.7   1.9
P008   6.38   1.78  1.92

I want a fourth data frame that contains only the samples present in all of the data frames above. In this example, those samples would be P001 and P005.

The answer could be something like this

       m1      m2     m3      n1      n2    n3    k2     k3    k4
P001   60.00   2.0     1      12.00   2.0   1     8.00   2.0   1.7
P005   79.30   2.077   3.1    22.30   2.7   1     7.30   2.7   2.1

The longer option would be to use loops, or nested matches such as

matchmicSer <- df2[match(rownames(df1), df2$patid), ]

matchserMic <- df1[match(df2$patid, rownames(df1)), ]

and so on, but I am sure R should have a shortcut. Merge would not be an option because patid in the second and third data frames may contain duplicates, like P008 in the third data frame.

Solution

Based on the example shown, the first dataset ('df1') doesn't have a 'patid' column, so we create it from the row names.

df1$patid <- row.names(df1)

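We can use Reduce with merge after placing the datasets in a 'list' (mget(paste0('df', 1:3))). By default merge keeps only the rows whose 'patid' appears in both inputs (all = FALSE), so applying it successively leaves only the ids present in every dataset.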

Reduce(function(...) merge(..., by='patid'), mget(paste0('df', 1:3)))
#  patid   m1    m2  m3   n1  n2 n3  k2  k3  k4
#1  P001 60.0 2.000 1.0 12.0 2.0  1 8.0 2.0 1.7
#2  P005 79.3 2.077 3.1 22.3 2.7  1 7.3 2.7 2.1
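
As a cross-check on the result above, the set of common ids can also be computed directly with Reduce and intersect and compared with the merged output. This is only a sketch: the data frames are rebuilt here with data.frame() from the tables in the question so the snippet runs on its own, and the names common_ids and merged are illustrative, not part of the original answer.

```r
# Example data rebuilt from the tables in the question
df1 <- data.frame(
  m1 = c(60, 14.3, 29.6, 10.3, 79.3, 79.3, 9.16),
  m2 = c(2, 2.077, 2.077, 2.077, 2.077, 2.077, 2.077),
  m3 = c(1, 1, 1.4, 1.3, 3.1, 3.1, 2.2),
  row.names = c("P001", "P002", "P003", "P004", "P005", "P006", "P008")
)
df2 <- data.frame(
  patid = c("P001", "P003", "P005", "P006"),
  n1 = c(12, 17.6, 22.3, 26.3),
  n2 = c(2, 1.7, 2.7, 1.7),
  n3 = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)
df3 <- data.frame(
  patid = c("P001", "P004", "P005", "P008", "P008"),
  k2 = c(8, 9.6, 7.3, 6.3, 6.38),
  k3 = c(2, 1.7, 2.7, 1.7, 1.78),
  k4 = c(1.7, 1.8, 2.1, 1.9, 1.92),
  stringsAsFactors = FALSE
)
df1$patid <- row.names(df1)

# Ids present in every data frame: intersect() applied pairwise, left to right
common_ids <- Reduce(intersect, list(df1$patid, df2$patid, df3$patid))

# Reduce(f, list(a, b, c)) evaluates f(f(a, b), c), i.e. two successive inner joins
merged <- Reduce(function(x, y) merge(x, y, by = "patid"), list(df1, df2, df3))

print(common_ids)
print(merged$patid)
```

If the two printed vectors agree, the merged data frame contains exactly the samples shared by all inputs.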

Update

Regarding the duplicate patids: in 'df3' there is a duplicate ('P008'), but it is not present in all the datasets, so it does not appear in the output. Suppose instead that we have a 'patid' that is present in all the datasets and is duplicated in one of them; merge then returns one row for each matching combination.

 df3$patid[2] <- 'P001'
 Reduce(function(...) merge(..., by='patid'), mget(paste0('df', 1:3)))
 #  patid   m1    m2  m3   n1  n2 n3  k2  k3  k4
 #1  P001 60.0 2.000 1.0 12.0 2.0  1 8.0 2.0 1.7
 #2  P001 60.0 2.000 1.0 12.0 2.0  1 9.6 1.7 1.8
 #3  P005 79.3 2.077 3.1 22.3 2.7  1 7.3 2.7 2.1
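
If only one row per patid is wanted in the result, one possible approach is to drop repeated patid rows before merging. This is an assumption about the desired outcome, not something the original answer proposes. The sketch below keeps the first occurrence of each id using duplicated(); which row to keep is a choice that depends on the data. The helper name drop_dup_ids is hypothetical, and the data frames are rebuilt inline (with df1 already carrying its patid column and df3 already modified as above) so the snippet runs on its own.

```r
# df1 with patid already added as a column
df1 <- data.frame(
  patid = c("P001", "P002", "P003", "P004", "P005", "P006", "P008"),
  m1 = c(60, 14.3, 29.6, 10.3, 79.3, 79.3, 9.16),
  m2 = c(2, 2.077, 2.077, 2.077, 2.077, 2.077, 2.077),
  m3 = c(1, 1, 1.4, 1.3, 3.1, 3.1, 2.2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  patid = c("P001", "P003", "P005", "P006"),
  n1 = c(12, 17.6, 22.3, 26.3),
  n2 = c(2, 1.7, 2.7, 1.7),
  n3 = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)
# df3 after df3$patid[2] <- 'P001', so P001 occurs twice
df3 <- data.frame(
  patid = c("P001", "P001", "P005", "P008", "P008"),
  k2 = c(8, 9.6, 7.3, 6.3, 6.38),
  k3 = c(2, 1.7, 2.7, 1.7, 1.78),
  k4 = c(1.7, 1.8, 2.1, 1.9, 1.92),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the first row for each patid
drop_dup_ids <- function(d) d[!duplicated(d$patid), , drop = FALSE]

inner_join <- function(x, y) merge(x, y, by = "patid")

# Merging as-is repeats P001 once per duplicate in df3
with_dups <- Reduce(inner_join, list(df1, df2, df3))

# Merging after de-duplication gives one row per shared patid
deduped <- Reduce(inner_join, lapply(list(df1, df2, df3), drop_dup_ids))

print(nrow(with_dups))
print(deduped)
```

Keeping the first occurrence silently discards the other rows for that id, so an aggregate per patid (for example a mean) may be more appropriate when the repeated measurements all matter.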

data

 df1 <- structure(list(m1 = c(60, 14.3, 29.6, 10.3, 79.3, 79.3, 9.16), 
 m2 = c(2, 2.077, 2.077, 2.077, 2.077, 2.077, 2.077), m3 = c(1, 
 1, 1.4, 1.3, 3.1, 3.1, 2.2)), .Names = c("m1", "m2", "m3"
 ), class = "data.frame", row.names = c("P001", "P002", "P003", 
 "P004", "P005", "P006", "P008"))

df2 <-  structure(list(patid = c("P001", "P003", "P005", "P006"),
 n1 = c(12, 17.6, 22.3, 26.3), n2 = c(2, 1.7, 2.7, 1.7), n3 = c(1L,
1L, 1L, 1L)), .Names = c("patid", "n1", "n2", "n3"),
 class = "data.frame", row.names = c(NA, -4L))

df3 <- structure(list(patid = c("P001", "P004", "P005", "P008",
 "P008"), k2 = c(8, 9.6, 7.3, 6.3, 6.38), k3 = c(2, 1.7, 2.7, 1.7,
 1.78), k4 = c(1.7, 1.8, 2.1, 1.9, 1.92)), .Names = c("patid", "k2", 
 "k3", "k4"), class = "data.frame", row.names = c(NA, -5L))
