Intersection of multiple dataframes in R with respect to rows or samples
Problem description
I have many data frames, and some of the IDs or column names in them are the same. I want to merge all of them into a single data frame, but only for samples that are present in every data frame. In other words, I want the intersection of samples in a new data frame. For example, the first data frame (df1) looks like
     m1    m2    m3
P001 60.00 2.0   1
P002 14.30 2.077 1
P003 29.60 2.077 1.4
P004 10.30 2.077 1.3
P005 79.30 2.077 3.1
P006 79.30 2.077 3.1
P008 9.16  2.077 2.2
and the second data frame (df2) looks like
patid n1    n2  n3
P001  12.00 2.0 1
P003  17.60 1.7 1
P005  22.30 2.7 1
P006  26.30 1.7 1
Similarly, the third data frame (df3) looks like
patid k2   k3   k4
P001  8.00 2.0  1.7
P004  9.60 1.7  1.8
P005  7.30 2.7  2.1
P008  6.30 1.7  1.9
P008  6.38 1.78 1.92
I want a fourth data frame that contains the intersection of all the samples. In this example, the samples in that data frame would be P001 and P005.
The answer could be something like this
     m1    m2    m3  n1    n2  n3 k2   k3  k4
P001 60.00 2.0   1   12.00 2.0 1  8.00 2.0 1.7
P005 79.30 2.077 3.1 22.30 2.7 1  7.30 2.7 2.1
The longer option would be to use loops, or nested matches such as
matchmicSer <- df2[match(rownames(df1), df2$patid), ]
matchserMic <- df1[match(df2$patid, rownames(df1)), ]
and so on, but I am sure R has a shortcut. Merge would not be an option because patid may have duplicates in the second and third data frames, like P008 in the third data frame.
Solution
Based on the example shown, the first dataset ('df1') does not have a 'patid' column, so create the column from the row names.
df1$patid <- row.names(df1)
We can use Reduce with merge after placing the datasets in a 'list' (mget(paste0('df', 1:3))).
Reduce(function(...) merge(..., by='patid'), mget(paste0('df', 1:3)))
#  patid   m1    m2  m3   n1  n2 n3  k2  k3  k4
#1  P001 60.0 2.000 1.0 12.0 2.0  1 8.0 2.0 1.7
#2  P005 79.3 2.077 3.1 22.3 2.7  1 7.3 2.7 2.1
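To check which samples survive the merge, the shared IDs can also be computed directly with Reduce and intersect. This is a minimal sketch, not part of the original answer: the data frames are rebuilt with data.frame() so the block runs on its own, and the names `ids`, `common`, and `merged` are illustrative assumptions.

```r
# Rebuild the example data (same values as in the question).
df1 <- data.frame(
  m1 = c(60, 14.3, 29.6, 10.3, 79.3, 79.3, 9.16),
  m2 = c(2, 2.077, 2.077, 2.077, 2.077, 2.077, 2.077),
  m3 = c(1, 1, 1.4, 1.3, 3.1, 3.1, 2.2),
  row.names = c("P001", "P002", "P003", "P004", "P005", "P006", "P008")
)
df2 <- data.frame(
  patid = c("P001", "P003", "P005", "P006"),
  n1 = c(12, 17.6, 22.3, 26.3),
  n2 = c(2, 1.7, 2.7, 1.7),
  n3 = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)
df3 <- data.frame(
  patid = c("P001", "P004", "P005", "P008", "P008"),
  k2 = c(8, 9.6, 7.3, 6.3, 6.38),
  k3 = c(2, 1.7, 2.7, 1.7, 1.78),
  k4 = c(1.7, 1.8, 2.1, 1.9, 1.92),
  stringsAsFactors = FALSE
)

# df1 keeps its IDs in the row names, so copy them into a column first.
df1$patid <- row.names(df1)

# Fold intersect() over the ID vectors: intersect(intersect(a, b), c).
ids <- list(df1$patid, df2$patid, df3$patid)
common <- Reduce(intersect, ids)
common
# [1] "P001" "P005"

# The inner join produced by Reduce + merge contains exactly these IDs.
merged <- Reduce(function(x, y) merge(x, y, by = "patid"), list(df1, df2, df3))
stopifnot(setequal(merged$patid, common))
```

Because merge() defaults to an inner join (all = FALSE), folding it over the list keeps only the IDs present in every data frame, which is why the two results agree.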
Update
Regarding the duplicate 'patid's: in 'df3' there is a duplicate ('P008'), but it is not present in all the datasets (so it is not found in the output). Suppose we have a 'patid' that is present in all the datasets and is duplicated in one of them:
df3$patid[2] <- 'P001'
Reduce(function(...) merge(..., by='patid'), mget(paste0('df', 1:3)))
#  patid   m1    m2  m3   n1  n2 n3  k2  k3  k4
#1  P001 60.0 2.000 1.0 12.0 2.0  1 8.0 2.0 1.7
#2  P001 60.0 2.000 1.0 12.0 2.0  1 9.6 1.7 1.8
#3  P005 79.3 2.077 3.1 22.3 2.7  1 7.3 2.7 2.1
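If one row per sample is wanted instead, the duplicates have to be resolved before merging. Which row to keep is a decision the original answer does not make; the sketch below assumes that keeping the first occurrence of each 'patid' is acceptable. The names `dup_ids`, `df3_first`, and `merged` are illustrative, and only df2 and the modified df3 are rebuilt to keep the block short.

```r
df2 <- data.frame(
  patid = c("P001", "P003", "P005", "P006"),
  n1 = c(12, 17.6, 22.3, 26.3),
  n2 = c(2, 1.7, 2.7, 1.7),
  n3 = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)
# df3 after the modification df3$patid[2] <- 'P001'
df3 <- data.frame(
  patid = c("P001", "P001", "P005", "P008", "P008"),
  k2 = c(8, 9.6, 7.3, 6.3, 6.38),
  k3 = c(2, 1.7, 2.7, 1.7, 1.78),
  k4 = c(1.7, 1.8, 2.1, 1.9, 1.92),
  stringsAsFactors = FALSE
)

# Report the IDs that occur more than once, so the choice is visible.
dup_ids <- unique(df3$patid[duplicated(df3$patid)])
dup_ids
# [1] "P001" "P008"

# Assumption: keep only the first row for each patid.
df3_first <- df3[!duplicated(df3$patid), ]

# The merge now yields one row per shared patid.
merged <- merge(df2, df3_first, by = "patid")
merged$patid
# [1] "P001" "P005"
```

Other policies, such as averaging the duplicated rows, may suit the data better; the point is only that the choice has to be made before the merge.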
data
df1 <- structure(list(m1 = c(60, 14.3, 29.6, 10.3, 79.3, 79.3, 9.16),
m2 = c(2, 2.077, 2.077, 2.077, 2.077, 2.077, 2.077),
m3 = c(1, 1, 1.4, 1.3, 3.1, 3.1, 2.2)),
.Names = c("m1", "m2", "m3"), class = "data.frame",
row.names = c("P001", "P002", "P003", "P004", "P005", "P006", "P008"))
df2 <- structure(list(patid = c("P001", "P003", "P005", "P006"),
n1 = c(12, 17.6, 22.3, 26.3), n2 = c(2, 1.7, 2.7, 1.7),
n3 = c(1L, 1L, 1L, 1L)),
.Names = c("patid", "n1", "n2", "n3"), class = "data.frame",
row.names = c(NA, -4L))
df3 <- structure(list(patid = c("P001", "P004", "P005", "P008", "P008"),
k2 = c(8, 9.6, 7.3, 6.3, 6.38), k3 = c(2, 1.7, 2.7, 1.7, 1.78),
k4 = c(1.7, 1.8, 2.1, 1.9, 1.92)),
.Names = c("patid", "k2", "k3", "k4"), class = "data.frame",
row.names = c(NA, -5L))