Data.frame 合并和选择 2 个 Data.frames 中常见的值 [英] Data.frame Merge and Selection of values which are common in 2 Data.frames
问题描述
我有 3 个 data.frames
I have 3 data.frames
> head(ON1)
Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1 11302 NM_007377 Aatk 137.48
2 11303 NM_013454 Abca1 118.09
3 11305 NM_007379 Abca2 93.56
4 11306 NM_009592 Abcb7 92.42
5 11308 NM_007380 Abi1 410.73
6 11356 NM_009598 Abl5 149.46
> head(ON2)
Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1 11303 NM_013454 Abca1 86.02
2 11305 NM_007379 Abca2 103.45
3 11306 NM_009592 Abcb7 95.32
4 11308 NM_007380 Abi1 313.85
5 11350 NM_009594 Abl1 116.24
6 11352 NM_009595 Abl2 155.76
> head(ON3)
Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1 11303 NM_013454 Abca1 69.49
2 11305 NM_007379 Abca2 82.02
3 11306 NM_009592 Abcb7 83.16
4 11308 NM_007380 Abi1 306.44
5 11350 NM_009594 Abl1 150.37
6 11355 NM_009599 Abl4 154.93
有些行仅对 1 个 data.frame 唯一(例如 ON1
的 row1
),有些在 2 个 data.frames 中通用(例如 row5
ON2
和 ON3
的代码>,ON1
中不存在这一行),有些在所有数据帧中是通用的(例如 ON1
ON1
的 >row2 即 ON2
和 ON3
中的 row1
).唯一的区别是最后一列的值Tag.Count
Some rows are unique to only 1 data.frame (e.g. row1
of ON1
), some are common in 2 data.frames (e.g. row5
of ON2
and ON3
, this row doesn't exist in ON1
) and some are common in all data.frames (e.g. row2
of ON1
which is row1
in ON2
and ON3
). The only difference is the values in last column Tag.Count
我想以这样一种方式合并所有 3 个 data.frames,以便我只获得最终 data.frame 中的那些行,这些行在至少 2 个 data.frames 和 Tag.Count<的值中是常见的其中较高的/code> 将分配到该行.
I want to merge all 3 data.frames in such a way that I obtain only those rows in my final data.frame which are common among at-least 2 data.frames and the value of Tag.Count
which is higher among them will be assigned to that row.
> head(F)
Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1 11303 NM_013454 Abca1 118.09
2 11305 NM_007379 Abca2 103.45
3 11306 NM_009592 Abcb7 95.32
4 11308 NM_007380 Abi1 410.73
5 11350 NM_009594 Abl1 150.37
在这里您看到 Entrez.ID
= 11302
的行被删除,因为它只在所有 data.frames 中出现一次,并且那些至少在2 个 data.frames 出现在这里,但 Tag.Count
在所有 data.frames 中最大的分数被分配给该行.
Here you see that rows with Entrez.ID
= 11302
is removed as it only appeard once among all the data.frames and those rows which were common in at least 2 data.frames are appearing here but the Tag.Count
score which was maximum among all the data.frames is assigned to that row.
更新
如何取行的平均值.合并以上三个数据集后,我想对 Tag.count
值求和并除以总行数,而不是只保留具有最大 Tag.count
值的一行具有相同的 Entrez.ID
.实际上前 3 列中的值是相同的,差异仅出现在最后一列.示例:
How to take average of rows. After merging above three datasets, instead of just keeping one row which has maximum Tag.count
value, I want to sum the Tag.count
value and divide by total number of rows having the same Entrez.ID
. In fact the values in the first 3 columns are the same, difference only arises in last column.
Example:
> head(d)
Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1 11302 NM_007377 Aatk 137.48
2 11303 NM_013454 Abca1 118.09
7886 11303 NM_013454 Abca1 86.02
15407 11303 NM_013454 Abca1 69.49
3 11305 NM_007379 Abca2 93.56
7887 11305 NM_007379 Abca2 103.45
因此,在这种情况下,由于 3 行具有 Entrez.ID
= 11303,Tag.count
值将相加 (118.09 + 86.02 + 69.49) 并除以按 3 和最终输出将只包含 1 行 Entrez.ID
11303 和 Tag.Count value
= Sum/no.of Rows
So, in this case as 3 rows have Entrez.ID
= 11303, the Tag.count
values will be summed up (118.09 + 86.02 + 69.49) and divided by 3 and final output will contain only 1 row having Entrez.ID
11303 and Tag.Count value
= Sum/no.of Rows
推荐答案
这是组合三个数据框的方法.将所有三个组合后,我们找到出现多次的值.使用该索引,我们可以使用函数 max
聚合数据框:
Here is a way with combining the three data frames. After combining all three, we find the values that appear more than once. With that index we can aggregate the data frame with the function max
:
d <- do.call(rbind, list(ON1, ON2, ON3))
d1 <- do.call(paste, d[1:3])
tbl <- table(d1) > 1L
indx <- d1 %in% names(tbl[tbl])
aggregate(Tag.Count ~., d[indx,], FUN=max)
# Entrez.ID Nearest.Refseq Gene.Name Tag.Count
# 1 11303 NM_013454 Abca1 118.09
# 2 11305 NM_007379 Abca2 103.45
# 3 11306 NM_009592 Abcb7 95.32
# 4 11308 NM_007380 Abi1 410.73
# 5 11350 NM_009594 Abl1 150.37
这篇关于Data.frame 合并和选择 2 个 Data.frames 中常见的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!