How to conditionally count and record if a sample appears in rows of another dataset?

Problem description

I have a genetic dataset of IDs (dataset1) and a dataset of IDs which interact with each other (dataset2). I am trying to count IDs in dataset1 which appear in either of 2 interaction columns in dataset2 and also record which are the interacting/matching IDs in a 3rd column.

Dataset1:

ID
1
2
3

Dataset2:

Interactor1    Interactor2
1                  5
2                  3
1                  10

Output:

ID   InteractionCount    Interactors
1            2               5, 10
2            1                3
3            1                2

So the output contains all IDs of dataset1 and a count of how often each of those IDs appears in either column 1 or 2 of dataset2. If an ID does appear, the output also stores which ID numbers in dataset2 it interacts with.

I have a biology background, so I have guessed at how to approach this. So far I've managed to use merge() and setDT(mergeddata)[, .N, by=ID] to try to count the dataset1 IDs which appear in dataset2, but I'm not sure if this is the right approach for also creating the column that stores the interacting IDs. Any help on possible functions which can store matched IDs in a 3rd column would be appreciated.

Input data:

dput(dataset1)
structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

dput(dataset2)
structure(list(Interactor1 = c(1L, 2L, 1L), Interactor2 = c(5L, 
3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))

Solution

Here is an option using data.table:

x <- names(DT2)
cols <- c("InteractionCount", "Interactors")

#ensure that the pairs are ordered for each row and there are no duplicated pairs
DT2 <- setkeyv(unique(DT2[,(x) := .(pmin(i1, i2), pmax(i1, i2))]), x)

#for each ID find the neighbours linked to it
neighbours <- rbindlist(list(DT2[, .(.N, toString(i2)), i1],
    DT2[, .(.N, toString(i1)), i2]), use.names=FALSE)
setnames(neighbours, names(neighbours), c("ID", cols))

#update DT1 (dataset1 in the question) using the above data
DT1[, (cols) := neighbours[DT1, on=.(ID), mget(cols)]]

Output for DT1:

   ID InteractionCount Interactors
1:  1                2       5, 10
2:  2                1           3
3:  3                1           2

Data (run this first; the columns of dataset2 are renamed to i1 and i2 for brevity):

library(data.table)
# built with data.table() rather than the dput() structure, so that := can add
# columns by reference without an internal self-reference warning
DT1 <- data.table(ID = 1:3)
DT2 <- data.table(i1 = c(1L, 2L, 1L), i2 = c(5L, 3L, 10L))
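
As far as I can tell, the approach above builds one neighbours row per ID per column, so an ID that is the smaller member of one pair and the larger member of another (for example with the pairs 1-2 and 2-3) would get two rows, and the final update would then fail on a length mismatch. The following is a minimal alternative sketch, not the answerer's method: it stacks every pair in both directions first and then summarises once per ID. The names long, partner, summary_dt and result are made up for this illustration, and filling unmatched IDs with a count of 0 is an assumption about what the asker wants.

```r
library(data.table)

DT1 <- data.table(ID = 1:3)
DT2 <- data.table(i1 = c(1L, 2L, 1L), i2 = c(5L, 3L, 10L))

# Stack each pair in both directions so that every ID shows up in the ID
# column once per interaction, whichever original column it came from.
# unique() drops pairs that were listed more than once (in either order).
long <- unique(rbind(
  DT2[, .(ID = i1, partner = i2)],
  DT2[, .(ID = i2, partner = i1)]
))

# One row per ID: the number of distinct partners and a comma-separated list.
summary_dt <- long[, .(InteractionCount = .N,
                       Interactors = toString(sort(partner))),
                   by = ID]

# Right join onto DT1 so every ID from DT1 is kept, then set the count to 0
# for IDs that never appear in DT2 (assumed behaviour, see above).
result <- summary_dt[DT1, on = .(ID)]
result[is.na(InteractionCount), InteractionCount := 0L]

print(result)
```

On the sample data this gives the same three rows as the output shown earlier. IDs from DT1 with no interactions keep NA in Interactors.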
