How to compare each row of data frame 1 with each row of data frame 2?


Problem description

I have two data frames that look like this:

x=data.frame(Name=c("200003","200260","400826","400863","500710"),Chr=c("chr1","chr1","chr2","chr3","chr3"),Position=c(11880,14415,13000,15000,18000))    
y=data.frame(name=c("geneA","geneB","geneC","geneD","geneE"),chrom=c("chr1","chr1","chr2","chr2","chr3"),Start=c(11873,11878,12000,14361,14361),End=c(14409,14419,14409,16765,19759))

> x
    Name  Chr Position
1 200003 chr1    11880
2 200260 chr1    14415
3 400826 chr2    13000
4 400863 chr3    15000
5 500710 chr3    18000

> y
   name chrom   Start   End
1 geneA  chr1   11873 14409
2 geneB  chr1   11878 14419
3 geneC  chr2   12000 14409
4 geneD  chr2   14361 16765
5 geneE  chr3   14361 19759



I would like to compare x and y and return a data frame or list that pairs each Name in x with every name in y whose chrom equals Chr and whose (Start, End) interval contains Position. For example:

200003  geneA
200003  geneB
200260  geneB
400826  geneC
400863  geneE
500710  geneE

Edit: I was able to get the result using the following code:

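# Pair every x row with every y row on the same chromosome
z <- merge(x, y, by.x = "Chr", by.y = "chrom")
# Keep only pairs whose Position lies inside [Start, End];
# logical indexing also behaves correctly when no row fails the test
z <- z[z$Position >= z$Start & z$Position <= z$End, ]
output <- cbind(as.character(z$Name), as.character(z$name))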

In reality, x and y are large datasets, and merge takes a while to run. Is there a better way to do this?

Recommended answer

@BondedDust seems to have removed his solution. The only issue with it was that the key also needs to include chrom.

Here's a solution using foverlaps from data.table. First, we'll convert the data.frames to data.tables:

require(data.table)
setDT(x)
setDT(y)


Then, since foverlaps works with interval ranges, we'll add a dummy column to x as follows:

x[, Position2 := Position]
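To make the purpose of the dummy column concrete: foverlaps expects a start column and an end column on both sides, so a single position is written as the zero-width interval [Position, Position2]. A minimal sketch, assuming the x from the question rebuilt directly as a data.table:

```r
library(data.table)

# x from the question, built directly as a data.table
x <- data.table(
  Name     = c("200003", "200260", "400826", "400863", "500710"),
  Chr      = c("chr1", "chr1", "chr2", "chr3", "chr3"),
  Position = c(11880, 14415, 13000, 15000, 18000)
)

# := adds the column by reference, so no reassignment is needed
x[, Position2 := Position]

# Each row now describes the interval [Position, Position2] of width zero
print(x)
```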

Now, for each row of x, we'd like to know whether Chr, Position, Position2 falls entirely within any of y's chrom, Start, End intervals. foverlaps requires y to be keyed, so we set the key on y as follows:

setkey(y, chrom, Start, End)
foverlaps(x, y, by.x=c("Chr", "Position", "Position2"))[, list(Name, name)]
#      Name  name
# 1: 200003 geneA
# 2: 200003 geneB
# 3: 200260 geneB
# 4: 400826 geneC
# 5: 400863 geneE
# 6: 500710 geneE
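Put together, the steps above can be run as one self-contained script. This is only a sketch: it rebuilds the question's data directly as data.tables, spells out type = "within" (for a zero-width interval this should select the same rows as the default type = "any"), and cross-checks the result against the merge-and-filter approach from the question.

```r
library(data.table)

# Data copied from the question
x <- data.table(
  Name     = c("200003", "200260", "400826", "400863", "500710"),
  Chr      = c("chr1", "chr1", "chr2", "chr3", "chr3"),
  Position = c(11880, 14415, 13000, 15000, 18000)
)
y <- data.table(
  name  = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr3"),
  Start = c(11873, 11878, 12000, 14361, 14361),
  End   = c(14409, 14419, 14409, 16765, 19759)
)

# Zero-width interval for each position
x[, Position2 := Position]

# y must be keyed; the last two key columns are the interval start and end
setkey(y, chrom, Start, End)

# type = "within": x's interval must lie inside y's interval
res <- foverlaps(x, y,
                 by.x = c("Chr", "Position", "Position2"),
                 type = "within")[, list(Name, name)]

# Cross-check against merge-and-filter from the question
z <- merge(as.data.frame(x), as.data.frame(y), by.x = "Chr", by.y = "chrom")
z <- z[z$Position >= z$Start & z$Position <= z$End, ]
same_pairs <- setequal(paste(res$Name, res$name), paste(z$Name, z$name))
print(res)
```

If same_pairs is TRUE, both approaches agree on this small input. How much faster foverlaps is on large data depends on the data, so it is worth timing both on a realistic sample.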

The columns in your data.frames are inconsistently named and cased ("chrom" vs. "Chr", "name" vs. "Name"). It might be easier to work with consistent names.
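As an illustration of the naming point, here is a sketch that renames the columns with setnames and then expresses the same match as a data.table non-equi join. This is a different technique from the foverlaps call above, and it needs no dummy column. The new names ("chrom" in both tables, "gene" for y's name) are hypothetical choices, and nomatch = NULL assumes a reasonably recent data.table.

```r
library(data.table)

# Data copied from the question
x <- data.table(
  Name     = c("200003", "200260", "400826", "400863", "500710"),
  Chr      = c("chr1", "chr1", "chr2", "chr3", "chr3"),
  Position = c(11880, 14415, 13000, 15000, 18000)
)
y <- data.table(
  name  = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr3"),
  Start = c(11873, 11878, 12000, 14361, 14361),
  End   = c(14409, 14419, 14409, 16765, 19759)
)

# Hypothetical consistent names: one chromosome column name in both tables,
# and "gene" instead of a second "name" that differs only by case
setnames(x, "Chr", "chrom")
setnames(y, "name", "gene")

# Non-equi join: same chromosome and Start <= Position <= End.
# nomatch = NULL drops x rows that fall inside no interval.
res2 <- y[x,
          .(Name, gene),
          on = .(chrom, Start <= Position, End >= Position),
          nomatch = NULL]
print(res2)
```

With a shared column name, the on clause can list chrom once instead of mapping Chr to chrom.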
