Merging files based on column coordinates of two files in Python


Question

I have a file called snp.txt that looks like this:

chrom   chromStart  chromEnd    name    strand     observed     
chr1    259         260      rs72477211  +   A/G    single  
chr1    433         433      rs56289060  +   -/C    insertion   
chr1    491         492      rs55998931  +   C/T    single  
chr1    518         519      rs62636508  +   C/G    single  
chr1    582         583      rs58108140  +   A/G    single  

I have a second file, gene.txt:

chrom   chromStart  chromEnd    tf_title    tf_score
chr1    200         270         NFKB1       123
chr1    420         440         IRF4        234
chr1    488         550         BCL3        231
chr1    513         579         TCF12       12
chr1    582         583         BAD170      89

The final output I want is output.txt:

chrom   chromStart  chromEnd    name    strand  observed    tf_title    tf_score
chr1    259         260      rs72477211    +    A/G         NFKB1       123
chr1    433         433      rs56289060    +    -/C         IRF4        234
chr1    491         492      rs55998931    +    C/T         BCL3        231
chr1    518         519      rs62636508    +    C/G         TCF12       12
chr1    582         583      rs58108140    +    A/G         BAD170      89

The key thing I want to do is look at gene.txt and check whether the rs number in the name column of snp.txt falls within a region defined by chrom, chromStart and chromEnd.

For example:

In the first row of snp.txt, the rsid rs72477211 is on chr1 between positions 259 and 260.

In gene.txt, NFKB1 is also on chr1, but between positions 200 and 270. This means that rsid rs72477211 is located in the NFKB1 region, so this is recorded in output.txt.

I am unable to do this with the pandas merge function, and I'm not sure where to even start. The files are extremely large, so a loop would be highly inefficient. Can someone please help? Thanks!

Recommended answer

If it fits in memory, you can merge the two dataframes with an outer join based only on the chrom column, then filter the result by checking that each SNP range lies inside the gene range:

df = snp.merge(gene, how='outer', on='chrom')
df = df[(df.chromStart_x>=df.chromStart_y) & (df.chromEnd_x<=df.chromEnd_y)]

Finally, you can remove the duplicate columns:

# Drop the gene.txt coordinate columns (suffixed _y by merge)
del df['chromStart_y']
del df['chromEnd_y']
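To see the whole approach in one place, here is a minimal end-to-end sketch built on the sample rows from the question. It assumes the files are whitespace-delimited, and it leaves out the trailing unlabeled column of snp.txt (single/insertion), since the desired output does not include it. The names `SNP_TEXT`, `GENE_TEXT` and `merge_by_region` are made up for illustration; with real data you would read snp.txt and gene.txt directly instead of the inline strings.

```python
import io

import pandas as pd

# Sample rows copied from the question (illustrative stand-ins for the real files).
SNP_TEXT = """chrom chromStart chromEnd name strand observed
chr1 259 260 rs72477211 + A/G
chr1 433 433 rs56289060 + -/C
chr1 491 492 rs55998931 + C/T
chr1 518 519 rs62636508 + C/G
chr1 582 583 rs58108140 + A/G
"""

GENE_TEXT = """chrom chromStart chromEnd tf_title tf_score
chr1 200 270 NFKB1 123
chr1 420 440 IRF4 234
chr1 488 550 BCL3 231
chr1 513 579 TCF12 12
chr1 582 583 BAD170 89
"""


def merge_by_region(snp, gene):
    """Hypothetical helper: keep SNP rows whose range lies inside a gene range."""
    # Pair every SNP with every region on the same chromosome.
    merged = snp.merge(gene, how="outer", on="chrom")
    # Range inclusion: the SNP interval must sit within the region interval.
    inside = (merged["chromStart_x"] >= merged["chromStart_y"]) & (
        merged["chromEnd_x"] <= merged["chromEnd_y"]
    )
    return (
        merged[inside]
        .drop(columns=["chromStart_y", "chromEnd_y"])
        .rename(columns={"chromStart_x": "chromStart", "chromEnd_x": "chromEnd"})
        .reset_index(drop=True)
    )


# Assumption: columns are separated by runs of whitespace.
snp = pd.read_csv(io.StringIO(SNP_TEXT), sep=r"\s+")
gene = pd.read_csv(io.StringIO(GENE_TEXT), sep=r"\s+")

result = merge_by_region(snp, gene)
print(result.to_string(index=False))
```

Two things are worth knowing before running this on real data. First, merging on chrom alone builds every SNP/region pair per chromosome before filtering, which is why the answer is conditional on the data fitting in memory. Second, regions can overlap: on the sample data rs62636508 (518-519) lies inside both BCL3 (488-550) and TCF12 (513-579), so this sketch returns one row for each match, whereas the output in the question lists only TCF12 for that SNP.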
