Merging files based on column coordinates of two files in Python
Question
I have a file called snp.txt that looks like this:
chrom chromStart chromEnd name strand observed
chr1 259 260 rs72477211 + A/G single
chr1 433 433 rs56289060 + -/C insertion
chr1 491 492 rs55998931 + C/T single
chr1 518 519 rs62636508 + C/G single
chr1 582 583 rs58108140 + A/G single
I have a second file, gene.txt:
chrom chromStart chromEnd tf_title tf_score
chr1 200 270 NFKB1 123
chr1 420 440 IRF4 234
chr1 488 550 BCL3 231
chr1 513 579 TCF12 12
chr1 582 583 BAD170 89
The final output I want is output.txt:
chrom chromStart chromEnd name strand observed tf_title tf_score
chr1 259 260 rs72477211 + A/G NFKB1 123
chr1 433 433 rs56289060 + -/C IRF4 234
chr1 491 492 rs55998931 + C/T BCL3 231
chr1 518 519 rs62636508 + C/G TCF12 12
chr1 582 583 rs58108140 + A/G BAD170 89
The key thing I want to do is look at gene.txt and check whether the rs number in the name column of snp.txt falls within the region defined by chrom, chromStart and chromEnd.
For example:
In the first row of snp.txt, the rsid rs72477211 is on chr1 between positions 259 and 260.
In gene.txt, NFKB1 is also on chr1, between positions 200 and 270. This means that rs72477211 is located in the NFKB1 region, so this is noted in output.txt.
I have not been able to do this with the pandas merge function and I'm not sure where to start. The files are extremely large, so a loop would be highly inefficient. Can someone please help? Thanks!
Recommended answer
If the data fits in memory, you can merge the two dataframes with an outer join based only on the chrom column, then filter the result with the range-inclusion check:
df = snp.merge(gene, how='outer', on='chrom')
df = df[(df.chromStart_x>=df.chromStart_y) & (df.chromEnd_x<=df.chromEnd_y)]
You can then remove the duplicate columns:
del df['chromStart_y']
del df['chromEnd_y']
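For reference, here is a minimal end-to-end sketch of the same idea, run on the sample rows from the question. A few things in it are assumptions and not part of the original answer: the data is read from in-memory strings instead of snp.txt and gene.txt, the seventh value on each SNP row (single / insertion) has no header in the question and is given the hypothetical name snp_class, explicit suffixes replace the default _x / _y, and an inner join is used in place of the outer join because rows with no partner would be dropped by the range filter anyway.

```python
import io

import pandas as pd

# Sample rows copied from the question, without the header lines.
SNP_TXT = """\
chr1 259 260 rs72477211 + A/G single
chr1 433 433 rs56289060 + -/C insertion
chr1 491 492 rs55998931 + C/T single
chr1 518 519 rs62636508 + C/G single
chr1 582 583 rs58108140 + A/G single
"""

GENE_TXT = """\
chr1 200 270 NFKB1 123
chr1 420 440 IRF4 234
chr1 488 550 BCL3 231
chr1 513 579 TCF12 12
chr1 582 583 BAD170 89
"""

# "snp_class" is a hypothetical name for the unlabeled seventh SNP column.
snp_cols = ["chrom", "chromStart", "chromEnd", "name",
            "strand", "observed", "snp_class"]
gene_cols = ["chrom", "chromStart", "chromEnd", "tf_title", "tf_score"]

snp = pd.read_csv(io.StringIO(SNP_TXT), sep=r"\s+", names=snp_cols)
gene = pd.read_csv(io.StringIO(GENE_TXT), sep=r"\s+", names=gene_cols)

# Pair every SNP with every region on the same chromosome.
merged = snp.merge(gene, how="inner", on="chrom", suffixes=("_snp", "_gene"))

# Keep only the pairs where the SNP interval lies inside the region.
inside = (
    (merged["chromStart_snp"] >= merged["chromStart_gene"])
    & (merged["chromEnd_snp"] <= merged["chromEnd_gene"])
)

# Drop the region coordinates and restore the original SNP column names.
result = (
    merged.loc[inside]
    .drop(columns=["chromStart_gene", "chromEnd_gene"])
    .rename(columns={"chromStart_snp": "chromStart",
                     "chromEnd_snp": "chromEnd"})
    .reset_index(drop=True)
)

print(result.to_string(index=False))
```

Two things are worth noting. On this sample data the result has six rows, not five, because rs62636508 (518-519) lies inside both BCL3 (488-550) and TCF12 (513-579); if only one region per SNP is wanted, an extra rule for picking it would be needed. Also, the intermediate merge holds one row for every SNP/region pair on the same chromosome, which is why the approach only works when that product fits in memory.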