Finding common IDs (intersection) in two dictionaries


Problem description

I wrote a piece of code that is supposed to find the IDs common to column [1] (the second column) of two different files. On my small sample files it works, but on my bigger files it does not, and I cannot figure out why. Can you suggest what is wrong? The exact problem: when my input is, for example, 200, it gives me 90 intersections; if I reduce it to 150, it gives me 110 intersections. Logically the count cannot be higher.

fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')
#fileA.next()

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in set(dictA).intersection(dictB):
    output.write(dictB[key][0]+'\t'+dictA[key][1]+'\t'+dictA[key][4]+'\t'+dictA[key][5]+'\t'+dictA[key][9]+'\t'+dictA[key][10]+'\n')

My file1 is sorted by column [0] and has columns 0-15. To keep it simple, the example here shows only column [0] and column [1]:

contig17    GRMZM2G052619_P03  x x x x x x x x x x x x x x
contig33    AT2G41790.1    x x x x x x x x x x x x x x
contig98    GRMZM5G888620_P01  x x x x x x x x x x x x x x  
contig102   GRMZM5G886789_P02  x x x x x x x x x x x x x x  
contig123   AT3G57470.1    x x x x x x x x x x x x x x

My file2 is not sorted and has columns 0-10. Only column [1] is shown:

y GRMZM2G052619_P03 y y y y y y y y         
y GRMZM5G888620_P01 y y y y y y y y     
y GRMZM5G886789_P02 y y y y y y y y     

My desired output:

contig17    GRMZM2G052619_P03  y y y y
contig98    GRMZM5G888620_P01  y y y y  
contig102   GRMZM5G886789_P02  y y y y  

Solution

Pay attention to this:

output.write(dictB[key][0]+'\t'+dictA[key][1]

This means you print file2's first column and then file1's second column, which does not correspond to your examples and desired output.
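A minimal sketch of a write step that would match the desired output, assuming the first column should come from file1 and the trailing columns from file2. The helper name `format_row` and the choice of file2 columns 2 to 5 are assumptions made for illustration; the question does not say which file2 columns are wanted.

```python
def format_row(key, dictA, dictB, b_columns=(2, 3, 4, 5)):
    """Build one tab-separated output line for an ID present in both dicts.

    dictA[key] and dictB[key] are the split rows from file1 and file2.
    b_columns is a hypothetical choice of file2 columns to copy.
    """
    row_a = dictA[key]
    row_b = dictB[key]
    # file1's first column, then the shared ID, then the chosen file2 columns.
    fields = [row_a[0], key] + [row_b[i] for i in b_columns]
    return '\t'.join(fields) + '\n'


# Tiny in-memory rows standing in for the split lines of each file.
dictA = {'GRMZM2G052619_P03': ['contig17', 'GRMZM2G052619_P03', 'x', 'x']}
dictB = {'GRMZM2G052619_P03': ['y', 'GRMZM2G052619_P03', 'y1', 'y2', 'y3', 'y4']}

line = format_row('GRMZM2G052619_P03', dictA, dictB)
```

In the original loop, `output.write(format_row(key, dictA, dictB))` would take the place of the long concatenation.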

As for the intersection routine, it looks correct, so the problem is probably in your files. Are you sure all keys are unique? And what do you mean by "reduce to 150": do you mean just deleting some lines from the same file?
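One way to check the uniqueness question is to count how often each ID occurs in the second column. A dict keeps only the last row stored under a given key, so duplicated IDs would make the dict smaller than the number of input lines, which could explain counts that do not track the input size. The function name `count_duplicate_ids` and the sample rows below are made up for illustration.

```python
from collections import Counter


def count_duplicate_ids(lines, column=1, sep='\t'):
    """Return {id: occurrences} for IDs that appear more than once."""
    ids = []
    for line in lines:
        fields = line.rstrip('\n').split(sep)
        if len(fields) > column:  # skip blank or short lines
            ids.append(fields[column].strip())
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}


sample = [
    'contig17\tGRMZM2G052619_P03\tx\n',
    'contig33\tAT2G41790.1\tx\n',
    'contig40\tGRMZM2G052619_P03\tx\n',  # same ID as the first row
]
duplicates = count_duplicate_ids(sample)
```

With a real file, the open file object can be passed in place of `sample`. An empty result means every ID in that column is unique.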

It would also be better to replace

for key in set(dictA).intersection(dictB):

with

for key in dictA:
    if key in dictB:

The result is the same, but it should be faster and use less memory, since it avoids building a temporary set.
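A small check, using made-up dictionaries, that both forms visit the same keys. In Python 3, `dictA.keys() & dictB.keys()` is a third way to get the same set of common keys.

```python
dictA = {'id1': ['a', 'id1'], 'id2': ['b', 'id2'], 'id3': ['c', 'id3']}
dictB = {'id2': ['y', 'id2'], 'id3': ['y', 'id3'], 'id4': ['y', 'id4']}

# Original form: builds a temporary set of dictA's keys, then intersects.
via_set = set(dictA).intersection(dictB)

# Suggested form: one membership test per key, no temporary set.
via_loop = [key for key in dictA if key in dictB]

# Dictionary key views support set operators directly (Python 3).
via_views = dictA.keys() & dictB.keys()
```

One practical difference: the loop visits keys in `dictA`'s insertion order (Python 3.7 and later), while the order of a set is arbitrary.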
