Merging two files by one common set of identifiers with Python


Problem description

I would like to merge two tab-delimited text files that share one common column. I have an 'identifier file' that looks like this (2 columns by 1050 rows):

module 1 gene 1
module 1 gene 2
..
module x gene y

I also have a tab-delimited 'target' text file that looks like this (36 columns by 12000 rows):

gene 1 sample 1 sample 2 etc
gene 2 sample 1 sample 2 etc
..
gene z sample 1 sample 2 etc

I would like to merge the two files on the gene identifier so that each output row carries both the expression values from the target file and the module affiliation from the identifier file. Essentially, I want to take the genes from the identifier file, find them in the target file, and create a new file with module #, gene # and expression values all in one place. Any suggestions would be welcome.

My desired output is: gene ID, tab, module affiliation, tab, then the sample values separated by tabs.

Here is the script I came up with. It does not produce any error messages, but it gives me an empty file.

expression_values = {}
matches = []
with open("identifiers.txt") as ids, open("target.txt") as target:
    for line in target:
        expression_values = {line.split()[0]:line.split()}
    for line in ids:
        block_idents=line.split()
    for gene in expression_values.iterkeys():
        if gene==block_idents[1]:
            matches.append(block_idents[0]+block_idents[1]+expression_values)
csvfile = "modules.csv"  
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in matches:
        writer.writerow([val])  

Thanks!

Recommended answer

These lines of code are not doing what you expect them to do:

for line in target:
    expression_values = {line.split()[0]:line.split()}
for line in ids:
    block_idents=line.split()
for gene in expression_values.iterkeys():    
    if gene==block_idents[1]:
        matches.append(block_idents[0]+block_idents[1]+expression_values)

expression_values and block_idents are rebound on every iteration, so after each loop finishes they hold only the values from the last line read. In other words, the dictionary and the list are not "growing" as more lines are read. The third loop therefore compares a single gene from the target file against a single gene from the identifier file, which almost never match, so nothing is written. Also, TSV files can be parsed with less effort using the csv module.
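To make the difference concrete, here is a minimal sketch contrasting rebinding a dictionary with adding keys to it. The sample lines and the names rebound and grown are made up for illustration; they are not from the original script.

```python
# Hypothetical sample lines standing in for rows of the target file.
lines = ["gene1\t0.5\t0.7", "gene2\t1.2\t0.9"]

# Rebinding: each iteration replaces the whole dictionary,
# so only the last line survives the loop.
rebound = {}
for line in lines:
    rebound = {line.split()[0]: line.split()}

# Updating: each iteration adds one key to the same dictionary.
grown = {}
for line in lines:
    fields = line.split()
    grown[fields[0]] = fields

print(len(rebound))  # 1
print(len(grown))    # 2
```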

I am making a few assumptions with the rough solution suggested here:

  1. The "genes" in the first file are the only "genes" that will appear in the second file.
  2. There could be duplicate "genes" in the first file.

First, build a map from each gene to its modules using the data in the first file:

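import csv
from collections import defaultdict

# first_file: path to the identifier file (one module, gene pair per row)
# Map each gene to the list of modules it belongs to.
gene_map = defaultdict(list)
# Python 3: open in text mode with newline='' for the csv module
with open(first_file, newline='') as file_one:
    csv_reader = csv.reader(file_one, delimiter='\t')
    for row in csv_reader:
        gene_map[row[1]].append(row[0])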
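As an illustration of what the map ends up holding, the following sketch builds it from a few made-up identifier rows kept in memory (io.StringIO stands in for the real file; the module and gene names are hypothetical). A gene listed under two modules collects both, which is why a list is used as the value.

```python
import csv
import io
from collections import defaultdict

# Made-up identifier rows (module, gene); gene2 appears in two modules.
sample = "module1\tgene1\nmodule1\tgene2\nmodule2\tgene2\n"

gene_map = defaultdict(list)
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    gene_map[row[1]].append(row[0])

print(dict(gene_map))
# {'gene1': ['module1'], 'gene2': ['module1', 'module2']}
```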

Then read the second file and write to the output file at the same time:

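# Python 3: open both files in text mode with newline='' for the csv module
with open(sec_file, newline='') as file_two, open(op_file, 'w', newline='') as out_file:
    csv_reader = csv.reader(file_two, delimiter='\t')
    csv_writer = csv.writer(out_file, delimiter='\t')
    for row in csv_reader:
        # Modules for this gene; an empty list if the gene has none
        values = gene_map.get(row[0], [])
        # Output row: gene ID, module(s), then the sample values
        op_list = [row[0]]
        op_list.extend(values)
        op_list.extend(row[1:])
        csv_writer.writerow(op_list)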
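Putting the two steps together, one possible end-to-end sketch is shown below. It is not the answer's exact code: the function name merge_by_gene is hypothetical, target genes that have no module are skipped instead of relying on assumption 1, and multiple modules for one gene are joined with ";" so that every output row has the same number of columns. Both choices are assumptions and may not suit your data.

```python
import csv
from collections import defaultdict


def merge_by_gene(id_path, target_path, out_path):
    """Write gene, module(s), sample values for genes present in both files.

    Returns the number of rows written.
    """
    # Step 1: map each gene to the modules listed for it.
    gene_map = defaultdict(list)
    with open(id_path, newline="") as id_file:
        for row in csv.reader(id_file, delimiter="\t"):
            if len(row) >= 2:  # ignore blank or malformed lines
                gene_map[row[1]].append(row[0])

    # Step 2: stream the target file and write matching rows.
    written = 0
    with open(target_path, newline="") as target, \
            open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for row in csv.reader(target, delimiter="\t"):
            # A membership test does not create keys in a defaultdict.
            if not row or row[0] not in gene_map:
                continue
            # Assumption: several modules are joined with ";".
            modules = ";".join(gene_map[row[0]])
            writer.writerow([row[0], modules] + row[1:])
            written += 1
    return written
```

With the file names from the question, a call might look like merge_by_gene("identifiers.txt", "target.txt", "modules.tsv"), where the output name modules.tsv is only a placeholder.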
