Using defaultdict to parse a multi-delimiter file

Problem description

I need to parse a file whose contents look like this:

20  31022550    G   1396    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:2:60.00:33.00:37.00:2:0:0.02:0.02:40.00:2:0.98:126.00:0.98    C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:1391:60.00:36.08:36.97:719:672:0.51:0.01:7.59:719:0.49:126.00:0.50    T:1:60.00:33.00:37.00:0:1:0.37:0.02:47.00:0:0.00:126.00:0.18    N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +A:2:60.00:0.00:37.00:2:0:0.67:0.01:0.00:2:0.65:126.00:0.65
20  31022551    A   1271    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:960:60.00:35.23:36.99:496:464:0.50:0.00:6.38:496:0.49:126.00:0.52 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:13:60.00:35.00:35.92:4:9:0.13:0.02:44.92:4:0.98:126.00:0.37   T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +G:288:60.00:0.00:37.00:171:117:0.57:0.01:8.17:171:0.54:126.00:0.53 +GG:9:60.00:0.00:37.00:5:4:0.71:0.03:23.67:5:0.50:126.00:0.57   +GGG:1:60.00:0.00:37.00:1:0:0.51:0.03:14.00:1:0.24:126.00:0.24

After parsing, I want it to look like this:

20  31022550    G   1396    =   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    A   2   60  33  37  2   0   0.02    0.02    40  2   0.98    126
20  31022550    G   1396    C   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    G   1391    60  36.08   36.97   719 672 0.51    0.01    7.59    719 0.49    126
20  31022550    G   1396    T   1   60  33  37  0   1   0.37    0.02    47  0   0   126
20  31022550    G   1396    N   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    +A  2   60  0   37  2   0   0.67    0.01    0   2   0.65    126
20  31022551    A   1271    =   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    A   960 60  35.23   36.99   496 464 0.5 0   6.38    496 0.49    126
20  31022551    A   1271    C   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    G   13  60  35  35.92   4   9   0.13    0.02    44.92   4   0.98    126
20  31022551    A   1271    T   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    N   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    +G  288 60  0   37  171 117 0.57    0.01    8.17    171 0.54    126
20  31022551    A   1271    +GG 9   60  0   37  5   4   0.71    0.03    23.67   5   0.5 126
20  31022551    A   1271    +GGG    1   60  0   37  1   0   0.51    0.03    14  1   0.24    126

I have more lines, where the value in column[1] increments: 31022550...31022NNN

What I am trying to do is print only certain parts of the file, using column[1] as the key. This is my pseudocode:

from collections import defaultdict
ids = defaultdict(list)

with open('~/file.tsv', 'r') as f:
    for line in f:
        lines = line.strip().split('\t')
        pos = (lines[0:3])
        for ele in lines[4:]:
            # print pos
            p = pos[1].strip()
            base = ele.split(':')[0]
            ids[p] = {
                'pos': pos[0].strip(),
                'base': base,
                'count': ele.split(':')[1],
                '_pos': ele.split(':')[5],
                '_neg': ele.split(':')[6]
                }

for k,v in ids.iteritems():
    print k,v

Output:

31022550 {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551 {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}

I am not sure why I do not see all of the fields that 31022550 holds as key-value pairs.

Recommended answer

You are assigning only the last dictionary to your p key:

ids[p] = {
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
}

This bypasses the defaultdict factory for new keys altogether; each assignment replaces whatever was stored under that key, so only the dictionary for the last element of the line survives. If you want to build a list of dictionaries per key, you need to use list.append():

ids[p].append({
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
})

This looks up the ids[p] value (which is created as an empty list if the key does not yet exist) and then appends your dictionary to the end of that list.
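The difference between assigning and appending can be seen in isolation with a small sketch. It is written for Python 3 (the code in this thread is Python 2), and the two shortened elements and the names overwritten and collected are made up for illustration:

```python
from collections import defaultdict

# Hypothetical, shortened elements that belong to the same position key.
elements = ["A:2:60.00", "G:1391:60.00"]

overwritten = defaultdict(list)
collected = defaultdict(list)

for ele in elements:
    base, count = ele.split(":")[:2]
    record = {"base": base, "count": count}
    # Plain assignment replaces the stored value; the list factory is never used.
    overwritten["31022550"] = record
    # Indexing first triggers the factory (an empty list) for a missing key,
    # and append() then adds the record to that list.
    collected["31022550"].append(record)

print(overwritten["31022550"])      # only the record for the last element
print(len(collected["31022550"]))   # one record per element
```

With assignment the key ends up holding a single dictionary; with append() it holds a list with one dictionary per element.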

I'd also simplify the code somewhat by using the csv module to handle the splitting of the lines:

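import csv
import os
from collections import defaultdict

ids = defaultdict(list)

# open() does not expand '~', so resolve the home directory explicitly
path = os.path.expanduser('~/file.tsv')

# Python 2: the csv module expects the file to be opened in binary mode
with open(path, 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        # the first column is stored as 'pos', the second is the dictionary key
        pos, key = row[:2]
        # every column from index 4 onward holds one colon-separated record
        for elems in row[4:]:
            elems = elems.split(':')
            ids[key].append({
                'pos': pos,
                'base': elems[0],
                'count': elems[1],
                '_pos': elems[5],
                '_neg': elems[6]
            })

# print one line per stored record
for key, rows in ids.iteritems():
    for row in rows:
        print '{}\t{}'.format(key, row)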

This produces:

31022550    {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '2', 'base': 'A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022550    {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '1391', 'base': 'G', 'pos': '20', '_neg': '672', '_pos': '719'}
31022550    {'count': '1', 'base': 'T', 'pos': '20', '_neg': '1', '_pos': '0'}
31022550    {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551    {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '960', 'base': 'A', 'pos': '20', '_neg': '464', '_pos': '496'}
31022551    {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '13', 'base': 'G', 'pos': '20', '_neg': '9', '_pos': '4'}
31022551    {'count': '0', 'base': 'T', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '288', 'base': '+G', 'pos': '20', '_neg': '117', '_pos': '171'}
31022551    {'count': '9', 'base': '+GG', 'pos': '20', '_neg': '4', '_pos': '5'}
31022551    {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}
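If the goal is the flat, tab-separated layout shown in the question rather than a dictionary per record, one possible approach is to repeat the four leading columns in front of each colon-separated element. The following is only a sketch for Python 3: flatten_rows is a hypothetical helper, the sample line is shortened to three fields per element, and every value is kept as the original string (the question's target output also trims the numbers and omits the last field, which this sketch does not attempt).

```python
import csv
import io


def flatten_rows(lines):
    """Yield one flat list per element: the four leading columns plus its fields."""
    reader = csv.reader(lines, delimiter="\t")
    for row in reader:
        if not row:
            continue  # skip blank lines
        prefix = row[:4]
        for element in row[4:]:
            # split the colon-delimited element and prepend the shared columns
            yield prefix + element.split(":")


# Hypothetical, shortened sample line in the same shape as the input file.
sample = io.StringIO("20\t31022550\tG\t1396\t=:0:0.00\tA:2:60.00\n")
for flat in flatten_rows(sample):
    print("\t".join(flat))
```

Because flatten_rows accepts any iterable of lines, an open file object can be passed in place of the io.StringIO sample.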
