How to read two lines from a file and create dynamic keys in a for-loop?

Question

I am trying to run a simple Markov model on the following data.

Say I have data with the following structure:

pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T 

Block M represents data from one set of categories, and so does block S.

The data are strings made by connecting the letters down the position column. So the string value for M1 is A-T-C-G, and likewise for every other column in each block.

There is also one hybrid block that holds two strings, read in the same way. The question is: which string in the hybrid block most likely came from which block (M vs. S)?

I am trying to build a Markov model that can help me identify which string in the hybrid block came from which block. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.

I have broken the problem into parts for reading and mining the data:

Problem level 01:

  • First I read the first line (the header) and create unique keys for all the columns.
  • Then I read the second line (pos with value 1) and create another key. In the same line I read the value from hybrid_block and split it into its strings. The pipe | is just a separator, so the two strings are at index 0 and 2: A and C. So, all I want from this line is:

defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}

As I progress through the lines, I want to append the string values from each column and finally create:

defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
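
The structure described for problem level 01 can be sketched in plain Python roughly as follows. This is only an illustration, not part of the original post: the helper name `read_columns` is hypothetical, `StringIO` stands in for an open file, and the two hybrid strings are stored as a pair of lists under `'hybrid_block'`, which is one possible reading of the layout shown above.

```python
from collections import defaultdict
from io import StringIO

# Sample data from the question; StringIO stands in for an open file handle.
data = StringIO("""pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T
""")


def read_columns(handle):
    """Hypothetical helper: map each column name to the list of its letters."""
    header = handle.readline().split()
    columns = defaultdict(list)
    hybrid = [[], []]  # left and right string of the hybrid block
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        for name, value in zip(header, fields):
            if name == 'pos':
                continue  # the position column is not data
            if name == 'hybrid_block':
                left, right = value.split('|')
                hybrid[0].append(left)
                hybrid[1].append(right)
            else:
                columns[name].append(value)
    columns['hybrid_block'] = hybrid
    return columns


columns = read_columns(data)
print(columns['M1'])            # letters of column M1, top to bottom
print(columns['hybrid_block'])  # the two hybrid strings as lists
```

Keeping the hybrid strings as two separate lists makes it easy to walk both of them position by position in the next step.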

Problem level 02:

  • I read the data in hybrid_block for the first line, which are A and C.

Now I want to create keys, but unlike the fixed keys above, these keys will be generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC, which mean (A given A) and (C given C). For the values I count how many times A (for AgA) and C (for CgC) occur in that line of block M and of block S. So, the data will be stored as:

defaultdict(<class 'dict'>, {'M': {'AgA': [4], 'CgC': [1]}, 'S': {'AgA': 2, 'CgC': 2}}

As I read through the other lines, I want to create new keys based on the strings in the hybrid block, and count the number of times each string is present in the M vs. S block given the string in the preceding line. That means the keys while reading line 2 would be TgA, which means (T given A), and AgC. For the value under TgA I count the number of columns where I found T in this line after A in the previous line, and the same for AgC.

The defaultdict after reading 3 lines would be:

defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': [1], 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}}
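
A plain for-loop version of problem level 02 might look like the sketch below. It is an assumption-laden illustration rather than the asker's code: the `rows` list hard-codes the sample table (M letters, the two hybrid letters, S letters per position), and the result keeps one list of `(key, count)` tuples per hybrid string, labelled `H0` and `H1`, instead of the flat dictionary shown above, so that repeated keys cannot overwrite each other.

```python
from collections import defaultdict

# One tuple per position: (M1..M8 letters, (hybrid left, hybrid right), S1..S8 letters)
rows = [
    ("ATTAAGAC", ("A", "C"), "CGCTTAGA"),
    ("TGCTGTTG", ("T", "A"), "ATATCAAT"),
    ("CAACAGTC", ("C", "G"), "GACGCGCG"),
    ("GTGTATCT", ("G", "T"), "CTTTATCT"),
]

counts = defaultdict(lambda: defaultdict(list))

prev = rows[0]  # the first position has no predecessor, so it is paired with itself
for cur in rows:
    prev_m, prev_h, prev_s = prev
    cur_m, cur_h, cur_s = cur
    for i, name in enumerate(("H0", "H1")):
        # Dynamic key: "<current letter>g<previous letter>", e.g. "TgA" = T given A
        key = f"{cur_h[i]}g{prev_h[i]}"
        for block, p_letters, c_letters in (("M", prev_m, cur_m), ("S", prev_s, cur_s)):
            # Count the columns of this block that show the same transition
            n = sum(p == prev_h[i] and c == cur_h[i]
                    for p, c in zip(p_letters, c_letters))
            counts[block][name].append((key, n))
    prev = cur

for block in ("M", "S"):
    print(block, dict(counts[block]))
```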

I understand this looks too complicated. I went through several dictionary and defaultdict tutorials but couldn't find a way of doing this.

A solution to any part, if not both, is highly appreciated.

Recommended answer

Setup with pandas

from io import StringIO
import pandas as pd
import numpy as np

txt = """pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T """

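# sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas releases
df = pd.read_csv(StringIO(txt), sep=r'\s+', index_col='pos')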

df

  • split the hybrid column
  • prepend an identical copy of the first row
  • add to a shifted version of itself to get 'AgA' type strings

# Split hybrid_block into columns H0 and H1 and place them between the M and S blocks
d1 = pd.concat([
        df.filter(like='M'),
        df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
        df.filter(like='S')
    ], axis=1)

# Prepend a copy of row 1 as row 0 so the first position has a "previous" row
d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])
# current letter + 'g' + previous letter; row 0 has no previous row and is dropped
d1 = d1.add('g').add(d1.shift()).dropna()

d1

Assign the convenient blocks to their own variable names

m = d1.filter(like='M')
s = d1.filter(like='S')
h = d1.filter(like='H')

Count how many columns in each block match each hybrid string, then concatenate

mcounts = pd.DataFrame(
    (m.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)
scounts = pd.DataFrame(
    (s.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)

counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
counts
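
The counting above relies on NumPy broadcasting: adding a trailing axis to the block and a middle axis to the hybrid strings compares every block column against every hybrid column, row by row. A small sketch with made-up arrays (not the real data) shows the shapes involved:

```python
import numpy as np

# Made-up transition strings: 2 positions x 3 block columns
block = np.array([["AgA", "TgT", "AgA"],
                  ["TgA", "GgT", "TgA"]], dtype=object)
# 2 positions x 2 hybrid strings
targets = np.array([["AgA", "TgT"],
                    ["TgA", "AgC"]], dtype=object)

# (2, 3, 1) == (2, 1, 2) broadcasts to (2, 3, 2):
# matches[r, c, h] is True when block column c equals hybrid string h at row r
matches = block[:, :, None] == targets[:, None, :]

# Summing over the block-column axis gives one count per (row, hybrid string)
counts = matches.sum(axis=1)

print(matches.shape)  # (2, 3, 2)
print(counts)
```

The arrays use dtype=object here only to mirror what DataFrame.values returns for string columns.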

If you really want a dictionary

from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

dict_df = counts.stack().join(h.stack().rename('condition')).unstack()
for pos, row in dict_df.iterrows():
    d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
    d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
    d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
    d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))

dict(d)

{'M': defaultdict(list,
             {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
              'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
 'S': defaultdict(list,
             {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
              'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}
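
The answer stops at the counts. One rough way to turn them into the M vs. S decision asked about in the question is sketched below. This is not from the original answer and makes several assumptions: the counts are copied from the dictionary above, each count is treated as "matching columns out of 8", add-one (Laplace) smoothing is used so that a zero count does not produce log(0), and the block with the higher summed log score wins. The function name `log_score` is hypothetical.

```python
import math

# Counts copied from the dictionary output above
counts = {
    'M': {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
          'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]},
    'S': {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
          'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]},
}
N_COLUMNS = 8  # M1..M8 and S1..S8 in the sample data


def log_score(pairs, n_columns=N_COLUMNS):
    """Sum of log smoothed match fractions: (count + 1) / (n_columns + 2)."""
    return sum(math.log((n + 1) / (n_columns + 2)) for _, n in pairs)


assignment = {}
for name in ('H0', 'H1'):
    scores = {block: log_score(counts[block][name]) for block in counts}
    # Pick the block whose columns support this hybrid string best
    assignment[name] = max(scores, key=scores.get)

print(assignment)  # {'H0': 'M', 'H1': 'S'}
```

On the sample data this agrees with the question's expectation that ATCG (H0) came from block M and CAGT (H1) from block S, but the scoring rule itself is only a heuristic.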
