How to read two lines from a file and create dynamic keys in a for-loop using Python?


Problem description


I am trying to run a simple Markov model on the following data.

Say I have data with the following structure:

pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T 

Block M represents data from one set of categories, and so does block S.

The data are strings formed by joining the letters down the position lines. So the string value for M1 is A-T-C-G, and the same applies to every other column in each block.

There is also one hybrid block that holds two strings, read in the same way. What I want to find out is which string in the hybrid block most likely came from which block (M vs. S).

I am trying to build a Markov model that can help me identify which string in the hybrid block came from which block. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.

I am breaking the problem into different parts to read and mine the data:

Problem level 01:

  • First I read the first line (the header) and create unique keys for all the columns.
  • Then I read the second line (pos with value 1) and create another key. On the same line I read the value from hybrid_block and split out the strings in it. The pipe | is just a separator, so the two strings sit at indexes 0 and 2, here A and C. So all I want from this line is:

defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}

As I progress through the lines, I want to append the string values from each column and finally create:

defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
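The column-collecting step described in Problem level 01 could be sketched in plain Python roughly as follows. This is only an illustration under assumptions: the helper name `read_columns` and the keys `hybrid_0` and `hybrid_1` are made up here, and a small in-memory sample stands in for the real file.

```python
from collections import defaultdict
from io import StringIO

def read_columns(handle):
    """Collect the letters of every column into a list keyed by column name."""
    header = handle.readline().split()
    columns = defaultdict(list)
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        for name, value in zip(header, fields):
            if name == 'hybrid_block':
                # 'A|C' holds two strings; store them under assumed key names
                first, second = value.split('|')
                columns['hybrid_0'].append(first)
                columns['hybrid_1'].append(second)
            elif name != 'pos':
                columns[name].append(value)
    return columns

# A cut-down sample (two M columns, two S columns) standing in for the file
sample = StringIO(
    "pos M1 M2 hybrid_block S1 S2\n"
    "1 A T A|C C G\n"
    "2 T G T|A A T\n"
)
columns = read_columns(sample)
print(columns['M1'])
print(columns['hybrid_1'])
```

With a real file, `open(path)` would replace the `StringIO` object; the rest of the loop stays the same.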

Problem level 02:

  • I read the data in hybrid_block for the first line which are A and C.

  • Now I want to create keys, but unlike the fixed keys, these keys will be generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC, which mean (A given A) and (C given C), and for the values I count the number of A (and of C) in block M and block S. So the data will be stored as:

defaultdict(<class 'dict'>, {'M': {'AgA': [4], 'CgC': [1]}, 'S': {'AgA': 2, 'CgC': 2}}

As I read through the other lines, I want to create new keys based on the strings in the hybrid block and count the number of times that string was present in the M vs. S block, given the string in the preceding line. That means the keys while reading line 2 would be TgA, which means (T given A), and AgC. For the values under these keys I count the number of times I found T in this line after A in the previous line, and the same for AgC.

The defaultdict after reading 3 lines would be:

defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': [1], 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}}
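For comparison with the desired output above, here is a minimal plain-Python sketch of the dynamic-key counting. It assumes the rows have already been parsed into `(M letters, hybrid pair, S letters)` tuples; the function name `transition_counts` and that tuple layout are illustrative choices, not part of the original post. Note that if the same key occurs more than once, its counts accumulate.

```python
from collections import defaultdict

# One tuple per position: (M1..M8 letters, hybrid pair, S1..S8 letters)
rows = [
    ("ATTAAGAC", "A|C", "CGCTTAGA"),
    ("TGCTGTTG", "T|A", "ATATCAAT"),
    ("CAACAGTC", "C|G", "GACGCGCG"),
    ("GTGTATCT", "G|T", "CTTTATCT"),
]

def count_pairs(now_letters, prev_letters, now, before):
    """Count columns holding `now` on this row and `before` on the previous one."""
    return sum(a == now and b == before
               for a, b in zip(now_letters, prev_letters))

def transition_counts(rows):
    counts = {'M': defaultdict(int), 'S': defaultdict(int)}
    prev = None
    for row in rows:
        if prev is None:
            prev = row  # first row has no predecessor, so pair it with itself
        m_now, hybrid, s_now = row
        m_prev, hybrid_prev, s_prev = prev
        for now, before in zip(hybrid.split('|'), hybrid_prev.split('|')):
            key = f'{now}g{before}'  # e.g. 'TgA' means T given A
            counts['M'][key] += count_pairs(m_now, m_prev, now, before)
            counts['S'][key] += count_pairs(s_now, s_prev, now, before)
        prev = row
    return counts

counts = transition_counts(rows)
print(dict(counts['M']))
print(dict(counts['S']))
```

Pairing the first row with itself reproduces the AgA and CgC keys for line 1, since every column then trivially matches its own "previous" letter.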

I understand this looks too complicated. I went through several dictionary and defaultdict tutorials but couldn't find a way of doing this.

A solution to either part, if not both, would be highly appreciated.

Thanks,

Solution

pandas setup

from io import StringIO
import pandas as pd
import numpy as np

txt = """pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T """

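# sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True,
# which is deprecated since pandas 2.2
df = pd.read_csv(StringIO(txt), sep=r'\s+', index_col='pos')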

df

solution

mostly pandas with some numpy


  • split hybrid column
  • prepend identical first row
  • add with shifted version of self to get 'AgA' type strings

d1 = pd.concat([df.loc[[1]].rename(index={1: 0}), df])

d1 = pd.concat([
        df.filter(like='M'),
        df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
        df.filter(like='S')
    ], axis=1)

d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])
d1 = d1.add('g').add(d1.shift()).dropna()

d1
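To see what the prepend-and-shift step does in isolation, the following small sketch applies the same idea to a single column. The Series `col` is a made-up stand-in for the first hybrid string, not a variable from the answer.

```python
import pandas as pd

# Stand-in for one hybrid string, indexed by position
col = pd.Series(['A', 'T', 'C', 'G'], index=[1, 2, 3, 4], name='H0')

# Prepend a copy of the first row under index 0 so that row 1 has a "previous" row
padded = pd.concat([col.loc[[1]].rename(index={1: 0}), col])

# Glue each letter to the letter one row above; row 0 has no predecessor
# and becomes NaN, which dropna() removes again
keys = (padded + 'g' + padded.shift()).dropna()
print(keys.tolist())  # ['AgA', 'TgA', 'CgT', 'GgC']
```

The answer does the same thing on the whole DataFrame at once, so every M, H and S cell ends up holding an 'XgY' style string.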

Assign convenient blocks to their own variable names

m = d1.filter(like='M')
s = d1.filter(like='S')
h = d1.filter(like='H')

Count how many are in each block and concatenate

mcounts = pd.DataFrame(
    (m.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)
scounts = pd.DataFrame(
    (s.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)

counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
counts
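The counting step relies on NumPy broadcasting: adding a trailing axis to the block and a middle axis to the hybrid columns compares every block cell with every hybrid cell on the same row. A toy sketch with invented values (not the question's data) may make the shapes easier to follow:

```python
import numpy as np

# 2 rows x 4 block columns of 'XgY' strings (made-up toy values)
m = np.array([['AgA', 'TgT', 'TgT', 'AgA'],
              ['TgA', 'GgT', 'CgT', 'TgA']], dtype=object)
# 2 rows x 2 hybrid columns
h = np.array([['AgA', 'CgC'],
              ['TgA', 'AgC']], dtype=object)

# (2, 4, 1) against (2, 1, 2) broadcasts to (2, 4, 2):
# matches[r, i, j] is True when block column i equals hybrid column j on row r
matches = m[:, :, None] == h[:, None, :]

# Summing over the block-column axis leaves one count per (row, hybrid column)
toy_counts = matches.sum(axis=1)
print(matches.shape)  # (2, 4, 2)
print(toy_counts)
```

Here `toy_counts` has the same layout as `mcounts` in the answer: rows are positions and columns are the hybrid strings.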

If you really want a dictionary

from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

dict_df = counts.stack().join(h.stack().rename('condition')).unstack()
for pos, row in dict_df.iterrows():
    d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
    d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
    d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
    d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))

dict(d)

{'M': defaultdict(list,
             {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
              'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
 'S': defaultdict(list,
             {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
              'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}
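The answer stops at the counts. One possible way to turn them into a guess about which block each hybrid string came from is sketched below. This is an assumed heuristic, not part of the answer and not a full Markov model: each count is treated as the fraction of the 8 block columns showing that transition, add-one smoothing avoids log(0), and the logs are summed. The names `log_score` and `best_block` are hypothetical.

```python
import math

# Counts copied from the dictionary output above, in row order
counts = {
    'M': {'H0': [4, 3, 2, 1], 'H1': [1, 0, 0, 1]},
    'S': {'H0': [2, 1, 0, 0], 'H1': [2, 2, 2, 3]},
}
N_COLUMNS = 8  # each block has 8 columns in the sample data

def log_score(block_counts, n=N_COLUMNS):
    """Sum of log smoothed fractions; the +1 / +2 smoothing is an assumption."""
    return sum(math.log((c + 1) / (n + 2)) for c in block_counts)

def best_block(hybrid):
    """Return the block whose columns match the hybrid string's transitions best."""
    return max(counts, key=lambda block: log_score(counts[block][hybrid]))

for hybrid in ('H0', 'H1'):
    print(hybrid, '->', best_block(hybrid))
```

On the sample data this picks M for H0 (ATCG) and S for H1 (CAGT), which agrees with what the question expects. A proper conditional probability would divide by how often the preceding letter occurs in the block rather than by the column count.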
