How to read two lines from a file and create dynamic keys in a for-loop using Python?

Views: 170 Published: 2017/5/21 16:47:14 Tags: python pandas numpy dictionary defaultdict


Problem description

 
I am trying to run a simple Markov model on the following data.

Say I have data with the following structure:
pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T 
Block M represents data from one set of categories, and so does block S.

The data are strings made by connecting the letters down each column, position by position. So the string value for M1 is A-T-C-G, and every other column in each block is read the same way.

There is also one hybrid block that has two strings, which are read in the same way. I want to find out which string in the hybrid block most likely came from which block (M vs. S).

I am trying to build a Markov model that can help me identify which string in the hybrid block came from which block. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.

I am breaking the problem into different parts to read and mine the data:

Problem level 01:


First I read the first line (the header) and create unique keys for all the columns.
Then I read the 2nd line (pos with value 1) and create another key. In the same line I read the value from hybrid_block and the string values in it. The pipe | is just a separator, so the two strings are at index 0 and 2, as A and C. So all I want from this line is:


defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T'], ..., 'hybrid_block': [['A'], ['C']], ...})

As I progress through the lines, I want to append the string values from each column and finally create:

defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G'], ..., 'hybrid_block': [['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']], ...})
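A minimal sketch of this accumulation step, using only the standard library. It assumes whitespace-separated fields as in the table above; the `StringIO` object stands in for an open file, and the key names `hybrid_0` and `hybrid_1` are hypothetical names chosen here for the two strings on either side of the pipe.

```python
from collections import defaultdict
from io import StringIO

# Stand-in for an open file handle; the rows are copied from the table above.
raw = StringIO("""\
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
""")

columns = defaultdict(list)
header = next(raw).split()[1:]      # column names, without 'pos'

for line in raw:
    fields = line.split()[1:]       # drop the position value
    for name, value in zip(header, fields):
        if name == 'hybrid_block':
            # Hypothetical key names for the two strings around the pipe.
            first, second = value.split('|')
            columns['hybrid_0'].append(first)
            columns['hybrid_1'].append(second)
        else:
            columns[name].append(value)

print(columns['M1'])        # ['A', 'T', 'C', 'G']
print(columns['hybrid_1'])  # ['C', 'A', 'G', 'T']
```

Because `defaultdict(list)` creates the empty list on first access, no key has to be declared before the loop.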

Problem level 02:


I read the data in hybrid_block for the first line, which are A and C.
Now I want to create keys, but unlike the fixed keys above, these keys will be generated while reading the data from hybrid_block.
For the first line, since there is no preceding line, the keys will simply be AgA and CgC, which mean (A given A) and (C given C), and for the values I count the number of A (and likewise C) in block M and block S. So the data will be stored as:


defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'CgC': 1}, 'S': {'AgA': 2, 'CgC': 2}})

As I read through the other lines, I want to create new keys based on the strings in the hybrid block and count the number of times each string is present in block M vs. block S, given the string in the preceding line. That means the keys while reading line 2 would be TgA, which means (T given A), and AgC. For the value of each key I count the number of times I found T in this line after A in the previous line, and the same for AgC.

The defaultdict after reading 3 lines would be:

defaultdict(<class 'dict'>, {'M': [{'AgA': 4, 'TgA': 3, 'CgT': 2}, {'CgC': 1, 'AgC': 0, 'GgA': 0}], 'S': [{'AgA': 2, 'TgA': 1, 'CgT': 0}, {'CgC': 2, 'AgC': 2, 'GgA': 2}]})

I understand this looks too complicated. I went through several dictionary and defaultdict tutorials but couldn't find a way of doing this.

A solution to any part, if not both, is highly appreciated.

Thanks,
Solution
pandas setup

from io import StringIO
import pandas as pd
import numpy as np

txt = """pos   M1  M2  M3  M4  M5  M6  M7  M8  hybrid_block    S1    S2    S3    S4  S5  S6  S7  S8
1     A   T   T   A   A   G   A   C       A|C         C     G     C     T    T   A   G   A
2     T   G   C   T   G   T   T   G       T|A         A     T     A     T    C   A   A   T
3     C   A   A   C   A   G   T   C       C|G         G     A     C     G    C   G   C   G
4     G   T   G   T   A   T   C   T       G|T         C     T     T     T    A   T   C   T """

# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(StringIO(txt), sep=r'\s+', index_col='pos')

df


solution

mostly pandas with some numpy




split hybrid column
prepend identical first row
add with shifted version of self to get 'AgA' type strings




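# Split the hybrid column into H0 and H1 and place it between the M and S blocks
d1 = pd.concat([
    df.filter(like='M'),
    df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
    df.filter(like='S')
], axis=1)

# Prepend a copy of the first row (as position 0) so that row 1 has a predecessor
d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])
# Build '<current>g<previous>' strings such as 'AgA'; dropna removes the padding row
d1 = d1.add('g').add(d1.shift()).dropna()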

d1
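To see what the prepend-and-shift step does, here is a small sketch on a single column. The Series `col` is an illustrative stand-in holding the first hybrid string (A, T, C, G); it is not part of the original answer.

```python
import pandas as pd

# One column of letters, indexed by position 1..4 (illustrative stand-in).
col = pd.Series(list("ATCG"), index=[1, 2, 3, 4])

# Prepend a copy of the first value as position 0, so position 1 has a predecessor.
padded = pd.concat([col.loc[[1]].rename(index={1: 0}), col])

# current + 'g' + previous; position 0 has no predecessor, becomes NaN and is dropped.
pairs = (padded + 'g' + padded.shift()).dropna()

print(pairs.tolist())  # ['AgA', 'TgA', 'CgT', 'GgC']
```

Each resulting string reads as "current given previous", which is why the first position pairs with itself.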


Assign each block to its own variable for convenience

m = d1.filter(like='M')
s = d1.filter(like='S')
h = d1.filter(like='H')

Count how many columns in each block match each hybrid string, then concatenate

mcounts = pd.DataFrame(
    (m.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)
scounts = pd.DataFrame(
    (s.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)

counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
counts
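The comparison inside `mcounts` and `scounts` relies on NumPy broadcasting. The sketch below shows the shapes involved on toy arrays; the arrays `m` and `h` here are made-up two-row examples, built with `dtype=object` to mirror what `DataFrame.values` returns for string columns.

```python
import numpy as np

# Two positions (rows); three block columns and two hybrid columns (toy data).
m = np.array([['AgA', 'TgT', 'AgA'],
              ['TgA', 'GgT', 'TgA']], dtype=object)
h = np.array([['AgA', 'TgT'],
              ['TgA', 'AgC']], dtype=object)

# m[:, :, None] has shape (2, 3, 1) and h[:, None, :] has shape (2, 1, 2);
# broadcasting compares every block column with every hybrid column, row by row.
matches = m[:, :, None] == h[:, None, :]

# Summing over axis 1 (the block columns) leaves one count per row and hybrid column.
counts = matches.sum(1)

print(matches.shape)    # (2, 3, 2)
print(counts.tolist())  # [[2, 1], [2, 0]]
```

Comparisons only happen within the same row, so each count answers "how many columns of this block show the same transition as the hybrid string at this position".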
 

If you really want a dictionary
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))

dict_df = counts.stack().join(h.stack().rename('condition')).unstack()
for pos, row in dict_df.iterrows():
    d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
    d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
    d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
    d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))

dict(d)

{'M': defaultdict(list,
             {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
              'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
 'S': defaultdict(list,
             {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
              'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}
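For comparison, the same counts can be reproduced with a plain for-loop and no pandas, which is closer to the line-by-line reading described in the question. This is only a sketch: the `rows` list is a compact hand-written copy of the table (block M letters, hybrid value, block S letters), and the key names `H0` and `H1` are reused from the output above.

```python
from collections import defaultdict

# (block M letters, hybrid value, block S letters) for positions 1..4.
rows = [
    ("ATTAAGAC", "A|C", "CGCTTAGA"),
    ("TGCTGTTG", "T|A", "ATATCAAT"),
    ("CAACAGTC", "C|G", "GACGCGCG"),
    ("GTGTATCT", "G|T", "CTTTATCT"),
]

counts = defaultdict(lambda: defaultdict(list))
prev = rows[0]  # the first row has no predecessor, so it is paired with itself

for cur in rows:
    blocks = {'M': (prev[0], cur[0]), 'S': (prev[2], cur[2])}
    hybrid_pairs = zip(prev[1].split('|'), cur[1].split('|'))
    for i, (p, c) in enumerate(hybrid_pairs):
        key = f"{c}g{p}"  # dynamic key: current letter given previous letter
        for block, (prev_cols, cur_cols) in blocks.items():
            # Count the columns of this block showing the same transition p -> c.
            n = sum(1 for pc, cc in zip(prev_cols, cur_cols) if (pc, cc) == (p, c))
            counts[block][f"H{i}"].append((key, n))
    prev = cur

print(counts['M']['H0'])  # [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)]
print(counts['S']['H1'])  # [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]
```

The keys are built inside the loop from whatever letters the hybrid block contains, so nothing about them needs to be known in advance.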


                        