How to read two lines from a file and create dynamic keys in a for-loop using Python?
I am trying to build a simple Markov model from the data below.
Say I have data with the following structure:
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
Block M represents data from one set of categories, and so does block S.
The data are strings made by connecting the letters down each column, one letter per position. So the string value for M1 is A-T-C-G, and likewise for every other column.
There is also one hybrid block that holds two strings, read in the same way. The question is: which string in the hybrid block most likely came from which block (M vs. S)?
I am trying to build a Markov model that can help me identify which string in the hybrid block came from which block. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.
I am breaking the problem into different parts to read and mine the data:
Problem level 01:
- First I read the first line (the header) and create unique keys for all the columns.
- Then I read the 2nd line (pos with value 1) and create another key. In the same line I read the value from hybrid_block and read the string values in it. The pipe | is just a separator, so the two strings are at index 0 and 2, as A and C. So, all I want from this line is a
defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}
As I progress through the lines, I want to append the string values from each column and finally create:
defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
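For problem level 01, a minimal pure-Python sketch might look like the following. It is not taken from the answer further down; the function name read_columns and the keys H0 and H1 (one list per hybrid string, instead of two lists under hybrid_block) are assumptions made for illustration.

```python
from collections import defaultdict
from io import StringIO

# Sample data from the question; StringIO stands in for an open file handle.
raw = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
"""


def read_columns(handle):
    """Hypothetical helper: collect the letters of every column into a list."""
    header = next(handle).split()          # first line holds the column names
    columns = defaultdict(list)
    for line in handle:
        fields = line.split()
        if not fields:                     # skip blank lines
            continue
        for name, value in zip(header, fields):
            if name == 'hybrid_block':
                # 'A|C' -> one letter for each of the two hybrid strings
                left, right = value.split('|')
                columns['H0'].append(left)
                columns['H1'].append(right)
            else:
                columns[name].append(value)
    return columns


columns = read_columns(StringIO(raw))
print(columns['M1'], columns['H0'], columns['H1'])
```

With the sample data, columns['M1'] and columns['H0'] both come out as ['A', 'T', 'C', 'G'], and columns['H1'] as ['C', 'A', 'G', 'T'].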
Problem level 02:
I read the data in hybrid_block for the first line, which are A and C.
Now I want to create keys, but unlike fixed keys, these keys will be generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC, which mean (A given A) and (C given C). For the values, I count the number of A (and of C) in block M and in block S. So the data will be stored as:
defaultdict(<class 'dict'>, {'M': {'AgA': [4], 'CgC': [1]}, 'S': {'AgA': 2, 'CgC': 2}}
As I read through the other lines, I want to create new keys based on the strings in the hybrid block, and count the number of times that string was present in block M vs. block S given the string in the preceding line. That means the keys while reading line 2 would be TgA, which means (T given A), and AgC. For the values inside these keys, I count the number of times I found T in this line after A in the previous line, and the same for AgC.
The defaultdict after reading 3 lines would be:
defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': [1], 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}}
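For problem level 02, the counting described above could be sketched in plain Python as follows. The name transition_counts and the row layout (M letters, hybrid pair, S letters) are assumptions for illustration. The sketch stores (key, count) tuples in a list rather than a dict of keys, because the same key, for example AgA, could in principle turn up at more than one position.

```python
from collections import defaultdict

# Each row: letters of M1..M8, the hybrid pair, letters of S1..S8
# (transcribed from the sample data in the question).
rows = [
    ("ATTAAGAC", "A|C", "CGCTTAGA"),
    ("TGCTGTTG", "T|A", "ATATCAAT"),
    ("CAACAGTC", "C|G", "GACGCGCG"),
    ("GTGTATCT", "G|T", "CTTTATCT"),
]


def transition_counts(rows):
    """Hypothetical helper: count 'XgY' (X given previous Y) per block."""
    counts = {'M': defaultdict(list), 'S': defaultdict(list)}
    prev = rows[0]                      # the first row is compared with itself
    for row in rows:
        m, hybrid, s = row
        prev_m, prev_hybrid, prev_s = prev
        pairs = zip(('H0', 'H1'), hybrid.split('|'), prev_hybrid.split('|'))
        for h_name, cur, before in pairs:
            key = f'{cur}g{before}'     # e.g. 'TgA' means T given A
            for block, now, earlier in (('M', m, prev_m), ('S', s, prev_s)):
                # columns showing the same previous -> current letters
                n = sum(c == cur and p == before for c, p in zip(now, earlier))
                counts[block][h_name].append((key, n))
        prev = row
    return counts


counts = transition_counts(rows)
print(dict(counts['M']))
```

For block M and the first hybrid string this gives [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)].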
I understand this looks too complicated. I went through several dictionary and defaultdict tutorials but couldn't find a way of doing this.
A solution to either part, if not both, is highly appreciated.
Thanks,
pandas
setup
from io import StringIO
import pandas as pd
import numpy as np
txt = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T """
# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv(StringIO(txt), sep=r'\s+', index_col='pos')
df
solution
mostly pandas with some numpy
- split hybrid column
- prepend identical first row
- add with shifted version of self to get 'AgA' type strings
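# split the hybrid column into H0 and H1, keeping the M and S blocks on either side
d1 = pd.concat([
    df.filter(like='M'),
    df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
    df.filter(like='S')
], axis=1)
# prepend a copy of the first row (as index 0) so that row 1 has a "previous" row
d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])
# current + 'g' + previous gives strings such as 'AgA'; the helper row 0 drops out
d1 = d1.add('g').add(d1.shift()).dropna()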
d1
Assign convenient blocks to their own variable names
m = d1.filter(like='M')
s = d1.filter(like='S')
h = d1.filter(like='H')
Count how many are in each block and concatenate
mcounts = pd.DataFrame(
(m.values[:, :, None] == h.values[:, None, :]).sum(1),
h.index, h.columns
)
scounts = pd.DataFrame(
(s.values[:, :, None] == h.values[:, None, :]).sum(1),
h.index, h.columns
)
counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
counts
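The comparison inside mcounts and scounts relies on numpy broadcasting. As a small illustration, assuming the transition strings that d1 would hold for block M and the hybrid columns at positions 1 and 2 (worked out by hand from the sample data, not printed in the original answer), the same expression behaves like this:

```python
import numpy as np

# Transition strings for block M at positions 1 and 2 (8 columns each).
m = np.array([
    ['AgA', 'TgT', 'TgT', 'AgA', 'AgA', 'GgG', 'AgA', 'CgC'],
    ['TgA', 'GgT', 'CgT', 'TgA', 'GgA', 'TgG', 'TgA', 'GgC'],
])
# Transition strings for the two hybrid columns H0 and H1.
h = np.array([
    ['AgA', 'CgC'],
    ['TgA', 'AgC'],
])

# m[:, :, None] has shape (2, 8, 1) and h[:, None, :] has shape (2, 1, 2),
# so the comparison broadcasts to (2, 8, 2): every M column against
# every hybrid column, row by row.
matches = m[:, :, None] == h[:, None, :]

# Summing over axis 1 (the 8 M columns) leaves one count per (row, hybrid column).
counts = matches.sum(axis=1)
print(counts)
```

This prints [[4 1] [3 0]]: AgA matches 4 M columns and CgC matches 1 at position 1, while TgA matches 3 and AgC matches none at position 2.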
If you really want a dictionary
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))
dict_df = counts.stack().join(h.stack().rename('condition')).unstack()
for pos, row in dict_df.iterrows():
    d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
    d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
    d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
    d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))
dict(d)
{'M': defaultdict(list,
{'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
'S': defaultdict(list,
{'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}
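To get from these counts back to the original question (which hybrid string came from which block), one rough option is to compare a score per block. The sketch below is only a heuristic under stated assumptions, not part of the answer above and not a full Markov likelihood: it sums log(count + 1) over positions, where the pseudocount of 1 is an arbitrary choice to avoid log(0), and the names score and best are made up for illustration.

```python
from math import log

# Counts copied from the dictionary output above.
counts = {
    'M': {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
          'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]},
    'S': {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
          'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]},
}


def score(pairs, pseudocount=1):
    """Hypothetical score: sum of log(count + pseudocount) over positions."""
    return sum(log(n + pseudocount) for _, n in pairs)


best = {}
for h_name in ('H0', 'H1'):
    scores = {block: score(counts[block][h_name]) for block in counts}
    best[h_name] = max(scores, key=scores.get)   # block with the higher score

print(best)
```

On the sample data this picks M for H0 (A-T-C-G) and S for H1 (C-A-G-T), which agrees with what the question expects by eye.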