How to read two lines from a file and create dynamic keys in a for-loop?
Question
In the following data, I am trying to run a simple Markov model.
Say I have data with the following structure:
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
Block M represents data from one set of categories, and so does block S.
The data are strings made by connecting the letters along the position column. So the string value for M1 is A-T-C-G, and every other column is read the same way.
There is also one hybrid block that holds two strings, which are read in the same way. The question is: which string in the hybrid block most likely came from which block (M vs. S)?
I am trying to build a Markov model that can help me identify which string in the hybrid block came from which block. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.
I have divided the problem into parts for reading and mining the data:
Problem level 01:
- First I read the first line (the header) and create unique keys for all the columns.
- Then I read the 2nd line (pos with value 1) and create another key. In the same line I read the value from hybrid_block and read the string values in it. The pipe | is just a separator, so the two strings are at index 0 and 2, namely A and C. So, all I want from this line is:
defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}
As I progress through the lines, I want to append the string values from each column and finally create:
defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
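The steps of problem level 01 can be sketched in plain Python as follows. This is only a minimal illustration, not the method used in the answer further down: the function name `read_columns` and the keys `H0`/`H1` for the two hybrid strings are assumptions made for this example, and the sample text stands in for the file (passing an open file object instead of `text.splitlines()` should behave the same way).

```python
from collections import defaultdict

# Sample data copied from the question; in practice the lines would come from a file.
text = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
"""

def read_columns(lines):
    """Collect the letters of every column; hybrid_block is split on '|'.

    'H0' and 'H1' are hypothetical key names for the two hybrid strings.
    """
    lines = iter(lines)
    header = next(lines).split()          # first line: column names
    columns = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:                    # skip blank lines
            continue
        for name, value in zip(header, fields):
            if name == "pos":
                continue                  # the position is not a data column
            if name == "hybrid_block":
                left, right = value.split("|")
                columns["H0"].append(left)
                columns["H1"].append(right)
            else:
                columns[name].append(value)
    return columns

columns = read_columns(text.splitlines())
print(columns["M1"])  # ['A', 'T', 'C', 'G']
print(columns["H1"])  # ['C', 'A', 'G', 'T']
```

Using `defaultdict(list)` means a key is created the first time a column name is seen, so no keys have to be declared before the loop.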
Problem level 02:
I read the data in hybrid_block for the first line, which are A and C.
Now, I want to create keys, but unlike fixed keys, these keys will be generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC, which mean (A given A, and C given C), and for the values I count the number of A (and likewise C) in block M and block S. So, the data will be stored as:
defaultdict(<class 'dict'>, {'M': {'AgA': [4], 'CgC': [1]}, 'S': {'AgA': 2, 'CgC': 2}}
As I read through the other lines, I want to create new keys based on the strings in the hybrid block and count the number of times each string was present in the M vs. S block, given the string in the preceding line. That means the keys while reading line 2 would be TgA, which means (T given A), and AgC. For the values inside these keys, I count the number of times I found T in this line after A in the previous line, and the same for AgC.
The defaultdict after reading 3 lines would be:
defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': [1], 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}}
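The counting described in problem level 02 can be sketched without pandas as shown below. This is a rough illustration under a few assumptions: the helper names `parse` and `transition_counts` are hypothetical, the first eight data columns are taken to be block M and the last eight block S, and the result stores a list of (key, count) pairs per hybrid string (labelled `H0` and `H1` here) rather than a flat dict, because the same key could in principle appear at more than one position.

```python
from collections import defaultdict

# Data rows from the question (header omitted); the first field is pos.
rows = """\
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T""".splitlines()

def parse(row, n_m=8):
    """Split one row into (M letters, [hybrid letters], S letters)."""
    fields = row.split()[1:]              # drop the pos column
    return fields[:n_m], fields[n_m].split("|"), fields[n_m + 1:]

def transition_counts(rows):
    parsed = [parse(r) for r in rows]
    counts = defaultdict(lambda: defaultdict(list))
    for i, (m, hyb, s) in enumerate(parsed):
        # The first row has no predecessor, so it is compared with itself,
        # which yields keys such as 'AgA' and 'CgC'.
        prev_m, prev_hyb, prev_s = parsed[i - 1] if i else parsed[i]
        for h, (cur, prev) in enumerate(zip(hyb, prev_hyb)):
            key = f"{cur}g{prev}"         # dynamic key, e.g. 'TgA' = T given A
            for block, now, before in (("M", m, prev_m), ("S", s, prev_s)):
                # Count columns that show the same letter pair as the hybrid string.
                n = sum(a == cur and b == prev for a, b in zip(now, before))
                counts[block][f"H{h}"].append((key, n))
    return counts

counts = transition_counts(rows)
print(counts["M"]["H0"])  # [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)]
print(counts["S"]["H1"])  # [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]
```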
I understand this looks too complicated. I went through several dictionary and defaultdict tutorials but couldn't find a way of doing this.
A solution to any part, if not both, is highly appreciated.
Recommended answer
pandas
Setup
from io import StringIO
import pandas as pd
import numpy as np
txt = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T """
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(StringIO(txt), sep=r'\s+', index_col='pos')
df
- split hybrid column
- prepend identical first row
- add with shifted version of self to get 'AgA' type strings
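# Split the hybrid column into H0/H1 and place it between the M and S blocks
d1 = pd.concat([
    df.filter(like='M'),
    df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
    df.filter(like='S')
], axis=1)
# Prepend a copy of row 1 as row 0 so that the first row has a "previous" row
d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])
# Join each cell with the cell above it, e.g. 'T' + 'g' + 'A' -> 'TgA'
d1 = d1.add('g').add(d1.shift()).dropna()
d1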
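To see what the prepend-and-shift step does, here is a small sketch on a single column. It assumes the first hybrid string A, T, C, G held in a Series named `h0` (a name chosen for this example); the same operations are applied to the whole frame in the answer.

```python
import pandas as pd

# First hybrid string from the question, indexed by pos.
h0 = pd.Series(list("ATCG"), index=[1, 2, 3, 4])

# Copy row 1 to a new row 0 so that row 1 has a predecessor (itself).
padded = pd.concat([h0.loc[[1]].rename(index={1: 0}), h0])

# shift() moves every value down one row; row 0 becomes NaN and is dropped.
keys = (padded + "g" + padded.shift()).dropna()
print(keys.tolist())  # ['AgA', 'TgA', 'CgT', 'GgC']
```

Each resulting string reads "current letter given previous letter", which is why the first entry is `AgA` rather than a missing value.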
Assign convenient blocks to their own variable names
m = d1.filter(like='M')
s = d1.filter(like='S')
h = d1.filter(like='H')
Count how many are in each block and concatenate
mcounts = pd.DataFrame(
(m.values[:, :, None] == h.values[:, None, :]).sum(1),
h.index, h.columns
)
scounts = pd.DataFrame(
(s.values[:, :, None] == h.values[:, None, :]).sum(1),
h.index, h.columns
)
counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
counts
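The comparison `m.values[:, :, None] == h.values[:, None, :]` relies on NumPy broadcasting. The toy arrays below (made up for illustration, not taken from the data) sketch how the shapes line up: every block cell in a row is compared with every hybrid cell in the same row, and summing over axis 1 counts the matches per hybrid column.

```python
import numpy as np

# Toy "block" with 2 rows and 3 columns, and a toy "hybrid" with 2 rows and 2 columns.
m = np.array([["AgA", "CgC", "AgA"],
              ["TgA", "AgC", "GgT"]])
h = np.array([["AgA", "CgC"],
              ["TgA", "AgC"]])

# (2, 3, 1) compared with (2, 1, 2) broadcasts to (2, 3, 2):
# matches[row, block_col, hybrid_col] is True where the strings are equal.
matches = m[:, :, None] == h[:, None, :]

# Summing over the block columns leaves one count per (row, hybrid column).
counts = matches.sum(axis=1)
print(matches.shape)  # (2, 3, 2)
print(counts)         # [[2 1]
                      #  [1 1]]
```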
If you really want a dictionary
from collections import defaultdict
d = defaultdict(lambda: defaultdict(list))
dict_df = counts.stack().join(h.stack().rename('condition')).unstack()
for pos, row in dict_df.iterrows():
    # For each position, store (key, count) pairs per block and hybrid string
    d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
    d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
    d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
    d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))
dict(d)
{'M': defaultdict(list,
{'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
'S': defaultdict(list,
{'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}
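As a rough follow-up to the question's goal, the counts above could be turned into a score per block. The sketch below is not part of the original answer and is not a proper Markov estimate: it assumes each count is divided by the number of columns in a block (8), adds a pseudocount of 1 to avoid taking the log of zero, and sums the logs. The names `log_score` and `best` are hypothetical.

```python
import math

# (key, count) pairs copied from the dictionary printed above.
counts = {
    "M": {"H0": [("AgA", 4), ("TgA", 3), ("CgT", 2), ("GgC", 1)],
          "H1": [("CgC", 1), ("AgC", 0), ("GgA", 0), ("TgG", 1)]},
    "S": {"H0": [("AgA", 2), ("TgA", 1), ("CgT", 0), ("GgC", 0)],
          "H1": [("CgC", 2), ("AgC", 2), ("GgA", 2), ("TgG", 3)]},
}
N_COLUMNS = 8  # assumption: number of columns in each block

def log_score(pairs, n_columns=N_COLUMNS, pseudo=1):
    """Crude log score: sum of log((count + pseudo) / (n_columns + pseudo))."""
    return sum(math.log((n + pseudo) / (n_columns + pseudo)) for _, n in pairs)

# For each hybrid string, pick the block with the higher score.
best = {
    h: max(counts, key=lambda block: log_score(counts[block][h]))
    for h in ("H0", "H1")
}
print(best)  # {'H0': 'M', 'H1': 'S'}
```

With these numbers the first hybrid string (ATCG) scores higher against block M and the second (CAGT) against block S, which agrees with the expectation stated in the question.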