How to read two lines from a file and create dynamics keys in a for-loop, a follow-up

Views: 110 · Posted: 2020/5/5 13:48:58 · Tags: python pandas numpy dictionary defaultdict


Problem description

This question is a follow-up to: How to read two lines from a file and create dynamics keys in a for-loop?

However, the problem has grown more complex since then, and I want to address that complexity here.

Below is the structure of my data, separated by whitespace.

chr pos         M1  M2  Mk  Mg1  F1_hybrid     F1_PG    F1_block    S1  Sk1   S2    Sj
2   16229767    T/T T/T T/T G/T C|T 1|0 726  .  T/C T/C T/C
2   16229783    C/C C/C C/C A/C G|C 0|1 726 G/C G/C G/C C|G
2   16229992    A/A A/A A/A G/A G|A 1|0 726 A/A A/A A/A A|G
2   16230007    T/T T/T T/T A/T A|T 1|0 726 A|T A|T A|T A|T
2   16230011    G/G G/G G/G G/G C|G 1|0 726 G/C C|G C|G G/C
2   16230049    A/A A/A A/A A/A T|A 1|0 726 A|T .   A/T A/T
2   16230174    .   .   .   C/C T|C 1|0 726 C|T T|C T|C C|T
2   16230190    A/A A/A A/A A/A T|A 1|0 726 T|G G|T T|G T|G
2   16230260    A/A A/A A/A A/A G|A 1|0 726 G/G G/G G/G G/G

Explanation:

There are two major categories of data in the above file. Data from group M have sample names starting with M, and similarly group S has several column names starting with S.
There is also a hybrid column (F1_hybrid).
The data form a string along the position lines. F1_hybrid is phased, with a pipe (|) separating the two letters. So one string from F1 is C-G-G-A-C-T-T-T-G, while the other is T-C-A-T-G-A-C-A-A. One of these strings comes from the M-group and the other from the S-group, but I need to do some statistical analysis to tell which is which. Visually, however, I can tell that the T-C-A-T-G-A-C-A-A string most likely came from the M-group.
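As a small illustration of how the two phased strings are read off the F1_hybrid column, the sketch below splits each value on the pipe. The list `f1_hybrid` is simply the column copied from the table above, and the helper name `split_haplotypes` is made up for this example.

```python
def split_haplotypes(phased_values):
    """Split phased 'X|Y' values into the left and right haplotype strings."""
    left, right = [], []
    for value in phased_values:
        a, b = value.split("|")  # left allele, right allele
        left.append(a)
        right.append(b)
    return "-".join(left), "-".join(right)


# F1_hybrid column, copied top to bottom from the data table
f1_hybrid = ["C|T", "G|C", "G|A", "A|T", "C|G", "T|A", "T|C", "T|A", "G|A"]
first, second = split_haplotypes(f1_hybrid)
print(first)   # C-G-G-A-C-T-T-T-G
print(second)  # T-C-A-T-G-A-C-A-A
```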

Procedure:

I read the first line and create unique keys from the column information.
Then I read the 2nd and 3rd lines and the values in F1_hybrid, which are C|T and G|C. Now, I need to calculate how many GgC (read as G given C) vs. CgT (C given T) exist in the M-group vs. the S-group.
Then I read the 3rd (G|C) and 4th (G|A) lines in F1_hybrid. So, the states are GgG and AgC. Similarly, I now count how many GgG vs. AgC exist in the M-group vs. the S-group.

Therefore, I am trying to build a Markov model that counts the states of a phased string from F1 and takes the observed counts in group M vs. group S.

I will now explain how to count the number of any XgY (X given Y) based on F1_hybrid:

It is important to note the conditions before doing the count. A pair of states may be phased (both are written with a pipe) or unphased (at least one of the two lines has a slash (/)).

Condition 01:

For the 2nd and 3rd lines, the M1 sample has the states T/T and C/C. Since the separator is a slash (/) and not a pipe (|), we cannot tell which exact state the M1 sample is in. But we can create a combination matrix (previous state against present state):

    T     T
C  CgT   CgT
C  CgT   CgT

Now we can tell that there are 4 CgT in total,

and we keep building the same matrix whenever this condition is met.

Condition 02:

The same applies to the other samples from group M, except for Mg1, where G/T precedes A/C. So, the matrix is:

    G     T
A  AgG   AgT
C  CgG   CgT

So, here we observe a count of 1 for CgT.
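The two combination matrices above amount to a Cartesian product of the previous alleles with the present alleles. Here is a minimal sketch of that idea; the function name `combination_counts` is an assumption made for this illustration, and it only handles slash-separated values.

```python
from collections import Counter
from itertools import product


def combination_counts(previous, present):
    """Count every 'present g previous' pairing of two unphased genotypes."""
    prev_alleles = previous.split("/")
    pres_alleles = present.split("/")
    # one matrix cell per (previous allele, present allele) combination
    return Counter(f"{y}g{x}" for x, y in product(prev_alleles, pres_alleles))


print(combination_counts("T/T", "C/C"))         # Counter({'CgT': 4})
print(combination_counts("G/T", "A/C")["CgT"])  # 1
```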

Condition 03:

But if the previous state and the present state are both phased by a pipe (like A|T at position 16230007 with C|G at position 16230011 for sample Sk1), we can directly count the phased states observed at that position: there are only CgA and GgT, so the count of CgT is 0.

Condition 04: If one of the states has a pipe (|) but the other has a slash (/), the count is done the same way as when both states have a slash.

Condition 05: If either previous_state or present_state is a period (.), the observed count is automatically zero (0) for the state expected from F1_hybrid.
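Taken together, the five conditions reduce to three counting rules. The sketch below only decides which rule applies to a pair of values; the name `classify_pair` and the three labels are assumptions made for illustration, and a missing value is taken to be exactly the string ".".

```python
def classify_pair(previous, present):
    """Return which counting rule applies to a (previous, present) pair."""
    if "." in (previous, present):
        return "missing"    # condition 05: the count is 0
    if "|" in previous and "|" in present:
        return "phased"     # condition 03: direct count of the two phased states
    return "unphased"       # conditions 01, 02 and 04: 2 x 2 combination matrix


print(classify_pair("A|T", "C|G"))  # phased
print(classify_pair("T/C", "C|G"))  # unphased
print(classify_pair(".", "G/C"))    # missing
```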

So, the expected output should be something like this:

pos     M1  M2  Mk  Mg1 H0  H1  S1  Sk1 S2  Sj
16..9783    4-CgT   4-CgT   4-CgT   1-CgT   GgC CgT 0   1-CgT   1-CgT   1-CgT
16..9992    4-AgC   4-AgC   4-AgC   2-AgC   GgG AgC 1-AgC   1-AgC   1-AgC   1-AgC,1-GgG
16..0007    4-TgA   4-TgA   4-TgA   1-AgG,1-TgA AgG TgA 2-TgA   2-TgA   2-TgA1  1-TgA
..................contd

Or, the values in dictionary format for each column would work equally well. Something like ['4-CgT','4-CgT','4-CgT','4-CgT'] for the first M1 at position 16..9783, and the same for the others.

Solution

The question is a bit old, but interesting because you have a very clear specification and you need help to write the code. I will present a solution following a top-down approach, which is a very well-known method, using plain old Python. It shouldn't be difficult to adapt to pandas.

To me, the top-down approach means: if you don't know how to write it, just name it!

You have a file (or a string) as input, and you want to output a file (or a string). It seems quite simple, but you want to merge pairs of rows to build every new row. The idea is:

get the rows of the input, as dictionaries
take them two by two (each row with the one before it)
build a new row for each pair
output the result

For now, you don't know how to write the generator of rows. You don't know how to build a new row for each pair either. Don't let the difficulties block you, just name the solutions. Imagine you have a function get_rows and a function build_new_row. Let's write this:

def build_new_rows(f):
    """generate the new rows. Output may be redirected to a file"""
    rows = get_rows(f) # get a generator on rows = dictionaries.
    r1 = next(rows, None) # store the first row (None if there is no data row)
    if r1 is None:
        return # nothing to pair; a bare next() would raise RuntimeError inside a generator
    for r2 in rows: # for every following row
        yield build_new_row(r1, r2) # yield a new row built of the previous stored row and the current row.
        r1 = r2 # store the current row, which becomes the previous row
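To see the pairing pattern on its own, here is a minimal sketch with plain strings instead of row dictionaries. The helper name `consecutive_pairs` is made up for this illustration and is not part of the solution's code.

```python
def consecutive_pairs(items):
    """Yield (previous, current) for every item after the first."""
    iterator = iter(items)
    previous = next(iterator, None)
    if previous is None:
        return  # empty input: nothing to pair
    for current in iterator:
        yield previous, current
        previous = current  # the current item becomes the previous one


positions = ["16229767", "16229783", "16229992"]
print(list(consecutive_pairs(positions)))
# [('16229767', '16229783'), ('16229783', '16229992')]
```

With n input rows this yields n - 1 pairs, which is why the result table has one row fewer than the data.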

Now, examine the two "missing" functions: get_rows and build_new_row. The function get_rows is quite easy to write. Here's the main part:

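def get_rows(f):
    """yield every data line as a dict: column name -> value"""
    header = process_line(next(f)) # the first line holds the column names
    for line in f:
        yield dict(zip(header, process_line(line)))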

where process_line just splits the line on whitespace, e.g. with re.split(r"\s+", line.strip()), or simply line.split().
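As a quick check of the row generator, the sketch below feeds it the header and the first data line from the question through an in-memory file. The use of io.StringIO as a stand-in for a real file is an assumption made for this example.

```python
import io
import re


def process_line(line):
    """Split a line on runs of whitespace."""
    return re.split(r"\s+", line.strip())


def get_rows(f):
    """Yield every data line as a dict: column name -> value."""
    header = process_line(next(f))
    for line in f:
        yield dict(zip(header, process_line(line)))


sample = io.StringIO(
    "chr pos         M1  M2  Mk  Mg1  F1_hybrid     F1_PG    F1_block    S1  Sk1   S2    Sj\n"
    "2   16229767    T/T T/T T/T G/T C|T 1|0 726  .  T/C T/C T/C\n"
)
row = next(get_rows(sample))
print(row["pos"], row["F1_hybrid"], row["S1"])  # 16229767 C|T .
```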

The second part is build_new_row. Still the top-down approach: you need to build the H0 and H1 of your expected table, and then to build the count of H1 for every M and S according to the conditions you described. Pretend you have a pipe_compute function that computes H0 and H1, and a build_count function that builds the count of H1 for every M and S:

def build_new_row(r1, r2):
    """build a row"""
    h0, h1 = pipe_compute(r1["F1_hybrid"], r2["F1_hybrid"])

    # initialize the dict with the pos, H0 and H1
    new_row = {"pos":r2["pos"], "H0":h0, "H1":h1}

    for key in r1.keys():
        if key[0] in ("M", "S"):
            new_row[key] = build_count(r1[key], r2[key], h1)

    return new_row

You have almost everything now. Take a look at pipe_compute: it's exactly what you have written in your condition 03.

def pipe_compute(v1, v2):
    """build H0 H1 according to condition 03"""
    xs = v1.split("|")
    ys = v2.split("|")
    return [ys[0]+"g"+xs[0], ys[1]+"g"+xs[1]]
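For a concrete feel of what pipe_compute returns, the sketch below repeats the function and applies it to the first two F1_hybrid pairs from the question; the values match the H0 and H1 states named there.

```python
def pipe_compute(v1, v2):
    """Pair up the left alleles, then the right alleles, as 'present g previous'."""
    xs = v1.split("|")  # previous row
    ys = v2.split("|")  # present row
    return [ys[0] + "g" + xs[0], ys[1] + "g" + xs[1]]


print(pipe_compute("C|T", "G|C"))  # ['GgC', 'CgT']
print(pipe_compute("G|C", "G|A"))  # ['GgG', 'AgC']
```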

And for build_count, stick to the top-down approach:

def build_count(v1, v2, to_count):
    """nothing funny here: just follow the conditions"""
    if is_slash_count(v1, v2): # are conditions 01, 02, 04 true ?
        c = slash_count(v1, v2)[to_count] # count how many "to_count" we find in the 2 x 2 table of conditions 01 or 02.
    elif "|" in v1 and "|" in v2: # condition 03
        c = pipe_count(v1, v2)[to_count]
    elif "." in v1 or "." in v2: # condition 05
        return '0'
    else:
        raise ValueError("unexpected values: {!r}, {!r}".format(v1, v2))

    return "{}-{}".format(c, to_count) # n-XgY

We are still going down. When do we have is_slash_count? Two slashes (conditions 01 and 02) or one slash and one pipe (condition 04):

def is_slash_count(v1, v2):
    """conditions 01, 02, 04"""
    return "/" in v1 and "/" in v2 or "/" in v1 and "|" in v2 or "|" in v1 and "/" in v2

The function slash_count is simply the 2 x 2 table of conditions 01 and 02:

import collections # needed by slash_count and pipe_count
import re # needed by slash_count (and by process_line)

def slash_count(v1, v2):
    """count according to conditions 01, 02, 04"""
    cnt = collections.Counter()
    for x in re.split("[|/]", v1): # cartesian product
        for y in re.split("[|/]", v2): # cartesian product
            cnt[y+"g"+x] += 1
    return cnt # a dictionary XgY -> count(XgY)

The function pipe_count is even simpler, because you just have to count the result of pipe_compute:

def pipe_count(v1, v2):
    """count according to condition 03"""
    return collections.Counter(pipe_compute(v1, v2))

Now you're done (and down). I get this result, which is slightly different from what you expected, but you have certainly already spotted my mistake(s?):

pos M1  M2  Mk  Mg1 H0  H1  S1  Sk1 S2  Sj
16229783    4-CgT   4-CgT   4-CgT   1-CgT   GgC CgT 0   1-CgT   1-CgT   1-CgT
16229992    4-AgC   4-AgC   4-AgC   1-AgC   GgG AgC 2-AgC   2-AgC   2-AgC   1-AgC
16230007    4-TgA   4-TgA   4-TgA   1-TgA   AgG TgA 2-TgA   2-TgA   2-TgA   0-TgA
16230011    4-GgT   4-GgT   4-GgT   2-GgT   CgA GgT 1-GgT   1-GgT   1-GgT   1-GgT
16230049    4-AgG   4-AgG   4-AgG   4-AgG   TgC AgG 1-AgG   0   1-AgG   1-AgG
16230174    0   0   0   4-CgA   TgT CgA 1-CgA   0   1-CgA   1-CgA
16230190    0   0   0   4-AgC   TgT AgC 0-AgC   0-AgC   0-AgC   0-AgC
16230260    4-AgA   4-AgA   4-AgA   4-AgA   GgT AgA 0-AgA   0-AgA   0-AgA   0-AgA
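The last step of the plan, "output the result", is not shown above. Here is one possible sketch, assuming tab-separated output with the columns ordered as in the table (pos, the M samples, H0, H1, the S samples). The function `write_table` is a hypothetical addition, and the helpers are condensed restatements of the functions above, with the condition checks reordered, not the original code.

```python
import collections
import io
import re


def get_rows(f):
    header = next(f).split()
    for line in f:
        if line.strip():  # skip blank lines
            yield dict(zip(header, line.split()))


def pipe_compute(v1, v2):
    xs, ys = v1.split("|"), v2.split("|")
    return [ys[0] + "g" + xs[0], ys[1] + "g" + xs[1]]


def build_count(v1, v2, to_count):
    if "." in v1 or "." in v2:        # condition 05
        return "0"
    if "|" in v1 and "|" in v2:       # condition 03
        counts = collections.Counter(pipe_compute(v1, v2))
    else:                             # conditions 01, 02, 04
        counts = collections.Counter(
            y + "g" + x
            for x in re.split("[|/]", v1)
            for y in re.split("[|/]", v2)
        )
    return "{}-{}".format(counts[to_count], to_count)


def build_new_row(r1, r2):
    h0, h1 = pipe_compute(r1["F1_hybrid"], r2["F1_hybrid"])
    new_row = {"pos": r2["pos"], "H0": h0, "H1": h1}
    for key in r1:
        if key[0] in ("M", "S"):
            new_row[key] = build_count(r1[key], r2[key], h1)
    return new_row


def write_table(f, out):
    """Write the new rows to `out` as tab-separated text, header first."""
    rows = get_rows(f)
    r1 = next(rows, None)
    if r1 is None:
        return  # no data rows
    m_cols = [k for k in r1 if k[0] == "M"]
    s_cols = [k for k in r1 if k[0] == "S"]
    columns = ["pos"] + m_cols + ["H0", "H1"] + s_cols
    out.write("\t".join(columns) + "\n")
    for r2 in rows:
        new_row = build_new_row(r1, r2)
        out.write("\t".join(new_row[c] for c in columns) + "\n")
        r1 = r2


# header and first three data lines from the question
data = io.StringIO(
    "chr pos M1 M2 Mk Mg1 F1_hybrid F1_PG F1_block S1 Sk1 S2 Sj\n"
    "2 16229767 T/T T/T T/T G/T C|T 1|0 726 . T/C T/C T/C\n"
    "2 16229783 C/C C/C C/C A/C G|C 0|1 726 G/C G/C G/C C|G\n"
    "2 16229992 A/A A/A A/A G/A G|A 1|0 726 A/A A/A A/A A|G\n"
)
out = io.StringIO()
write_table(data, out)
print(out.getvalue(), end="")
```

Passing an open file and sys.stdout instead of the two io.StringIO objects should work the same way, since only next(), iteration and write() are used.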


Beyond the solution to this specific problem, what matters is the method I used, which is widely used in software development. The code itself could be improved a lot.
