How to selectively remove columns and rows with bash or python


Question

Update

I suspect that the input and desired output data I initially posted did not exactly match what I have with respect to whitespace. I have now posted new input data and desired output data.

I currently have a data set that looks like this:

Input

Hybridization REF   TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-WR-A838-01A-12D-A409-05    TCGA-WR-A838-01A-12D-A409-05    TCGA-WR-A838-01A-12D-A409-05    TCGA-WR-A838-01A-12D-A409-05
Composite Element REF   Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate
cg00000029  0.162232896986279   RBL2    16  53468112    0.191627667901702   RBL2    16  53468112    0.0712181967886229  RBL2    16  53468112    0.0797617926225958  RBL2    16  53468112    0.134907151266991   RBL2    16  53468112    0.0541415985613948  RBL2    16  53468112    0.0898579298345672  RBL2    16  53468112    0.037865566345129   RBL2    16  53468112    0.0681542463965581  RBL2    16  53468112    0.101053013486289   RBL2    16  53468112
cg00000108  NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206
cg00000109  NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037

The real data set is much bigger, almost 10 GB in size, so it is too big to process in R, for example.

However, a lot of the columns are effectively duplicates. I only need to keep one each of the columns titled (in the second row) Gene_Symbol, Chromosome and Genomic_Coordinate. The individual Beta_value columns need to stay because they are different for each sample. Sample IDs are on the first row. So the desired output for the example above is:

Desired output

Hybridization REF   Gene_Symbol Chromosome  Genomic_Coordinate  TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-WR-A838-01A-12D-A409-05
cg00000029  RBL2    16  53468112    0.162232897 0.191627668 0.071218197 0.079761793 0.134907151 0.054141599 0.08985793  0.037865566 0.068154246 0.101053013
cg00000108  C3orf35 3   37459206    NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
cg00000109  FNDC3B  3   171916037   NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

Note that I've rearranged the column headers in the first row to remove redundant information. The columns headed TCGA... are the sample identifiers. Hard-coding the column numbers won't work, since I need to iterate over multiple files, each of which has a different number of samples and therefore columns.

What's the most efficient way of doing that with bash or with python?

Edit:

Using John Zwinck's answer below, but with \t between each %s, I am now getting the following error:

+ NR == 1 '{'
../../../methods/meth_ma_gene/awk_methreformat.sh: line 3: NR: command not found
../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: syntax error near unexpected token `('
../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: ` for (i=1; i <= NF; i++) {'

I run the awk script from a bash script as follows:

#!/usr/bin/env bash

for f in TCGAOV*; do  

    sed 's/Hybridization REF/Hybridization_REF/; s/Composite Element REF/Composite_Element_REF/' "$f" > "final.$f"

    bash -x ./../../methods/meth_ma_gene/awk_methreformat.sh "final.$f" > "final.$f"

done

where the referenced awk script is:

#!/usr/bin/awk -f                                                                                                                                                               
NR == 1 {
    # collect sample names                                                                                                                                                               
    for (i=1; i <= NF; i++) {
        sample[i] = $i
    }
}

NR == 2 {
    # first four columns are always the same                                                                                                                                             
    cols[1] = 1
    cols[2] = 3
    cols[3] = 4
    cols[4] = 5
    printf "%s\t%s\t%s\t%s\t", sample[1], $3, $4, $5

    # dynamic columns (in practice: 2,6,10,...)                                                                                                                                          
    for (i=1; i <= NF; i++) {
        if ($i == "Beta_value") {
            cols[length(cols)+1] = i
            printf "%s\t", sample[i]
        }
    }
    printf "\n"
}

NR >= 3 {
    # print cols from data row                                                                                                                                                           
    for (i=1; i <= length(cols); i++) {
        printf "%s\t", $cols[i]
    }
    printf "\n"
}

I think this may have something to do with the spacing between the fields. Note that I've also amended the input above, since I had missed that there is actually no _ between the words in Hybridization REF or between those in Composite Element REF.
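Whether spacing matters depends on how the fields are split. As a minimal sketch (the header line below is a shortened, made-up stand-in for the real one, and it assumes the file is tab-delimited), splitting on any whitespace breaks `Hybridization REF` into two fields, while splitting on tabs keeps it whole:

```python
# Shortened stand-in for the first header row; assumes tab-delimited input.
header = "Hybridization REF\tTCGA-13-A5FT-01A-11D-A409-05\tTCGA-13-A5FT-01A-11D-A409-05"

# Splitting on any run of whitespace (awk's default field splitting behaves
# similarly) breaks the first label into two fields.
by_whitespace = header.split()

# Splitting on tabs only keeps "Hybridization REF" as a single field.
by_tab = header.split("\t")

print(len(by_whitespace), by_whitespace[:2])  # 4 ['Hybridization', 'REF']
print(len(by_tab), by_tab[0])                 # 3 Hybridization REF
```

With tab splitting, the sed step that inserts underscores should not be needed. Separately, the messages in the trace (`NR: command not found`) are printed by bash rather than awk, which suggests the awk file is being interpreted as a shell script because it is invoked through `bash -x`.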

Recommended answer

You don't really want to load the input data into memory, because it's so large. Instead, a streaming approach will be faster, and awk is well suited for this:

#!/usr/bin/awk -f

BEGIN {
    # input and output are both tab-delimited
    FS = "\t"
    OFS = FS
}

NR == 1 {
    # collect sample names
    for (i = 1; i <= NF; i++) {
        sample[i] = $i
    }
}

NR == 2 {
    # first four columns are always the same
    cols[1] = 1
    cols[2] = 3
    cols[3] = 4
    cols[4] = 5
    n = 4
    printf "%s%s%s%s%s%s%s", sample[1], OFS, $3, OFS, $4, OFS, $5

    # dynamic columns (in practice: 2,6,10,...)
    for (i = 1; i <= NF; i++) {
        if ($i == "Beta_value") {
            cols[++n] = i
            printf "%s%s", OFS, sample[i]
        }
    }
    printf "\n"
}

NR >= 3 {
    # print the selected columns from each data row, without a trailing separator
    for (i = 1; i <= n; i++) {
        printf "%s%s", $(cols[i]), (i < n ? OFS : "\n")
    }
}
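Since the question also allows Python, the same streaming idea can be sketched there. This is an illustration under assumptions rather than a tuned implementation: the function name `select_columns` is made up for the example, the input is assumed to be tab-delimited, and the annotation columns are assumed to sit in positions 3 to 5 of the second header row, as in the sample data.

```python
def select_columns(lines, out):
    """Stream tab-delimited rows, keeping one copy of the annotation columns
    and every Beta_value column, without loading the file into memory.

    `lines` must be an iterator over text lines (an open file works);
    `out` is any object with a write() method.
    """
    samples = next(lines).rstrip("\n").split("\t")  # row 1: sample IDs
    labels = next(lines).rstrip("\n").split("\t")   # row 2: column labels

    # Column 0 is the probe ID; columns 2-4 are the first Gene_Symbol,
    # Chromosome and Genomic_Coordinate (0-based; awk's 1,3,4,5).
    fixed = [0, 2, 3, 4]
    betas = [i for i, label in enumerate(labels) if label == "Beta_value"]
    keep = fixed + betas

    # Header: first label, annotation names, then one sample ID per Beta_value.
    header = [samples[0]] + [labels[i] for i in fixed[1:]] + [samples[i] for i in betas]
    out.write("\t".join(header) + "\n")

    # Data rows are handled one at a time.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        out.write("\t".join(fields[i] for i in keep) + "\n")
```

It could be driven from a small script with `select_columns(sys.stdin, sys.stdout)` after `import sys`, reading the input file on standard input. A pure-Python loop like this is likely to be slower than awk on a 10 GB file, so it mainly serves to make the column-selection logic explicit.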

This gives your desired output. If you want more speed, consider using awk only to print the column numbers (which requires reading just the two header rows), then cut to actually print the columns. That will be faster because no interpreted code needs to run for each data row. For the sample data in the question, the cut command to print all the data rows is something like this:

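cut -f 1,3,4,5,2,6

Tab is already cut's default delimiter, so -d is not needed (a quoted '\t' is a two-character string, not a tab). Also note that cut always emits fields in input order, so column 2 comes out before columns 3 to 5 regardless of the order given in the list.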
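The field list for cut can be derived from the header rows alone. A minimal sketch, assuming tab-delimited input with the column labels on the second row (the function name `cut_field_list` is made up for this example):

```python
def cut_field_list(path):
    """Read only the two header rows and return a 1-based, comma-separated
    field list suitable for `cut -f`."""
    with open(path, encoding="utf-8") as handle:
        handle.readline()  # row 1: sample IDs (not needed here)
        labels = handle.readline().rstrip("\n").split("\t")  # row 2: labels

    # Probe ID plus the first Gene_Symbol/Chromosome/Genomic_Coordinate
    # columns, then every Beta_value column.
    wanted = {1, 3, 4, 5}
    wanted.update(
        i for i, label in enumerate(labels, start=1) if label == "Beta_value"
    )

    # cut writes fields in input order, so a sorted list reflects its output.
    return ",".join(str(i) for i in sorted(wanted))
```

The returned string can be passed to `cut -f`. Because cut applies the same selection to every line, the two header rows would still need to be written separately to get the header shown in the desired output.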
