How to selectively remove columns and rows with bash or python
Question
Update
I suspect that the input and desired output data I initially posted didn't exactly match what I have with respect to whitespace. I've now put in new input data and desired output data.
I currently have a data set that looks like this:
Input
Hybridization REF TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-WR-A838-01A-12D-A409-05 TCGA-WR-A838-01A-12D-A409-05 TCGA-WR-A838-01A-12D-A409-05 TCGA-WR-A838-01A-12D-A409-05
Composite Element REF Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate
cg00000029 0.162232896986279 RBL2 16 53468112 0.191627667901702 RBL2 16 53468112 0.0712181967886229 RBL2 16 53468112 0.0797617926225958 RBL2 16 53468112 0.134907151266991 RBL2 16 53468112 0.0541415985613948 RBL2 16 53468112 0.0898579298345672 RBL2 16 53468112 0.037865566345129 RBL2 16 53468112 0.0681542463965581 RBL2 16 53468112 0.101053013486289 RBL2 16 53468112
cg00000108 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206
cg00000109 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037
The real data set is much bigger, almost 10 GB in size, so it is too big to process in R, for example.
However, a lot of the columns are effectively duplicates. For example, I only need to keep one each of the columns titled (second row) Gene_Symbol, Chromosome and Genomic_Coordinate. The individual Beta_value columns need to stay because they are different for each sample. Sample IDs are on the first row. So an example desired output of the above is:
Desired output
Hybridization REF Gene_Symbol Chromosome Genomic_Coordinate TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-WR-A838-01A-12D-A409-05
cg00000029 RBL2 16 53468112 0.162232897 0.191627668 0.071218197 0.079761793 0.134907151 0.054141599 0.08985793 0.037865566 0.068154246 0.101053013
cg00000108 C3orf35 3 37459206 NA NA NA NA NA NA NA NA NA NA
cg00000109 FNDC3B 3 171916037 NA NA NA NA NA NA NA NA NA NA
Note that I've shuffled the column headers in the first row to remove redundant information. The TCGA... headed columns are the sample identifiers. Note that hard coding the column numbers won't work, since I need to iterate through multiple files, all of which will have variable numbers of samples/columns.
What's the most efficient way of doing that with bash or with python?
Edit:
Using John Zwinck's answer below, but with \t in between each %s, I am now getting the following error:
+ NR == 1 '{'
../../../methods/meth_ma_gene/awk_methreformat.sh: line 3: NR: command not found
../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: syntax error near unexpected token `('
../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: ` for (i=1; i <= NF; i++) {'
I run the awk script in a bash script as follows:
#!/usr/bin/env bash
for f in TCGAOV*; do
    sed 's/Hybridization REF/Hybridization_REF/; s/Composite Element REF/Composite_Element_REF/' "$f" > "final.$f"
    bash -x ./../../methods/meth_ma_gene/awk_methreformat.sh "final.$f" > "final.$f"
done
where the referenced awk script is:
#!/usr/bin/awk -f
NR == 1 {
    # collect sample names
    for (i=1; i <= NF; i++) {
        sample[i] = $i
    }
}
NR == 2 {
    # first four columns are always the same
    cols[1] = 1
    cols[2] = 3
    cols[3] = 4
    cols[4] = 5
    printf "%s\t%s\t%s\t%s\t", sample[1], $3, $4, $5
    # dynamic columns (in practice: 2,6,10,...)
    for (i=1; i <= NF; i++) {
        if ($i == "Beta_value") {
            cols[length(cols)+1] = i
            printf "%s\t", sample[i]
        }
    }
    printf "\n"
}
NR >= 3 {
    # print cols from data row
    for (i=1; i <= length(cols); i++) {
        printf "%s\t", $cols[i]
    }
    printf "\n"
}
I think this may be something to do with the spacing between the fields. Note that I've also amended the input above, since I missed that there was actually no _ in between the words forming Hybridization REF and also those forming Composite Element REF.
Recommended answer
You don't really want to load the input data into memory, because it's so large. Instead, a streaming approach will be faster, and for this awk is well suited:
#!/usr/bin/awk -f
BEGIN {
    FS = "\t";
    OFS = FS;
}
NR == 1 {
    # collect sample names
    for (i=1; i <= NF; i++) {
        sample[i] = $i
    }
}
NR == 2 {
    # first four columns are always the same
    cols[1] = 1
    cols[2] = 3
    cols[3] = 4
    cols[4] = 5
    printf "%s %s %s %s ", sample[1], $3, $4, $5
    # dynamic columns (in practice: 2,6,10,...)
    for (i=1; i <= NF; i++) {
        if ($i == "Beta_value") {
            cols[length(cols)+1] = i
            printf "%s ", sample[i]
        }
    }
    printf "\n"
}
NR >= 3 {
    # print cols from data row
    for (i=1; i <= length(cols); i++) {
        printf "%s ", $cols[i]
    }
    printf "\n"
}
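A note on running it: the "NR: command not found" error quoted in the question is what bash prints when it is asked to interpret awk code. Calling the file with bash -x ignores the #!/usr/bin/awk -f line, so the script has to be handed to awk instead. Redirecting the output onto the same file that is being read also empties that file before awk sees it. The sketch below shows one way to wire this up. The file names (sample.tsv, reformat.awk, reformatted.tsv) and the sample names S1 and S2 are made up for illustration, and the awk code is a condensed, tab-joined variant of the script above that uses a counter instead of length(cols).

```shell
#!/usr/bin/env bash
# Sketch only: all file names and the S1/S2 sample names are hypothetical.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Tiny tab-separated stand-in for one input file: two header rows, one data row.
printf 'Hybridization REF\tS1\tS1\tS1\tS1\tS2\tS2\tS2\tS2\n' > sample.tsv
printf 'Composite Element REF\tBeta_value\tGene_Symbol\tChromosome\tGenomic_Coordinate\tBeta_value\tGene_Symbol\tChromosome\tGenomic_Coordinate\n' >> sample.tsv
printf 'cg00000029\t0.16\tRBL2\t16\t53468112\t0.19\tRBL2\t16\t53468112\n' >> sample.tsv

# Condensed variant of the awk script: same column selection, joined with tabs.
cat > reformat.awk <<'EOF'
BEGIN { FS = OFS = "\t" }
NR == 1 { for (i = 1; i <= NF; i++) sample[i] = $i; next }
NR == 2 {
    # ID column plus the first Gene_Symbol/Chromosome/Genomic_Coordinate triple
    n = 4; cols[1] = 1; cols[2] = 3; cols[3] = 4; cols[4] = 5
    line = sample[1] OFS $3 OFS $4 OFS $5
    # every Beta_value column, labelled with its sample name from row 1
    for (i = 1; i <= NF; i++)
        if ($i == "Beta_value") { cols[++n] = i; line = line OFS sample[i] }
    print line
    next
}
{
    line = $(cols[1])
    for (i = 2; i <= n; i++) line = line OFS $(cols[i])
    print line
}
EOF

# Run the script with awk (not bash) and write to a file other than the input.
awk -f reformat.awk sample.tsv > reformatted.tsv
cat reformatted.tsv
```

Because the field separator is set to a tab, "Hybridization REF" and "Composite Element REF" are each read as a single field, so the sed step that inserts underscores should not be needed as long as the input really is tab-separated.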
The awk script above gives your desired output. If you want more speed, you might consider using awk simply to print the column numbers (which only requires reading the two header rows), then cut to actually print them. This will be faster because no interpreted code needs to run for each data row. For the sample data in the question, the cut command you need to print all the data rows is something like this (tab is already cut's default delimiter, and cut always writes fields in input order, whatever order they are listed in):
cut -f 1,3,4,5,2,6
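The awk-then-cut idea can be sketched as follows. This is an illustration under assumptions, not a tuned solution: sample.tsv, data_rows.tsv and the sample names S1 and S2 are hypothetical, and the input is assumed to be tab-separated. awk reads only as far as the second header row to build the field list, and cut then handles the data rows.

```shell
#!/usr/bin/env bash
# Sketch only: sample.tsv is a hypothetical stand-in for one input file.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'Hybridization REF\tS1\tS1\tS1\tS1\tS2\tS2\tS2\tS2\n' > sample.tsv
printf 'Composite Element REF\tBeta_value\tGene_Symbol\tChromosome\tGenomic_Coordinate\tBeta_value\tGene_Symbol\tChromosome\tGenomic_Coordinate\n' >> sample.tsv
printf 'cg00000029\t0.16\tRBL2\t16\t53468112\t0.19\tRBL2\t16\t53468112\n' >> sample.tsv

# Build the field list from the second header row, then stop reading:
# the ID column, the first annotation triple, and every Beta_value column.
fields=$(awk -F'\t' 'NR == 2 {
    list = "1,3,4,5"
    for (i = 1; i <= NF; i++) if ($i == "Beta_value") list = list "," i
    print list
    exit
}' sample.tsv)
echo "$fields"

# Data rows start at line 3; cut does the per-row work without any awk code.
tail -n +3 sample.tsv | cut -f "$fields" > data_rows.tsv
cat data_rows.tsv
```

Two caveats with this route. cut writes the selected fields in the order they occur in the input, so the first sample's Beta_value stays in column 2 instead of following the three annotation columns. The header line is also not produced here and would still have to be assembled from the first row, for example with awk as in the script above.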