在Bash中转置文件的有效方法 [英] An efficient way to transpose a file in Bash

查看:90
本文介绍了在Bash中转置文件的有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个巨大的制表符分隔文件,格式如下:

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

我想只使用bash命令以有效的方式转置(我可以编写十行左右的Perl脚本来做到这一点,但执行起来应该比本地bash慢功能).所以输出应该像

I would like to transpose it in an efficient way using only bash commands (I could write a ten or so lines Perl script to do that, but it should be slower to execute than the native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

我想到了这样的解决方案

I thought of a solution like this

cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

但是它很慢,而且似乎不是最有效的解决方案.我在这篇文章中看到了vi的解决方案,但这是仍然过慢.有什么想法/建议/好主意吗? :-)

But it's slow and doesn't seem the most efficient solution. I've seen a solution for vi in this post, but it's still over-slow. Any thoughts/suggestions/brilliant ideas? :-)

推荐答案

awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

输出

$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Jonathan在10000行文件中针对Perl解决方案的性能

Performance against Perl solution by Jonathan on a 10000 lines file

$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s

埃德·莫顿(Ed Morton)编辑(@ ghostdog74,如果您不同意,可以删除).

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

也许此版本带有一些更明确的变量名将有助于回答以下一些问题,并通常阐明脚本的功能.它还使用制表符作为OP最初要求的分隔符,因此它可以处理空字段,并且在这种情况下,它会巧合地对输出进行修饰.

Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator which the OP had originally asked for so it'd handle empty fields and it coincidentally pretties-up the output a bit for this particular case.

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

以上解决方案可在任何awk中使用(当然,旧的破损awk除外-当然还有YMMV).

The above solutions will work in any awk (except old, broken awk of course - there YMMV).

以上解决方案确实将整个文件读入内存-如果输入文件太大,则可以执行以下操作:

The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

几乎不占用内存,但每行上每个字段数只读取一次输入文件,因此它比将整个文件读取到内存中的版本要慢得多.它还假定每行上的字段数相同,并且对ENDFILEARGIND使用GNU awk,但是任何awk都可以对FNR==1END进行测试.

which uses almost no memory but reads the input file once per number of fields on a line so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line and it uses GNU awk for ENDFILE and ARGIND but any awk can do the same with tests on FNR==1 and END.

这篇关于在Bash中转置文件的有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆