Transpose a file in bash


Problem description


I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only bash commands (I could write a Perl script of ten or so lines to do that, but it should be slower to execute than the native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this

cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem to be the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts/suggestions/brilliant ideas? :-)
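For comparison, a transpose can also be done in a single pass over the input in pure bash by holding the whole file in memory, though bash loops make it far slower than awk on large files. A minimal sketch, assuming bash 4+ for associative arrays (the function name `transpose` and the `sample` file are illustrative, not part of the question):

```shell
# Single-pass, in-memory transpose in pure bash (bash 4+ required for
# the associative array). Reads each tab-separated row once, then emits
# every input column as an output row.
transpose() {
    local -A cell          # cell[row,col] holds one field
    local -a fields
    local nrows=0 ncols=0 r c
    while IFS=$'\t' read -r -a fields; do
        for ((c = 0; c < ${#fields[@]}; c++)); do
            cell[$nrows,$c]=${fields[c]}
        done
        if (( ${#fields[@]} > ncols )); then ncols=${#fields[@]}; fi
        nrows=$((nrows + 1))
    done < "$1"
    for ((c = 0; c < ncols; c++)); do
        local row=""
        for ((r = 0; r < nrows; r++)); do
            row+=${cell[$r,$c]}$'\t'
        done
        printf '%s\n' "${row%$'\t'}"   # strip the trailing tab
    done
}

# demo on a tiny tab-separated sample
printf 'X\tcolumn1\tcolumn2\nrow1\t0\t1\nrow2\t3\t4\n' > sample
transpose sample
```

The point of the sketch is only that no per-column re-read of the file is needed; the awk answer below does the same thing much faster.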

Solution

awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

Output

$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Performance against Jonathan's Perl solution on a 10,000-line file

$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator which the OP had originally asked for so it'd handle empty fields and it coincidentally pretties-up the output a bit for this particular case.

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11
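As a quick check of the empty-field claim above, the same script can be run on input containing a blank cell (the file name `file2` and the sample data are made up for this demo; the script body is recreated inline so the snippet is self-contained):

```shell
# tst.awk is the in-memory transpose from above, written out verbatim
cat > tst.awk <<'EOF'
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}
EOF

# row1's c2 cell is deliberately empty (note the trailing tab)
printf 'X\tc1\tc2\nrow1\t0\t\nrow2\t3\t5\n' > file2
awk -f tst.awk file2
```

Because the separator is a single tab rather than a run of whitespace, the empty cell survives as an empty field in the transposed output (the `c2` row comes out as `c2<TAB><TAB>5`).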

The above solutions will work in any awk (except old, broken awk of course - there YMMV).

The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

which uses almost no memory, but reads the input file as many times as there are fields on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on every line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.
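For reference, here is a sketch of that portable variant. It is an assumption-laden rewrite, not code from the answer above: it relies on every line having the same number of tab-separated fields, and on appending entries to ARGV to queue extra passes over the file, which POSIX awk permits:

```shell
# recreate the tab-separated sample input from the question
printf 'X\tcolumn1\tcolumn2\tcolumn3\nrow1\t0\t1\t2\nrow2\t3\t4\t5\nrow3\t6\t7\t8\nrow4\t9\t10\t11\n' > file

cat > tst_portable.awk <<'EOF'
# Low-memory transpose without ENDFILE/ARGIND: on the first record,
# queue NF-1 extra passes over the same file, then during pass N print
# field number N of every line as one tab-separated output row.
BEGIN { FS=OFS="\t" }
NR==1 {
    for (i = 2; i <= NF; i++) { ARGV[ARGC] = FILENAME; ARGC++ }
}
FNR==1 {
    if (NR > 1) print ""   # finish the row built by the previous pass
    pass++
}
{ printf "%s%s", (FNR > 1 ? OFS : ""), $pass }
END { print "" }
EOF

awk -f tst_portable.awk file
```

Like the gawk version, this trades speed for memory by re-reading the file once per column.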
