Command line pivot
Problem description
I've been hunting around for the past few days for a set of command line tools, or a perl or awk script, that would let me very quickly transpose the following data:
Row|Col|Val
1|A|foo
1|B|bar
1|C|I have a real
2|A|bad
2|C|hangover
into this:
A|B|C
foo|bar|I have a real
bad||hangover
Note that there is only one value in the dataset for each "cell" (i.e., as with a spreadsheet, there aren't any duplicates of Row "1" Col "A").
I've tried various awk and shell implementations for transposing data, but can't seem to get them working. One idea I had was to cut each "Col" value into a separate file, then use the "join" command to put them back together by "Row" - but there MUST be an easier way. I'm sure this is incredibly simple to do, but I'm struggling a bit.
My input files have Cols A through G (mostly variable-length strings) and 10,000 Rows. If I can avoid loading everything into memory, that would be a huge plus.
Beer-by-mail for anyone who's got the answer!
As always - many thanks in advance for your help.
Cheers,
Josh
P.S. - I'm a bit surprised that there isn't an out-of-the-box command line util for doing this very basic type of pivot/transposition operation. I looked at http://code.google.com/p/openpivot/ and at http://code.google.com/p/crush-tools/, both of which seem to require aggregate calcs.
Recommended answer
I can do this in gawk, but not nawk.
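#!/usr/local/bin/gawk -f
# Needs gawk 4.0 or later: values[row][col] is a true array of arrays.
BEGIN {
    FS="|";
}
{
    # Record every row and column label seen, plus the cell value.
    rows[$1]=1; cols[$2]=1; values[$1][$2]=$3;
}
END {
    # Header line: column labels, with the leading "|" stripped by substr().
    for (col in cols) {
        output=output sprintf("|%s", col);
    }
    print substr(output, 2);
    # One output line per row; missing cells come out empty.
    for (row in rows) {
        output="";
        for (col in cols) {
            output=output sprintf("|%s", values[row][col]);
        }
        print substr(output, 2);
    }
}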
And it even works:
ghoti@pc $ cat data
1|A|foo
1|B|bar
1|C|I have a real
2|A|bad
2|C|hangover
ghoti@pc $ ./doit.gawk data
A|B|C
foo|bar|I have a real
bad||hangover
ghoti@pc $
I'm not sure how well this will work with 10000 rows, but I suspect that if you've got the memory for it, you'll be fine. I can't see how you can avoid loading things into memory except by storing things in separate files which you'd later join, which is pretty much a manual implementation of virtual memory.
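One special case where much less has to sit in memory: if the input happens to be grouped by Row (all lines for a row are adjacent) and the column names are known up front, each row can be printed as soon as the next one starts. The sketch below is only that, a sketch under those two assumptions, neither of which is stated in the question. The function name pivot_stream and its first argument (the "|"-separated column list) are made up for illustration, and the input is assumed to have no "Row|Col|Val" header line.

```shell
# Streaming pivot: holds only the current row's cells in memory.
# ASSUMPTIONS (not from the original post): lines are grouped by Row,
# the column list is passed in by the caller, and there is no header line.
pivot_stream() {
    cols=$1; shift
    awk -F'|' -v OFS='|' -v colnames="$cols" '
    function emit_row(   i, line) {
        # Join the cells of the current row in the requested column order.
        line = cell[names[1]]
        for (i = 2; i <= ncols; i++) line = line OFS cell[names[i]]
        print line
        split("", cell)              # portable way to empty the array
    }
    BEGIN {
        ncols = split(colnames, names, "|")
        header = names[1]
        for (i = 2; i <= ncols; i++) header = header OFS names[i]
        print header
    }
    NR > 1 && $1 != current { emit_row() }   # Row changed: flush previous row
    { current = $1; cell[$2] = $3 }
    END { if (NR > 0) emit_row() }           # flush the last row
    ' "$@"
}
```

It would be called as, for example, pivot_stream 'A|B|C|D|E|F|G' data. If the file is not already grouped, running it through sort on the first field beforehand would be one way to get there, at the cost of the sort itself.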
Update:
Per the comments:
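#!/usr/local/bin/gawk -f
# values[$1,$2] uses a single SUBSEP-joined key instead of an array of arrays.
BEGIN {
    FS="|";
}
{
    rows[$1]=1; cols[$2]=1; values[$1,$2]=$3;
}
END {
    # Header line keeps its leading "|" so it lines up with the row-label column.
    for (col in cols) {
        output=output sprintf("|%s", col);
    }
    print output;
    # Each output line starts with the row label, followed by its cells.
    for (row in rows) {
        output="";
        for (col in cols) {
            output=output "|" values[row,col];
        }
        print row output;
    }
}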
And the output:
ghoti@pc $ ./doit.awk data
|A|B|C
1|foo|bar|I have a real
2|bad||hangover
ghoti@pc $
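One caveat with both scripts: awk does not specify the order in which "for (x in array)" visits keys, so the A|B|C and 1, 2 ordering shown above is not guaranteed on larger inputs or other awk builds. A possible variant, sketched here with a made-up function name (pivot_ordered) and helper arrays that are not part of the answer, records rows and columns in first-seen order and loops over numeric indexes instead.

```shell
# Pivot with deterministic ordering: rows and columns appear in the
# order they are first seen in the input. Still loads everything into memory.
# pivot_ordered and the *_order / seen_* arrays are illustrative names.
pivot_ordered() {
    awk -F'|' '
    {
        if (!($1 in seen_row)) { seen_row[$1] = 1; row_order[++nrows] = $1 }
        if (!($2 in seen_col)) { seen_col[$2] = 1; col_order[++ncols] = $2 }
        values[$1, $2] = $3
    }
    END {
        # Header keeps a leading "|" so it lines up with the row-label column.
        header = ""
        for (c = 1; c <= ncols; c++) header = header "|" col_order[c]
        print header
        for (r = 1; r <= nrows; r++) {
            line = row_order[r]
            for (c = 1; c <= ncols; c++)
                line = line "|" values[row_order[r], col_order[c]]
            print line
        }
    }
    ' "$@"
}
```

Called as pivot_ordered data, this should produce the same layout as the updated script, with the ordering pinned to the input rather than to awk's internal hash order.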