Command line pivot


Question

I've been hunting around for the past few days for a set of command line tools, or a perl or awk script, that would let me very quickly transpose the following data:

Row|Col|Val
1|A|foo
1|B|bar
1|C|I have a real
2|A|bad
2|C|hangover

into this:

A|B|C
foo|bar|I have a real
bad||hangover

Note that there is only one value in the dataset for each "cell" (i.e., as with a spreadsheet, there aren't any duplicates of Row "1" Col "A").

I've tried various awk and shell implementations for transposing data, but can't seem to get them working. One idea I had was to cut each "Col" value into a separate file, then use the "join" command to put them back together by "Row" -- but there MUST be an easier way. I'm sure this is incredibly simple to do, but I'm struggling a bit.

My input files have Cols A through G (mostly variable-length strings) and 10,000 Rows. If I can avoid loading everything into memory, that would be a huge plus.

Beer-by-mail for anyone who's got the answer!

As always - many thanks in advance for your help.

Cheers,

Josh

P.S. - I'm a bit surprised that there isn't an out-of-the-box command line util for doing this very basic type of pivot/transposition operation. I looked at http://code.google.com/p/openpivot/ and at http://code.google.com/p/crush-tools/, both of which seem to require aggregate calcs.

Recommended answer

I can do this in gawk, but not nawk.

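#!/usr/local/bin/gawk -f

# Pivot Row|Col|Val records into one output line per row.
# Needs gawk 4.0 or later for true arrays of arrays (values[row][col]).

BEGIN {
  FS="|";
}

{
  # Remember every row and column key seen, and store the cell value.
  rows[$1]=1; cols[$2]=1; values[$1][$2]=$3;
}

END {
  # Header line of column names; substr() drops the leading "|".
  # Note: "for (key in array)" visits keys in an unspecified order.
  for (col in cols) {
    output=output sprintf("|%s", col);
  }
  print substr(output, 2);
  # One line per row; cells with no value come out empty.
  for (row in rows) {
    output="";
    for (col in cols) {
      output=output sprintf("|%s", values[row][col]);
    }
    print substr(output, 2);
  }
}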

And it even works:

ghoti@pc $ cat data
1|A|foo
1|B|bar
1|C|I have a real
2|A|bad
2|C|hangover
ghoti@pc $ ./doit.gawk data
A|B|C
foo|bar|I have a real
bad||hangover
ghoti@pc $ 

I'm not sure how well this will work with 10000 rows, but I suspect that if you've got the memory for it, you'll be fine. I can't see how you can avoid loading things into memory except by storing things in separate files which you'd later join, which is pretty much a manual implementation of virtual memory.
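One case where holding everything in memory can be avoided is when the input is already grouped by Row and the column names are known up front. The following is only a sketch under those two assumptions, not part of the original answer; the function name pivot_sorted and the default column list A through G are made up for illustration. It buffers a single row at a time and prints it as soon as the Row key changes.

```shell
# Streaming pivot sketch (hypothetical helper, not from the original answer).
# Assumptions: all lines for one Row are adjacent in the input, and the
# column names are passed in as a "|"-separated list (default: A..G).
pivot_sorted() {
  awk -F'|' -v collist="${1:-A|B|C|D|E|F|G}" '
    # Print the buffered row, then clear the buffer.
    function flush(   i, line) {
      line = cur
      for (i = 1; i <= n; i++) line = line "|" vals[cols[i]]
      print line
      split("", vals)              # portable way to empty an array
    }
    BEGIN {
      n = split(collist, cols, "|")
      print "Row|" collist         # header line
    }
    NR == 1 && $1 == "Row" { next }                 # skip an input header, if any
    seen && (($1 "") != cur) { flush() }            # Row key changed: emit previous row
    { cur = $1 ""; seen = 1; vals[$2] = $3 }        # buffer the current cell
    END { if (seen) flush() }                       # emit the last row
  '
}
```

It reads standard input, for example: pivot_sorted 'A|B|C' < data. If the input is not grouped by Row, sorting it on the first field beforehand (for instance with sort -t'|' -k1,1) should put it in the required shape, since sort can spill to temporary files rather than keeping the whole input in memory.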

UPDATE:

Per comments:

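#!/usr/local/bin/gawk -f

# Same pivot, but keeps the row label as the first output field and uses
# the classic values[row,col] subscript instead of arrays of arrays.

BEGIN {
  FS="|";
}

{
  # Remember every row and column key seen, and store the cell value.
  rows[$1]=1; cols[$2]=1; values[$1,$2]=$3;
}

END {
  # Header line: the leading "|" leaves an empty label above the row column.
  # Note: "for (key in array)" visits keys in an unspecified order.
  for (col in cols) {
    output=output sprintf("|%s", col);
  }
  print output;
  # One line per row: the row label, then one cell per column.
  for (row in rows) {
    output="";
    for (col in cols) {
      output=output "|" values[row,col];
    }
    print row output;
  }
}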

And the output:

ghoti@pc $ ./doit.awk data
|A|B|C
1|foo|bar|I have a real
2|bad||hangover
ghoti@pc $ 
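One caveat about the scripts above: awk does not guarantee the order in which "for (key in array)" visits keys, so the rows and columns may not always come out as A, B, C and 1, 2. A possible variant, sketched here in portable awk and wrapped in a shell function, records each row and column key the first time it is seen and prints them in that order. The name pivot_ordered and the header-skipping rule are assumptions added for illustration, not part of the original answer.

```shell
# In-memory pivot with first-seen ordering (hypothetical helper).
# Like the answer's script, it still holds every cell in memory.
pivot_ordered() {
  awk -F'|' '
    NR == 1 && $1 == "Row" { next }                        # skip a header line, if any
    !($1 in rowseen) { rowseen[$1] = 1; rows[++nr] = $1 }  # remember row order
    !($2 in colseen) { colseen[$2] = 1; cols[++nc] = $2 }  # remember column order
    { values[$1, $2] = $3 }
    END {
      line = ""
      for (c = 1; c <= nc; c++) line = line "|" cols[c]
      print line                                           # header: |A|B|C
      for (r = 1; r <= nr; r++) {
        line = rows[r]
        for (c = 1; c <= nc; c++) line = line "|" values[rows[r], cols[c]]
        print line
      }
    }
  ' "$@"
}
```

Called as pivot_ordered data, this should print the same three lines as the run above for the sample file. Columns appear in the order they are first encountered, so a column that first shows up in a later row lands to the right of the earlier ones.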
