How do I code an arbitrarily long chain of pipes?

Problem description

I am somewhat new to the Linux environment. I looked all over for an answer to this -- apologies if this has been asked before.

I wrote an awk script that operates on a big text file (11 gigs, 40 columns, 48M rows). The script is called "cycle.awk". It replaces a column with a new version of it, and it requires the data to be sorted first by that particular column. In order to run the script on all the columns, I wrote a bash command like this:

cat input.csv |
    sort -k 22 -t "," | awk -v val=22 -f cycle.awk |
    sort -k 23 -t "," | awk -v val=23 -f cycle.awk |
    sort -k 24 -t "," | awk -v val=24 -f cycle.awk |
    sort -k 25 -t "," | awk -v val=25 -f cycle.awk |
    sort -k 26 -t "," | awk -v val=26 -f cycle.awk |
    sort -k 27 -t "," | awk -v val=27 -f cycle.awk |
    sort -k 28 -t "," | awk -v val=28 -f cycle.awk |
    sort -k 29 -t "," | awk -v val=29 -f cycle.awk |
    sort -k 30 -t "," | awk -v val=30 -f cycle.awk |
    sort -k 31 -t "," | awk -v val=31 -f cycle.awk |
    sort -k 32 -t "," | awk -v val=32 -f cycle.awk |
    sort -k 33 -t "," | awk -v val=33 -f cycle.awk |
    sort -k 34 -t "," | awk -v val=34 -f cycle.awk |
    sort -k 35 -t "," | awk -v val=35 -f cycle.awk |
    sort -k 36 -t "," | awk -v val=36 -f cycle.awk |
    sort -k 37 -t "," | awk -v val=37 -f cycle.awk |
    sort -k 38 -t "," | awk -v val=38 -f cycle.awk |
    sort -k 39 -t "," | awk -v val=39 -f cycle.awk |
    sort -k 40 -t "," | awk -v val=40 -f cycle.awk |
    sort -k 41 -t "," | awk -v val=41 -f cycle.awk > output.csv

I figure there must be a more elegant way to do this. How can I write a bash script that lets me pass the columns I want to apply my awk script to, and then runs this kind of piping procedure without producing any temporary data files? I am avoiding temporary files because the input file is so large and I am interested in optimal performance.

BTW, the script is as follows. It basically shortens the values of some columns for purposes of compressing the text file. Any pointers on how to tighten it up? This procedure takes about 10 hours to run.

BEGIN{ FS=","; OFS=","; count=1 }
NR == 1 { temp=$val }
{
    if ( temp != $val ) {
        temp=$val;
        count++;
    }
    $val=count
    print $0
}

Input typically looks something like this:

id,c1
1,abcd
2,efgh
3,abcd
4,abcd
5,efgh

The corresponding output would be:

id,c1
1,1
2,2
3,1
4,1
5,2

Technically, it would be sorted by c1 but that's not the point.

Recommended answer

The real Right Answer is to rewrite your process to not need this kind of pipeline. However, if you do want to set up such a pipeline, use a recursive function (that pipes to itself):

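# Sort stdin on the given column, then recode that column with cycle.awk.
process_column() {
  sort -k "$1" -t, | awk -v val="$1" -f cycle.awk
}

# Chain process_column over columns $1 through $2 (inclusive).
# Each recursive call becomes the next stage of the pipeline.
process_column_range() {
  local min_col=$1
  local max_col=$2
  if (( min_col < max_col )); then
    process_column "$min_col" \
     | process_column_range "$(( min_col + 1 ))" "$max_col"
  else
    # Last column: end the recursion so output goes to the caller's stdout.
    process_column "$min_col"
  fi
}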

...and then, to invoke (notice that no cat is needed):

process_column_range 22 41 <input.csv >output.csv
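
If the columns are not a contiguous range, the same recursion can take an explicit list of column numbers instead. The following is only a sketch: process_columns is a hypothetical name, and this version of process_column deliberately differs from the one above in two ways that are assumptions rather than part of the original answer. It uses -k "$1,$1" so that the sort key is limited to the one field (plain -k "$1" makes the key run from that field to the end of the line), and it sets LC_ALL=C so that sort compares raw bytes, which is usually faster and is enough for grouping equal values. Note also that GNU sort may spill to its own temporary files when the input is larger than its memory buffer, so the pipeline is not guaranteed to be free of temporary files.

```shell
#!/usr/bin/env bash
# Sketch only: process_columns is a hypothetical helper, not from the answer.

# Sort stdin on exactly one column, then recode that column with cycle.awk.
# Assumption: byte-wise ordering (LC_ALL=C) is acceptable for grouping values.
process_column() {
  LC_ALL=C sort -t, -k "$1,$1" | awk -v val="$1" -f cycle.awk
}

# Build one pipeline stage per argument, in the order given.
process_columns() {
  if (( $# == 0 )); then
    cat                      # no columns requested: pass the data through
    return
  fi
  local col=$1
  shift
  if (( $# > 0 )); then
    process_column "$col" | process_columns "$@"
  else
    process_column "$col"    # last column: end the recursion
  fi
}

# Usage (cycle.awk must be in the current directory):
#   process_columns 22 25 31 <input.csv >output.csv
```

Because each argument adds exactly one sort and one awk process, the pipeline is as long as the argument list and no longer.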

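As for rewriting the process so that it does not need this kind of pipeline, one possible direction is a single awk pass that keeps a lookup table per column and hands out a new number the first time each value is seen. This is a sketch under assumptions, not the asker's method: recode_columns is a hypothetical name, the first line is assumed to be a header and is passed through unchanged, and the numbers follow order of first appearance rather than sorted order. It needs no sort at all and keeps the rows in their original order, but it holds every distinct value of every recoded column in memory, so it only fits when the number of distinct values is manageable.

```shell
#!/usr/bin/env bash
# Sketch only: recode_columns is a hypothetical helper.
# Usage: recode_columns FIRST_COL LAST_COL <input.csv >output.csv

recode_columns() {
  awk -v first="$1" -v last="$2" '
    BEGIN { FS = OFS = "," }
    NR == 1 { print; next }          # assumption: line 1 is a header row
    {
      for (c = first; c <= last && c <= NF; c++) {
        # id[c, value] maps each distinct value of column c to a small
        # integer; n[c] counts how many ids column c has handed out.
        if (!((c, $c) in id))
          id[c, $c] = ++n[c]
        $c = id[c, $c]
      }
      print
    }
  '
}

# Demo with the sample input from the question:
printf '%s\n' 'id,c1' '1,abcd' '2,efgh' '3,abcd' '4,abcd' '5,efgh' |
  recode_columns 2 2
# prints:
#   id,c1
#   1,1
#   2,2
#   3,1
#   4,1
#   5,2
```

For the original job this would be invoked as recode_columns 22 41 <input.csv >output.csv, reading the file once instead of sorting it twenty times.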