Is it possible to parallelize awk writing to multiple files through GNU parallel?


Problem description


I am running an awk script which I want to parallelize through GNU parallel.


This script demultiplexes one input file into multiple output files, depending on a value in each line. The code is the following:

#!/usr/bin/awk -f

BEGIN{ FS=OFS="\t" }
{
    # bc is the field that defines to which file the line
    # will be written
    bc = $1
    # append line to such file
    print >> (bc".txt")
}
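
For reference, this is how the script runs serially (assuming it is saved as script.awk; input.txt is a placeholder name for the input file):

chmod +x script.awk        # the shebang makes it directly executable
./script.awk input.txt
# equivalent invocation without relying on the shebang:
awk -f script.awk input.txt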


I want to parallelize it using GNU parallel through the following:

parallel --line-buffer --block 1G --pipe 'awk -f script.awk'


However, I am afraid of possible race conditions in which two awk processes write to the same file at the same time. Is this possible, and if so, how can I avoid it without compromising parallelization?


NB. I included the --line-buffer option, although I am not sure whether it also applies to file redirection within the awk script. Does it apply in this case too, or only to the stdout of each awk process?

# Input file
bc1    line1
bc3    line2
bc1    line3
bc2    line4


# Output file bc1.txt
bc1    line1
bc1    line3

# Output file bc2.txt
bc2    line4

# Output file bc3.txt
bc3    line2

Recommended answer


You can do it by demultiplexing the output into different dirs:

# {%} is the job-slot number, so each concurrent job
# writes into its own directory
stuff |
  parallel --block 10M --pipe --round-robin \
    'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk'
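
With the four-line example input above and, say, two job slots, the per-slot results might look like this (a hypothetical layout; which lines end up in which slot depends on how parallel splits the blocks):

dir-1/bc1.txt   dir-1/bc3.txt
dir-2/bc1.txt   dir-2/bc2.txt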


Or, if the input is a file, you can use --pipepart, which is faster because each job reads its part of the file directly instead of receiving it through a single pipe:

# --block -1: split bigfile into one block per jobslot
parallel --block -1 --pipepart -a bigfile \
  'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk'


Then there is no race condition. Finish up by merging the dirs:

# list every distinct file name occurring in any dir-*/, then
# concatenate each name's per-slot pieces into one final file
parallel 'cd {}; ls' ::: dir-* | sort -u |
  parallel 'cat */{} > {}'
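
The two merge commands are equivalent to this serial sketch (shown only to make the logic explicit; it assumes GNU find for -printf):

# collect the distinct file names across all dir-*/ directories,
# then concatenate each name's per-slot pieces into one final file
find dir-* -type f -printf '%f\n' | sort -u |
  while read -r f; do
    cat dir-*/"$f" > "$f"
  done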


If merging is not acceptable (maybe you do not have the disk space for 2 copies of the data), you can use fifos. But to do that you need to know the names of all the .txt files in advance, and you need a system that can run one process per name in parallel (10000 names = 10000 processes):

# Generate names-of-files.txt somehow
# Make one fifo per name in every slot dir (the same dir-N
# directories the demultiplexer below writes into)
parallel 'mkdir -p dir-{2}; mkfifo dir-{2}/{1}' :::: \
  names-of-files.txt <(seq $(parallel --number-of-threads) )
# Run the demultiplexer in the background; it now appends
# to the fifos instead of to regular files
parallel --block -1 --pipepart -a bigfile \
  'mkdir -p dir-{%}; cd dir-{%}; awk -f ../script.awk' &
# Start one reader process per name
# If you have more than 32000 names, you will need to increase
# the number of processes allowed on your system.
cat names-of-files.txt |
  parallel -j0 --pipe -N250 -I ,, parallel -j0 'parcat */{} > {}'
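
The first comment leaves open how names-of-files.txt is generated; one possibility, assuming (as in the script above) that field 1 plus ".txt" determines each output name, is:

# list every output file name with one pass over the tab-separated input
cut -f1 bigfile | sort -u | sed 's/$/.txt/' > names-of-files.txt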
