Capture derivative of an intermediate pipeline result while running rest of pipeline as well


Question

I have a bunch of *.bz2 files containing some compressed text data. I decompress them into a single file intermediary.txt to count the occurrences of the substring myString:

find . -name '*bz2' -exec bzip2 -k -c -d {} \; > intermediary.txt

and then (to count the number of occurrences of myString)

echo "Number of occurrences:"
grep -o "myString" intermediary.txt | wc -w
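The count relies on grep -o printing every match on its own line, so that counting the output lines (or words, as long as myString contains no whitespace) gives the number of occurrences rather than the number of matching lines. A minimal sketch of the difference, using made-up sample input in place of intermediary.txt:

```shell
# Hypothetical sample input: 3 occurrences of myString spread over 2 lines.
sample='myString and myString
no match on this line
myString'

# grep -o emits one line per match, so wc -l counts occurrences.
occurrences=$(printf '%s\n' "$sample" | grep -o "myString" | wc -l | tr -d ' ')

# grep -c counts matching lines, which undercounts when a line has several matches.
matching_lines=$(printf '%s\n' "$sample" | grep -c "myString")

echo "occurrences: $occurrences, matching lines: $matching_lines"
```

The tr -d ' ' only strips the padding that some wc implementations put in front of the number.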

Processing then continues by some stream manipulations:

cat intermediary.txt | sed ... | sed ... | someCommand > out.txt

I now want to process all the steps in one pipeline, i.e. have the result in out.txt and still have the number of occurrences of myString on stdout without having to write intermediary.txt. So the pipeline should look something like this:

find . -name '*bz2' -exec bzip2 -k -c -d {} \; | <some magic here> | sed ... | sed ... | someCommand > out.txt

Is this possible, and if so, how?

UPDATE I tried out @Charles Duffy's version below, but modified the bzip2-part a bit to use bzcat instead. I think it's a bit less verbose and it should not affect performance (not sure though).

This gets the job done. However, it would be nice to monitor this pipeline with Pipe Viewer (pv) to get some feedback about the progress (there are a lot of *.bz2 files!). Prefixing the whole thing with pv -cN source < ... does not work. I posted a separate question for this here

Recommended answer

It is less complicated to keep what you're capturing as the stdout of the end of the pipeline, and to use a process substitution for the other output:

result=$(find . -name '*bz2' -exec bzip2 -k -c -d {} + \
          | tee >(sed ... | sed ... | someCommand >out.txt) \
          | grep -o -e myString \
          | wc -l)
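Process substitution is not available in plain POSIX sh, and in bash the substituted process runs asynchronously, so out.txt may still be being written when the command substitution returns. As a swapped-in alternative (not part of the answer itself), the same split can be sketched with a named pipe, which also lets the script wait for the side branch. Everything here is illustrative: printf stands in for the decompressed stream, tr stands in for the sed ... | someCommand chain, and the variable and file names are made up.

```shell
# Scratch directory holding the named pipe and the side branch's output.
workdir=$(mktemp -d)
fifo="$workdir/side.fifo"
out="$workdir/out.txt"
mkfifo "$fifo"

# Side branch: reads a copy of the stream from the FIFO in the background.
# tr is only a stand-in for the real sed ... | someCommand processing.
tr '[:lower:]' '[:upper:]' < "$fifo" > "$out" &
side_pid=$!

# Main branch: tee copies the stream into the FIFO and passes it on to the counter.
count=$(printf 'myString one\nnothing here\nmyString myString\n' \
          | tee "$fifo" \
          | grep -o -e myString \
          | wc -l | tr -d ' ')

# Block until the side branch has finished writing its output file.
wait "$side_pid"

echo "Number of occurrences: $count"
first_line=$(head -n 1 "$out")
rm -rf "$workdir"
```

The explicit wait is the main design difference: with tee >(...) there is no equally simple handle on the side process from inside a pipeline.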

Note the use of -exec ... {} +, which passes as many file names as possible to each bzip2 invocation. This is significantly more efficient than the find operation you were using before, which ran a separate copy of bzip2 for each file found.
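The difference between the two -exec forms can be observed by counting how often the command is started. This is a small sketch with three empty, made-up files in a scratch directory, where sh -c 'echo invoked' stands in for bzip2:

```shell
# Scratch directory with three hypothetical files for find to match.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt" "$demo_dir/c.txt"

# With \; the command is started once per matched file.
per_file=$(find "$demo_dir" -name '*.txt' -exec sh -c 'echo invoked' sh {} \; | grep -c invoked)

# With + the file names are batched into as few invocations as possible.
batched=$(find "$demo_dir" -name '*.txt' -exec sh -c 'echo invoked' sh {} + | grep -c invoked)

echo "invocations with \\;: $per_file, with +: $batched"
rm -rf "$demo_dir"
```

With only a handful of files the + form needs a single invocation; with very many files find splits them into several batches to stay within the argument-length limit.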
