Capture derivative of an intermediate pipeline result while running rest of pipeline as well
Problem description
I have a bunch of *.bz2 files containing some compressed text data. I decompress them into a single file intermediary.txt to count the occurrences of the substring myString:
find . -name '*bz2' -exec bzip2 -k -c -d {} \; > intermediary.txt
and then (to count the number of occurrences of myString):
echo "Number of occurrences:"
grep -o "myString" intermediary.txt | wc -w
Processing then continues by some stream manipulations:
cat intermediary.txt | sed ... | sed ... | someCommand > out.txt
I now want to process all the steps in one pipeline, i.e. have the result in out.txt and still have the number of occurrences of myString on stdout, without having to write intermediary.txt. So the pipeline should look something like this:
find . -name '*bz2' -exec bzip2 -k -c -d {} \; | <some magic here> | sed ... | sed ... | someCommand > out.txt
(How) is this possible?
UPDATE
I tried out @Charles Duffy's version below, but modified the bzip2 part a bit to use bzcat instead. I think it's a bit less verbose and it should not affect performance (not sure though).
This gets the job done. However, it would be nice now to include this pipeline in Pipeline Viewer to get some feedback about the progress (there are a lot of *.bz2 files!). Prefixing the whole thing with pv -cN source < ... does not work. I posted a separate question for this here
Recommended answer
It is less complicated to keep the value you're capturing as the stdout of the end of the pipeline, and to use a process substitution for the other outputs:
result=$(find . -name '*bz2' -exec bzip2 -k -c -d {} + \
| tee >(sed ... | sed ... | someCommand >out.txt) \
| grep -o -e myString \
| wc -l)
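To make the tee-plus-process-substitution idea concrete, here is a minimal self-contained sketch in bash. It is not the original pipeline: printf stands in for the find/bzip2 stage, a single sed expression stands in for the elided "sed ... | sed ... | someCommand" steps, and the temporary directory and variable names are made up for illustration. It uses grep -o so that every occurrence is counted, not just every matching line.

```shell
#!/usr/bin/env bash
# Sketch only: printf replaces the decompression stage, and the sed
# expression is a hypothetical stand-in for the real stream manipulations.
workdir=$(mktemp -d)
out="$workdir/out.txt"

count=$(printf '%s\n' 'myString and myString' 'nothing here' 'one more myString' \
  | tee >(sed 's/myString/X/g' >"$out") \
  | grep -o -e myString \
  | wc -l)

# Some wc implementations pad the number with spaces; arithmetic
# expansion normalises it to a plain integer.
count=$((count))

echo "Number of occurrences: $count"
```

The side branch inside >( ... ) runs asynchronously, so out.txt may still be in the process of being written for a brief moment after the count has been captured. A script that reads out.txt right away may need to account for that.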
Note the use of -exec ... {} +, which is significantly more efficient than the find operation you were using before (which ran a separate copy of bzip2 for each file found).
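The difference between the two -exec forms can be made visible with a small sketch in which echo stands in for bzip2. The file names and the temporary directory are made up for illustration. Each echo call prints exactly one line, so counting lines counts invocations.

```shell
#!/usr/bin/env bash
# Sketch only: echo replaces bzip2 so the number of invocations shows up
# as the number of output lines. The three empty files are placeholders.
workdir=$(mktemp -d)
touch "$workdir/a.bz2" "$workdir/b.bz2" "$workdir/c.bz2"

# "\;" runs the command once per matching file: three echo calls.
per_file=$(find "$workdir" -name '*bz2' -exec echo {} \; | wc -l)

# "+" appends as many file names as fit onto one command line: one echo call here.
batched=$(find "$workdir" -name '*bz2' -exec echo {} + | wc -l)

echo "invocations with \\;: $((per_file)), with +: $((batched))"
```

With only three short file names, the + form fits everything into a single invocation. For very large file sets, find splits the list across several invocations as needed, still far fewer than one per file.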