grep - how to output progress bar or status


Question

Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).

I know this is not trivial, because grep outputs the search results to STDOUT; my default workflow is to send the results to a file, and I would like the progress bar/status to go to STDOUT or STDERR.

Would this require modifying source code of grep?

The ideal command would be something like:

grep -e "STRING" --results="FILE.txt"

with progress like:

[curr file being searched], number x/total number of files

written to STDOUT or STDERR.

Answer

This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.

If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r option to recursively scan a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it explores the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this.)
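To make that cost concrete, here is a minimal sketch of counting the files up front so a progress denominator exists before the scan starts; it pays for one extra traversal of the tree, and the `count_files` helper name is invented here for illustration:

```shell
# Count regular files under a directory before scanning, so a progress
# report has a denominator. This walks the tree once extra.
count_files() {
  find "$1" -type f | wc -l
}

# usage: total=$(count_files .)
```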

In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports but they would also increase overhead since they would require additional grep process start-up, and the process start-up time can be more than grepping a small file. The progress report would be updated for each batch of files, so you would want to choose a batch size that gave you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat to get the filesize) would make the progress report more exact but add an additional cost to process startup.
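As a rough sketch of the size-based variant, batches could be flushed to grep once their cumulative size crosses a threshold. This assumes GNU stat's `-c %s` (BSD/macOS would need `stat -f %z`); the `grep_in_size_batches` helper name and the 10 MiB limit are illustrative choices, not anything grep itself provides:

```shell
# Sketch: feed grep batches of roughly equal *total size* rather than a
# fixed file count. Assumes GNU stat (-c %s prints size in bytes).
# grep_in_size_batches PATTERN OUTFILE FILE...
grep_in_size_batches() {
  local pattern=$1 out=$2; shift 2
  local -a batch=()
  local batch_bytes=0 seen=0 total=$# size f
  local limit=$((10 * 1024 * 1024))   # ~10 MiB per batch; tune as needed
  for f in "$@"; do
    [[ -f $f ]] || continue
    size=$(stat -c %s "$f")
    batch+=("$f"); batch_bytes=$((batch_bytes + size))
    if (( batch_bytes >= limit )); then
      grep -e "$pattern" "${batch[@]}" >>"$out"
      seen=$((seen + ${#batch[@]}))
      echo "$seen/$total" >&2        # progress after each flushed batch
      batch=(); batch_bytes=0
    fi
  done
  # flush the final partial batch
  if (( ${#batch[@]} )); then
    grep -e "$pattern" "${batch[@]}" >>"$out"
    echo "$total/$total" >&2
  fi
}
```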

One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.

In broad terms, here is a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize):

# Requires bash 4 and GNU grep; assumes $pattern holds the search string
shopt -s globstar
files=(**)
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
  echo "$i/$total" >&2
  grep -d skip -e "$pattern" "${files[@]:i:100}" >>results.txt
done

For simplicity, I use a globstar (**) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/ only matches directories.) Fortunately, GNU grep provides the -d skip option which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.

You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
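For example, a carriage return plus the ANSI erase-to-end-of-line sequence keeps the status on a single line instead of scrolling; `show_progress` is a hypothetical helper, not part of any standard tool:

```shell
# Redraw one status line in place: \r returns to column 0, and the
# ANSI sequence ESC[K erases any leftover characters from a longer
# previous message.
show_progress() {
  local cur=$1 total=$2 file=$3
  printf '\r\033[K%d/%d %s' "$cur" "$total" "$file" >&2
}

# usage inside the loop above:
#   show_progress "$i" "$total" "${files[i]}"
# and after the loop, move to a fresh line:
#   echo >&2
```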

The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:

find . -type f -print0 |
parallel -0 --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt

(Here -L 100 specifies that up to 100 files should be given to each grep instance, and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
