Very slow loop using grep or fgrep on large datasets
Problem description
I'm trying to do something pretty simple: grep for each string in a list, as an exact match, across the files in a directory:
#try grep each line from the files
for i in $(cat /data/datafile); do
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done
The file with the strings to grep for has 20 million lines, and the directory has ~600 files, with a total of ~40 million lines. I can see that this is going to be slow, but we estimated it will take 7 years. Even if I use 300 cores on our HPC, splitting the job by files to search, it looks like it could take over a week.
There are similar questions here, and although they are on different platforms, I think possibly if/else might help me, or fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now). Can anyone see a faster way to do this? Thank you in advance.
Sounds like the -f flag for grep would be suitable here:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
So grep can already do what your loop is doing, and you can replace the loop with:
grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt
Now I'm not sure about the performance with 20 million patterns, but at least you aren't starting 20 million processes this way, so it's probably significantly faster.
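To make the comparison concrete, here is a minimal sketch on toy data. It runs the per-pattern loop, the single grep -F -f call, and a chunked variant that uses split to keep each pattern list small, then checks that all three produce the same lines. The temporary directory, file names, file contents, and the chunk size of 2 are all made up for illustration. The chunked variant is my own assumption about how to bound memory if 20 million patterns prove too much for one process; it is not something the answer above proposes.

```shell
#!/bin/sh
# Toy comparison: per-pattern loop vs. one grep -F -f call vs. chunked patterns.
# Every path and file content here is hypothetical, for illustration only.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

mkdir "$work/filestosearch"
printf '%s\n' alpha beta gamma > "$work/datafile"
printf '%s\n' 'x alpha y' 'nothing here' > "$work/filestosearch/a.txt"
printf '%s\n' 'gamma ray' 'delta' 'beta test' > "$work/filestosearch/b.txt"

# Original approach: one grep process per pattern (slow at scale).
# "|| true" keeps going when a pattern matches nothing.
while IFS= read -r pattern; do
    grep -F -r -- "$pattern" "$work/filestosearch" || true
done < "$work/datafile" | sort > "$work/loop.txt"

# Suggested approach: one process, all patterns read from the file via -f.
grep -F -r -f "$work/datafile" "$work/filestosearch" | sort > "$work/single.txt"

# Assumed variant: split the pattern list into chunks (2 lines each here)
# and run one grep per chunk, so no single process holds every pattern.
split -l 2 "$work/datafile" "$work/chunk."
for chunk in "$work"/chunk.*; do
    grep -F -r -f "$chunk" "$work/filestosearch" || true
done | sort > "$work/chunked.txt"

# cmp exits non-zero (and set -e aborts) if the outputs differ.
cmp "$work/loop.txt" "$work/single.txt"
cmp "$work/loop.txt" "$work/chunked.txt"
echo "all three approaches produced the same lines"
```

One behavioural difference to keep in mind: the loop prints a line once for every pattern it contains, while a single grep -F -f prints each matching line once, so the outputs only coincide exactly when no line matches more than one pattern (as in this toy data). With chunking, a line matched by patterns from different chunks also appears more than once, so a final sort -u may be wanted.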