Very slow loop using grep or fgrep on large datasets


Problem description



I'm trying to do something pretty simple: grep the files in a directory for exact string matches taken from a list:

#try grep each line from the files
for i in $(cat /data/datafile); do 
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done
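To see why this scales so badly, here is a minimal reproduction of the loop on tiny synthetic data (the file names and contents are invented for illustration; `grep -F` is used as the equivalent of `fgrep`). Each iteration forks a fresh grep process and rescans the whole directory, so with 20 million patterns that is 20 million process launches and 20 million full scans:

```shell
# Synthetic stand-ins for /data/datafile and /data/filestosearch
dir=$(mktemp -d)
mkdir -p "$dir/filestosearch"
printf 'needle1\nneedle2\n' > "$dir/datafile"
printf 'hay needle1 hay\nplain hay\n' > "$dir/filestosearch/a.txt"
printf 'needle2 here\n' > "$dir/filestosearch/b.txt"

# One grep process (and one full directory scan) per pattern
for i in $(cat "$dir/datafile"); do
    grep -F -r "$i" "$dir/filestosearch" >> "$dir/output.txt"
done

result=$(cat "$dir/output.txt")
echo "$result"
rm -rf "$dir"
```

With 2 patterns and 2 files this is instant; the cost is the per-pattern process launch and rescan, which is what multiplies out to years at the scale in the question.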

The file with the strings to grep for has 20 million lines, and the directory has ~600 files with a total of ~40 million lines. I can see that this is going to be slow, but we estimated it would take 7 years. Even if I use 300 cores on our HPC, splitting the job by files to search, it looks like it could take over a week.

There are similar questions:

Loop Running VERY Slow

Very slow foreach loop

and although they are on different platforms, I think something there might help me. Or fgrep, which is potentially faster (but it seems a bit slow as I'm testing it now). Can anyone see a faster way to do this? Thank you in advance.

Solution

sounds like the -f flag for grep would be suitable here:

-f FILE, --file=FILE
    Obtain  patterns  from  FILE,  one  per  line.   The  empty file
    contains zero patterns, and therefore matches nothing.   (-f  is
    specified by POSIX.)

So grep can already do what your loop is doing, and you can replace the entire loop with a single invocation:

grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt
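The single-invocation form can be sketched on the same kind of synthetic data as above (file names invented for illustration): one grep process reads all the patterns once, then makes one pass over the directory, instead of launching one process per pattern:

```shell
# Synthetic stand-ins for /data/datafile and /data/filestosearch
dir=$(mktemp -d)
mkdir -p "$dir/filestosearch"
printf 'needle1\nneedle2\n' > "$dir/datafile"
printf 'hay needle1 hay\nplain hay\n' > "$dir/filestosearch/a.txt"
printf 'needle2 here\n' > "$dir/filestosearch/b.txt"

# -F: fixed strings, -r: recurse, -f: read all patterns from a file
grep -F -r -f "$dir/datafile" "$dir/filestosearch" > "$dir/output.txt"

result=$(cat "$dir/output.txt")
echo "$result"
rm -rf "$dir"
```

The matched lines are the same as the loop produces; the difference is one process and one scan of the data rather than one of each per pattern.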

Now, I'm not sure about the performance with 20 million patterns, but at least you aren't starting 20 million processes this way, so it's probably significantly faster.

