Very slow loop using grep or fgrep on large datasets


Problem description



I'm trying to do something pretty simple: grep the files in a directory for exact string matches taken from a list:

#try grep each line from the files
for i in $(cat /data/datafile); do 
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done
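To see why this scales so badly, here is a minimal reproduction of the loop on tiny synthetic data (the file names and contents are invented for illustration; `grep -F` is used as the equivalent of `fgrep`). Each iteration forks a fresh grep process and rescans the whole directory, so with 20 million patterns that is 20 million process launches and 20 million full scans:

```shell
# Synthetic stand-ins for /data/datafile and /data/filestosearch
dir=$(mktemp -d)
mkdir -p "$dir/filestosearch"
printf 'needle1\nneedle2\n' > "$dir/datafile"
printf 'hay needle1 hay\nplain hay\n' > "$dir/filestosearch/a.txt"
printf 'needle2 here\n' > "$dir/filestosearch/b.txt"

# One grep process (and one full directory scan) per pattern
for i in $(cat "$dir/datafile"); do
    grep -F -r "$i" "$dir/filestosearch" >> "$dir/output.txt"
done

result=$(cat "$dir/output.txt")
echo "$result"
rm -rf "$dir"
```

With 2 patterns and 2 files this is instant; the cost is the per-pattern process launch and rescan, which is what multiplies out to years at the scale in the question.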

The file with the strings to grep for has 20 million lines, and the directory has ~600 files with a total of ~40 million lines. I can see that this is going to be slow, but we estimated it would take 7 years. Even if I use 300 cores on our HPC, splitting the job by files to search, it looks like it could take over a week.

There are similar questions:

Loop Running VERY Slow

Very slow foreach loop

and although they are on different platforms, I think something there might help me. Or fgrep, which is potentially faster (but it seems a bit slow as I'm testing it now). Can anyone see a faster way to do this? Thank you in advance.

Solution

sounds like the -f flag for grep would be suitable here:

-f FILE, --file=FILE
    Obtain  patterns  from  FILE,  one  per  line.   The  empty file
    contains zero patterns, and therefore matches nothing.   (-f  is
    specified by POSIX.)

So grep can already do what your loop is doing, and you can replace the entire loop with a single invocation:

grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt
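The single-invocation form can be sketched on the same kind of synthetic data as above (file names invented for illustration): one grep process reads all the patterns once, then makes one pass over the directory, instead of launching one process per pattern:

```shell
# Synthetic stand-ins for /data/datafile and /data/filestosearch
dir=$(mktemp -d)
mkdir -p "$dir/filestosearch"
printf 'needle1\nneedle2\n' > "$dir/datafile"
printf 'hay needle1 hay\nplain hay\n' > "$dir/filestosearch/a.txt"
printf 'needle2 here\n' > "$dir/filestosearch/b.txt"

# -F: fixed strings, -r: recurse, -f: read all patterns from a file
grep -F -r -f "$dir/datafile" "$dir/filestosearch" > "$dir/output.txt"

result=$(cat "$dir/output.txt")
echo "$result"
rm -rf "$dir"
```

The matched lines are the same as the loop produces; the difference is one process and one scan of the data rather than one of each per pattern.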

Now, I'm not sure about the performance with 20 million patterns, but at least you aren't starting 20 million processes this way, so it's probably significantly faster.

