Fast string search in a very large file

Views: 139. Published: 2020/4/23 10:40:56. Tags: linux bash grep

Problem description

What is the fastest way to find the lines in a file that contain any of a set of strings? I have a file containing the strings to search for. This small file (smallF) contains about 50,000 lines and looks like:

stringToSearch1
stringToSearch2
stringToSearch3

I have to search for all of these strings in a larger file (about 100 million lines). Any line in the larger file that contains one of the search strings should be printed.

The best method I have come up with so far is

grep -F -f smallF largeF

But this is not very fast. With just 100 search strings in smallF it takes about 4 minutes. For over 50,000 search strings it will take a lot of time.

Is there a more efficient method?

Solution

I once noticed that using -E or multiple -e parameters is faster than using -f. Note that this might not be applicable to your problem, as you are searching for 50,000 strings in a larger file. However, I wanted to show you what can be done and what might be worth testing:

Here is what I noticed in detail:

I have a 1.2 GB file filled with random strings.

>ls -has | grep string
1,2G strings.txt

>head strings.txt
Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
etrulbGONKT3pact1SHg2ipcCr7TZ9jc
.....
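
The answer does not show how strings.txt was created. A minimal sketch of one way to produce a similar file, assuming /dev/urandom and the standard tr, fold and head utilities are available. The file name strings_sample.txt and the line count are illustrative, and the count would have to be far larger to reach 1.2 GB:

```shell
# Sketch: build a small test file of random 32-character alphanumeric lines.
# LINES_WANTED and OUT are illustrative names, not taken from the answer.
LINES_WANTED=1000
OUT=strings_sample.txt

# tr keeps only alphanumeric bytes from the random stream, fold cuts the
# stream into 32-character lines, and head stops after LINES_WANTED lines.
LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | fold -w 32 | head -n "$LINES_WANTED" > "$OUT"

head -n 4 "$OUT"
```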

Now I want to search for strings "ab", "cd" and "ef" using different grep approaches:

Using grep without flags, searching for one string at a time:

grep "ab" strings.txt > m1.out
2,76s user 0,42s system 96% cpu 3,313 total

grep "cd" strings.txt >> m1.out
2,82s user 0,36s system 95% cpu 3,322 total

grep "ef" strings.txt >> m1.out
2,78s user 0,36s system 94% cpu 3,360 total

So in total the search takes nearly 10 seconds.

Using grep with the -f flag, with the search strings in search.txt:

>cat search.txt
 ab
 cd
 ef

>grep -F -f search.txt strings.txt > m2.out  
31,55s user 0,60s system 99% cpu 32,343 total

For some reason this takes nearly 32 seconds.
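
The answer does not explain the slowdown. One variable that is cheap to rule out is the locale: byte-oriented matching in the C locale is often faster than matching in a multibyte locale such as UTF-8. Whether this accounts for the 32 seconds measured here is an assumption and was not tested in the answer. A minimal sketch on tiny stand-in files (all file names are illustrative):

```shell
# Tiny stand-ins for search.txt and strings.txt; names are illustrative only.
printf '%s\n' ab cd ef > search_lc.txt
printf '%s\n' Mfzd0sab7RA6 Uk218A8GKRdA BmuCCPJefQGh > strings_lc.txt

# Same -F -f search, once in the current locale and once in the C locale.
grep -F -f search_lc.txt strings_lc.txt > m2_locale.out
LC_ALL=C grep -F -f search_lc.txt strings_lc.txt > m2_c.out

# For plain ASCII data both runs should select the same lines.
cmp -s m2_locale.out m2_c.out && echo "same matches in both locales"
```

On the real file, each of the two grep lines can be prefixed with time to see whether the locale makes a difference.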

Now using multiple search patterns, either as a single -E alternation or as repeated -e options:

grep -E "ab|cd|ef" strings.txt > m3.out  
3,80s user 0,36s system 98% cpu 4,220 total

grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null  
3,86s user 0,38s system 98% cpu 4,323 total

The third method, using -E, took only 4.22 seconds to search through the file.
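
To try the -E form with search strings that are stored in a file, the lines can be joined with | first. This is only a sketch: it assumes the strings contain no regular-expression metacharacters and that the joined pattern fits within the system's command-line length limit, which may not hold for 50,000 strings. The file names are illustrative:

```shell
# Tiny stand-ins for search.txt and strings.txt; names are illustrative only.
printf '%s\n' ab cd ef > search_join.txt
printf '%s\n' xxabxx yyyy zzefzz cdcd > strings_join.txt

# Join the pattern lines into a single alternation: ab|cd|ef
pattern=$(paste -s -d '|' search_join.txt)

# Same form as: grep -E "ab|cd|ef" strings.txt
grep -E "$pattern" strings_join.txt > m3_join.out
cat m3_join.out
```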

Now let's check if the results are the same:

cat m1.out | sort | uniq > m1.sort  
cat m3.out | sort | uniq > m3.sort
diff m1.sort m3.sort
#

The diff produces no output, which means the found results are the same.
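
The comparison above covers only m1.out and m3.out. A sketch of the same check extended to the -F -f output, using sort -u in place of sort | uniq and cmp -s for a silent comparison. It runs on tiny stand-in files whose names are illustrative:

```shell
# Tiny stand-ins for search.txt and strings.txt; names are illustrative only.
printf '%s\n' ab cd ef > search_cmp.txt
printf '%s\n' xxabxx yyyy zzefzz cdcd abcd > strings_cmp.txt

# Method 1: one string at a time, appending to the same output file.
grep "ab" strings_cmp.txt > m1_cmp.out
grep "cd" strings_cmp.txt >> m1_cmp.out
grep "ef" strings_cmp.txt >> m1_cmp.out
# Method 2: fixed strings read from a file.
grep -F -f search_cmp.txt strings_cmp.txt > m2_cmp.out
# Method 3: a single extended-regex alternation.
grep -E "ab|cd|ef" strings_cmp.txt > m3_cmp.out

# Sort and de-duplicate, since method 1 prints a line once per matching string.
for f in m1_cmp m2_cmp m3_cmp; do
  sort -u "$f.out" > "$f.sort"
done

if cmp -s m1_cmp.sort m2_cmp.sort && cmp -s m1_cmp.sort m3_cmp.sort; then
  echo "all three methods found the same lines"
else
  echo "results differ"
fi
```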

You may want to give it a try; otherwise I would advise you to look at the thread "Fastest possible grep" (see the comment from Cyrus).
