Fastest approach to search within file contents of a directory
Question
I have a directory that contains files for the users of a program of mine. There are around 70k JSON files in that directory.
The current search method uses glob and foreach. It's getting quite slow and hogging the server. Is there a good way to search through these files more efficiently? I'm running this on an Ubuntu 16.04 machine and I can use exec if needed.
Update:
These are JSON files, and each file needs to be opened to check whether it contains the search query. Looping over the files is quite fast, but opening each file takes quite a while.
These cannot be indexed using SQL or memcached, as I'm using memcached for some other things.
Recommended answer
As you implied yourself, to make this search as performant as possible, you need to hand the task over to a tool that is designed for this purpose.
I say, go beyond grep and see what's even better than ack. Also, see ag and then settle for ripgrep, as it's the best of its kind in town.
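As a rough illustration of handing the search to such a tool, here is a minimal sketch that lists the JSON files containing a literal string. The function name search_json is hypothetical, and the sketch assumes that either ripgrep (rg) or a grep supporting --include is installed. From PHP, a command like this could be run through exec and its output lines read back as the list of matching files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the JSON files under a directory that contain
# a literal string. Usage: search_json <directory> <query>
search_json() {
    local dir=$1 query=$2
    if command -v rg >/dev/null 2>&1; then
        # -l          print only the names of matching files
        # -F          treat the query as a literal string, not a regex
        # -g '*.json' restrict the search to JSON files
        # --no-ignore do not skip files listed in .gitignore and similar
        rg -l -F --no-ignore -g '*.json' -- "$query" "$dir"
    else
        # Fallback with equivalent grep options when rg is not installed.
        grep -r -l -F --include='*.json' -- "$query" "$dir"
    fi
}
```

Both branches exit with a non-zero status when nothing matches, so the caller can tell an empty result from a hit. The order of the printed file names is not guaranteed with rg, because it searches in parallel.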
I did a little experiment with ack on a low-spec laptop. I searched for an existing class name within 19,501 files. Here are the results:
$ cd ~/Dev/php/packages
$ ack -f | wc -l
19501
$ time ack PHPUnitSeleniumTestCase | wc -l
10
ack PHPUnitSeleniumTestCase 7.68s user 2.99s system 21% cpu 48.832 total
wc -l 0.00s user 0.00s system 0% cpu 48.822 total
I did the same experiment, this time with ag. And it really surprised me:
$ time ag PHPUnitSeleniumTestCase | wc -l
10
ag PHPUnitSeleniumTestCase 0.24s user 0.98s system 13% cpu 9.379 total
wc -l 0.00s user 0.00s system 0% cpu 9.378 total
I was so excited by the results that I went on and tried ripgrep as well. Even better:
$ time rg PHPUnitSeleniumTestCase | wc -l
10
rg PHPUnitSeleniumTestCase 0.44s user 0.27s system 19% cpu 3.559 total
wc -l 0.00s user 0.00s system 0% cpu 3.558 total
Experiment with this family of tools and see what best suits your needs.
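To repeat the comparison on your own data, something like the following sketch could help. The function name compare_tools is hypothetical; it runs the same literal-string search with whichever of grep, ack, ag and rg are installed and reports the number of matching files and the elapsed whole seconds, which is much coarser than the output of time shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the same literal-string search with every
# installed tool and report matching-file count and elapsed seconds.
# Usage: compare_tools <directory> <query>
compare_tools() {
    local dir=$1 query=$2 tool start end count
    for tool in grep ack ag rg; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: not installed"
            continue
        fi
        start=$(date +%s)
        # -l lists matching files; -F (grep, rg) and -Q (ack, ag) make
        # the query a literal string instead of a regular expression.
        case $tool in
            grep) count=$(grep -r -l -F -- "$query" "$dir" | wc -l) ;;
            ack)  count=$(ack -l -Q -- "$query" "$dir" | wc -l) ;;
            ag)   count=$(ag -l -Q -- "$query" "$dir" | wc -l) ;;
            rg)   count=$(rg -l -F -- "$query" "$dir" | wc -l) ;;
        esac
        end=$(date +%s)
        echo "$tool: $((count)) matching files in $((end - start))s"
    done
}
```

The first tool to run may pay for reading the files from disk, while later ones benefit from the operating system's file cache, so it is worth running the comparison more than once. The counts can also differ between tools, since ack, ag and rg each skip some files by default.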
P.S. ripgrep's original author has left a comment under this post, saying that ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}. Interesting read, fabulous work.