Fastest approach to search within file contents of a directory


Question

I have a directory that contains files for the users of a program of mine. There are around 70k JSON files in that directory.

The current search method uses glob and foreach. It is getting quite slow and hogging the server. Is there a good way to search through these files more efficiently? I'm running this on an Ubuntu 16.04 machine, and I can use exec if needed.

Update:

These are JSON files, and each one has to be opened to check whether it contains the search query. Looping over the files is quite fast, but opening each file takes quite a while.

These cannot be indexed using SQL or memcached, as I'm using memcached for some other things.

Recommended answer

As you implied yourself, to make this search as fast as possible, you need to hand the task over to a tool that is designed for this purpose.

I'd say go beyond grep and see what is even better than ack. Also look at ag, and then settle on ripgrep, as it is the best of its kind.
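As a rough sketch of what handing the search over could look like for the directory in the question, the function below lists the .json files whose contents include a literal query string. The function name `search_json_files` and its arguments are hypothetical, not part of the original answer. It assumes ripgrep is installed as `rg` and falls back to GNU grep when it is not.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the .json files under a directory
# whose contents include a literal query string.
# Usage: search_json_files QUERY DIR
search_json_files() {
    query=$1
    dir=$2
    if command -v rg >/dev/null 2>&1; then
        # -l: print only the names of matching files
        # -F: treat the query as a literal string, not a regex
        # -g: restrict the search to files matching the glob
        rg -l -F -g '*.json' -e "$query" "$dir"
    else
        # GNU grep fallback: -r recurses, --include filters by file name.
        grep -r -l -F --include='*.json' -e "$query" "$dir"
    fi
}
```

Both tools exit with a non-zero status when nothing matches, so an empty result is not necessarily an error. If a command like this is run from PHP through exec, the query would need to be quoted first, for example with escapeshellarg.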

I did a little experiment with ack on a low-spec laptop, searching for an existing class name across 19,501 files. Here are the results:

$ cd ~/Dev/php/packages
$ ack -f | wc -l 
19501

$ time ack PHPUnitSeleniumTestCase | wc -l
10
ack PHPUnitSeleniumTestCase  7.68s user 2.99s system 21% cpu 48.832 total
wc -l  0.00s user 0.00s system 0% cpu 48.822 total

I repeated the experiment with ag, and the result really surprised me:

$ time ag PHPUnitSeleniumTestCase | wc -l
10
ag PHPUnitSeleniumTestCase  0.24s user 0.98s system 13% cpu 9.379 total
wc -l  0.00s user 0.00s system 0% cpu 9.378 total

I was so excited by the results that I went on to try ripgrep as well. Even better:

$ time rg PHPUnitSeleniumTestCase | wc -l
10
rg PHPUnitSeleniumTestCase  0.44s user 0.27s system 19% cpu 3.559 total
wc -l  0.00s user 0.00s system 0% cpu 3.558 total

Experiment with this family of tools and see which one best suits your needs.
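One way to run that comparison on your own data is a small loop that tries each tool that happens to be installed. This is only a sketch: the function name `time_search_tools` is hypothetical, and it measures whole seconds with `date`, which is coarse but enough to see differences of the size shown above. Running it more than once is worthwhile, since files cached by the operating system after the first pass can favor whichever tool runs later.

```shell
#!/bin/sh
# Hypothetical helper: run the same search with each installed tool and
# report how many matching lines it found and roughly how long it took.
# Usage: time_search_tools QUERY DIR
time_search_tools() {
    query=$1
    dir=$2
    for tool in ack ag rg; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: not installed"
            continue
        fi
        start=$(date +%s)
        # Count the output lines; tr strips the padding some wc versions add.
        matches=$("$tool" "$query" "$dir" | wc -l | tr -d ' ')
        end=$(date +%s)
        echo "$tool: $matches matching lines in $((end - start))s"
    done
}
```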

P.S. ripgrep's original author has left a comment under this post, saying that ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}. Interesting read, fabulous work.
