Fastest way to read file searching for pattern matches


Question


I made a python script to analyze logs. I have one observation to share, and two questions to ask.

When I use gzip.open to open each file and go through every line, it takes around 200 seconds just to go through all the lines and files.

with gzip.open(file) as fp:
    for line in fp:
        pass

If using zcat and grep to do the work, it takes about 50 seconds.

temp = commands.getstatusoutput("zcat file* | grep pattern")

The performance difference is too huge to ignore. Is there a better way to reduce the gap?

I also noticed that the commands module has been made obsolete by the subprocess module, which always seems to create a temporary file. But that wouldn't be convenient: what if it is not possible to create a temporary file from where the Python script is running? Any suggestions?

Solution

'grep' contains decades' worth of optimizations, and re-implementing it in any programming language, not just Python, will be slower. *1

Therefore, if speed is important to you, your technique of calling 'grep' directly is probably the way to go. To do this using 'subprocess', without having to write any temporary files, use the 'subprocess.PIPE' mechanism:

from subprocess import Popen, PIPE

COMMAND = 'zcat file* | grep oldconfig'
process = Popen(COMMAND, shell=True, stderr=PIPE, stdout=PIPE)
output, errors = process.communicate()
assert process.returncode == 0, process.returncode
assert errors == b'', errors  # communicate() returns bytes by default
print('{} lines match'.format(len(output.splitlines())))

This works for me on Python 3.5. I've avoided using any of the higher-level interfaces recently added on top of subprocess, so it should also work fine on older versions of Python.
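For comparison, the same pipeline can be written with the higher-level subprocess.run (Python 3.5+). This is a minimal sketch: the printf command stands in for 'zcat file* | grep oldconfig' so the example runs without any log files present, and the pattern 'oldconfig' is just the one from the answer above.

```python
import subprocess

# Stand-in for "zcat file* | grep oldconfig": printf supplies three lines
# of input so the example is self-contained and needs no real .gz files.
result = subprocess.run(
    "printf 'oldconfig a\\nother\\noldconfig b\\n' | grep oldconfig",
    shell=True,                  # the shell handles the pipe (and globbing)
    stdout=subprocess.PIPE,      # capture matched lines instead of a temp file
    stderr=subprocess.PIPE,
)
result.check_returncode()        # raises CalledProcessError on failure
matches = result.stdout.splitlines()
print('{} lines match'.format(len(matches)))  # prints "2 lines match"
```

subprocess.run wraps the same Popen/communicate/PIPE machinery shown above, so nothing touches the filesystem here either; it just packages the return code, stdout, and stderr into one result object.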


(*1 For example, even with an empty 'for' loop, as you show in your question, grep is likely to still be faster, because it does not read the input line by line. Instead it determines the maximum number of characters it can seek forward through the file, ignoring newlines completely, reading one character after each seek and searching for characters that might match any part of the regex. Only if it finds a match does it then look at the characters surrounding that match, to see if the rest of the regex matches and appropriate newlines are present. On top of that, it dynamically generates code hard-coded to check for matches to the given regex, meaning it executes around 3 x86 instructions per input byte that it examines, and it skips examining most input bytes completely.)
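To make the skip-ahead idea concrete, here is a toy Boyer-Moore-Horspool search in Python. This is only an illustration of the principle the footnote describes, not what grep actually does: real grep handles full regexes and generates specialized matching code, whereas this sketch searches for a literal byte string.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1.

    Illustrates the skip-ahead idea: on a mismatch we jump forward by a
    precomputed shift instead of advancing one byte at a time, so most
    input bytes are never examined at all.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Bad-character table: how far the window may shift when a given
    # byte is aligned with the end of the pattern window.
    shift = {needle[k]: m - 1 - k for k in range(m - 1)}
    i = m - 1  # index of the byte aligned with the end of the window
    while i < n:
        k = 0
        while k < m and haystack[i - k] == needle[m - 1 - k]:
            k += 1  # compare backwards from the window end
        if k == m:
            return i - m + 1
        # Bytes absent from the pattern allow a full-length jump.
        i += shift.get(haystack[i], m)
    return -1
```

The pure-Python constant factor dwarfs any savings here, but the access pattern is the point: for a byte that cannot appear in the pattern at all, the search leaps the full pattern length without ever inspecting the bytes in between.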
