Python grep code much slower than command line's grep


Problem description


I'm just grepping some Xliff files for the pattern approved="no". I have a Shell script and a Python script, and the difference in performance is huge (for a set of 393 files, and a total of 3,686,329 lines, 0.1s user time for the Shell script, and 6.6s for the Python script).

Shell: grep 'approved="no"' FILE
Python:

import codecs
import re

def grep(pattern, file_path):
    ret = False

    with codecs.open(file_path, "r", encoding="utf-8") as f:
        while 1 and not ret:
            lines = f.readlines(100000)
            if not lines:
                break
            for line in lines:
                if re.search(pattern, line):
                    ret = True
                    break
    return ret

Any ideas to improve performance with a multiplatform solution?

Results

Here are a couple of results after applying some of the proposed solutions.
Tests were run on a RHEL6 Linux machine, with Python 2.6.6.
Working set: 393 Xliff files, 3,686,329 lines in total.
Numbers are user time in seconds.

grep_1 (io, joining 100,000 file lines): 50s
grep_3 (mmap): 0.7s
Shell version (Linux grep): 0.130s

Solution

Python, being an interpreted language, will always be slower than the compiled C implementation of grep.

Apart from that, your Python implementation is not the same as your grep example. It does not return the matching lines; it merely tests whether the pattern matches on any one line. A closer comparison would be:

grep -q 'approved="no"' FILE

which will return as soon as a match is found and not produce any output.
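
To make the difference concrete, here is a minimal sketch of the two behaviours in Python. The names grep_lines and grep_quiet are hypothetical and chosen only for illustration; neither appears in the question.

```python
import io
import re


def grep_lines(pattern, file_path):
    """Like plain grep: collect every matching line (hypothetical helper)."""
    regex = re.compile(pattern)
    with io.open(file_path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if regex.search(line)]


def grep_quiet(pattern, file_path):
    """Like grep -q: stop at the first match and report only True or False."""
    regex = re.compile(pattern)
    with io.open(file_path, "r", encoding="utf-8") as f:
        # any() stops consuming the file as soon as one line matches.
        return any(regex.search(line) for line in f)
```

The question's grep() corresponds to grep_quiet, so grep -q is the fairer baseline.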

You can substantially speed up your code by writing your grep() function more efficiently:

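import io
import re

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            # Read whole lines totalling roughly 100,000 characters per chunk.
            lines = f.readlines(100000)
            if not lines:
                return False
            # One search over the joined chunk instead of one per line.
            if re.search(pattern, ''.join(lines)):
                return True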

This uses io instead of codecs, which I found to be a little faster. The while loop condition does not need to check ret, and you can return from the function as soon as the result is known. There's no need to run re.search() for each individual line: just join the lines and perform a single search.
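
One caveat with joining the lines, not raised in the answer itself: a pattern anchored with ^ or $ behaves differently on the joined text unless re.MULTILINE is passed. A small sketch with made-up sample text:

```python
import re

lines = ["first line\n", "second line\n"]
joined = "".join(lines)

# Searching line by line, "^second" matches the second line.
per_line = any(re.search(r"^second", line) for line in lines)

# On the joined text, "^" only matches at the very start of the string ...
joined_plain = re.search(r"^second", joined)

# ... unless re.MULTILINE lets "^" match after every newline as well.
joined_multiline = re.search(r"^second", joined, re.MULTILINE)
```

For a literal pattern such as approved="no", the joined search and the per-line search give the same answer.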

At the cost of memory usage you could try this:

import io
import re

def grep_2(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, f.read())

If memory is an issue you could mmap the file and run the regex search on the mmap:

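import io
import mmap
import re

def grep_3(pattern, file_path):
    # mmap exposes the file's raw bytes, so open it in binary mode; on
    # Python 3 the pattern must then be a bytes object, e.g. b'zymurgies'.
    with io.open(file_path, "rb") as f:
        return re.search(pattern, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))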

mmap will efficiently read the data from the file in pages without consuming a lot of memory. Also, you'll probably find that mmap runs faster than the other solutions.
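
Two practical details are worth handling when using mmap this way: mapping an empty file raises ValueError, and the mapping is better released explicitly. The following is a sketch under those assumptions, not the code that was timed in this answer; the name grep_mmap is hypothetical, and the pattern is assumed to be a bytes object.

```python
import mmap
import os
import re


def grep_mmap(pattern, file_path):
    """Return True if the bytes pattern is found in the file (hypothetical helper)."""
    # An empty file cannot be memory-mapped (mmap raises ValueError), so it
    # is treated as "no match" here; this is a simplifying assumption.
    if os.path.getsize(file_path) == 0:
        return False
    with open(file_path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # The search runs on raw bytes, so no UTF-8 decoding takes place.
            return re.search(pattern, mm) is not None
        finally:
            mm.close()  # release the mapping explicitly
```

A call would look like grep_mmap(b'approved="no"', path). Because the search runs on undecoded bytes, a non-ASCII pattern has to be encoded the same way as the file.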


Using timeit for each of these functions shows that this is the case:

10 loops, best of 3: 639 msec per loop       # grep()
10 loops, best of 3: 78.7 msec per loop      # grep_1()
10 loops, best of 3: 19.4 msec per loop      # grep_2()
100 loops, best of 3: 5.32 msec per loop     # grep_3()

The file was /usr/share/dict/words, containing approximately 480,000 lines, and the search pattern was zymurgies, which occurs near the end of the file. For comparison, when the pattern is near the start of the file, e.g. abaciscus, the times are:

10 loops, best of 3: 62.6 msec per loop       # grep()
1000 loops, best of 3: 1.6 msec per loop      # grep_1()
100 loops, best of 3: 14.2 msec per loop      # grep_2()
10000 loops, best of 3: 37.2 usec per loop    # grep_3()

which again shows that the mmap version is fastest.
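
Timings like these can be reproduced with the timeit module. The sketch below is an assumption about how such a measurement might be set up, not the exact commands used for the figures above; the helper time_grep is hypothetical.

```python
import io
import re
import timeit


def grep_2(pattern, file_path):
    # Same approach as grep_2 above: read the whole file, search once.
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, f.read())


def time_grep(func, pattern, file_path, number=10, repeat=3):
    """Return the best per-call time in seconds (hypothetical helper)."""
    timer = timeit.Timer(lambda: func(pattern, file_path))
    # repeat() returns one total per run; the minimum is the least noisy.
    return min(timer.repeat(repeat=repeat, number=number)) / number
```

For example, time_grep(grep_2, "zymurgies", "/usr/share/dict/words") returns the best of three runs of ten calls each, which mirrors the "10 loops, best of 3" format shown above.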


Now comparing the grep command with the Python mmap version:

$ time grep -q zymurgies /usr/share/dict/words

real    0m0.010s
user    0m0.007s
sys     0m0.003s

$ time python x.py grep_3    # uses mmap

real    0m0.023s
user    0m0.019s
sys     0m0.004s

That is not too bad, considering the advantages that grep has.
