Python grep code much slower than command line's grep


Question

I'm just grepping some Xliff files for the pattern approved="no". I have a Shell script and a Python script, and the difference in performance is huge (for a set of 393 files, and a total of 3,686,329 lines, 0.1s user time for the Shell script, and 6.6s for the Python script).

Shell: grep 'approved="no"' FILE
Python:

import codecs
import re

def grep(pattern, file_path):
    ret = False

    with codecs.open(file_path, "r", encoding="utf-8") as f:
        # Loop until readlines() returns nothing or a match has been found.
        while 1 and not ret:
            lines = f.readlines(100000)
            if not lines:
                break
            for line in lines:
                if re.search(pattern, line):
                    ret = True
                    break
    return ret

Any ideas to improve performance with a multiplatform solution?

Results

Here are a couple of results after applying some of the proposed solutions.
Tests were run on a RHEL6 Linux machine, with Python 2.6.6.
Working set: 393 Xliff files, 3,686,329 lines in total.
Numbers are user time in seconds.

grep_1 (io, joining 100,000 file lines): 50s
grep_3 (mmap): 0.7s
Shell version (Linux grep): 0.130s

Solution

Python is an interpreted language, so a Python implementation will generally be slower than grep, which is compiled C.

Apart from that, your Python implementation is not the same as your grep example. It does not return the matching lines; it merely tests whether the pattern matches anywhere on any line. A closer comparison would be:

grep -q 'approved="no"' FILE

which will return as soon as a match is found and not produce any output.
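To make the distinction concrete, here is a minimal sketch of the two behaviours in Python. The names matching_lines and has_match are hypothetical and do not come from the question or the answer; the first behaves roughly like plain grep (it produces every matching line), the second roughly like grep -q (it stops at the first match and produces nothing but a yes/no result).

```python
import io
import re


def matching_lines(pattern, file_path):
    """Yield every line that matches, similar in spirit to plain `grep`."""
    regex = re.compile(pattern)
    with io.open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            if regex.search(line):
                yield line


def has_match(pattern, file_path):
    """Return True at the first match, similar in spirit to `grep -q`."""
    # any() stops consuming the generator as soon as one line is yielded.
    return any(True for _ in matching_lines(pattern, file_path))
```

The function in the question corresponds to has_match, which is why grep -q is the fairer comparison.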

You can substantially speed up your code by writing your grep() function more efficiently:

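import io
import re

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            # Read a batch of whole lines (about 100,000 characters in total).
            lines = f.readlines(100000)
            if not lines:
                return False
            # Search the whole batch at once instead of line by line.
            if re.search(pattern, ''.join(lines)):
                return True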

This uses io instead of codecs, which I found was a little faster. The while loop condition does not need to check ret, and you can return from the function as soon as the result is known. There's no need to run re.search() for each individual line: just join the lines and perform a single search.
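The pattern in the question, approved="no", is a fixed string rather than a real regular expression, so a plain substring test may avoid the regex engine altogether. The following is only a sketch under that assumption: grep_literal and its hint parameter are hypothetical names, it has not been benchmarked against the functions here, and it assumes the search string contains no newline (otherwise a match could straddle two batches and be missed).

```python
import io


def grep_literal(needle, file_path, hint=100000):
    """Return True if the literal string `needle` occurs in the file."""
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            # Read a batch of whole lines, as grep_1 does.
            lines = f.readlines(hint)
            if not lines:
                return False
            # Substring test on the joined batch; no regex involved.
            if needle in "".join(lines):
                return True
```

A call such as grep_literal('approved="no"', path) would then take the place of the regex-based call.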

At the cost of memory usage you could try this:

import io
import re

def grep_2(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, f.read())

If memory is an issue you could mmap the file and run the regex search on the mmap:

import io
import mmap
import re

def grep_3(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

mmap will efficiently read the data from the file in pages without consuming a lot of memory. Also, you'll probably find that mmap runs faster than the other solutions.
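The question's tests ran on Python 2.6.6. On Python 3, re raises a TypeError when a str pattern is applied to a bytes-like object such as an mmap, so a port would probably open the file in binary mode and search with a bytes pattern. The sketch below shows one way to do that; grep_mmap is a hypothetical name, and encoding the pattern is only safe under the assumption that it is simple text (such as approved="no") whose encoded form means the same thing when matched against the raw bytes.

```python
import mmap
import re


def grep_mmap(pattern, file_path, encoding="utf-8"):
    """Return True if `pattern` matches anywhere in the file's raw bytes."""
    # Assumption: the pattern is simple enough (e.g. ASCII text) that matching
    # its encoded form against the undecoded bytes gives the intended result.
    regex = re.compile(pattern.encode(encoding))
    with open(file_path, "rb") as f:
        try:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            # An empty file cannot be memory-mapped; treat it as "no match".
            return False
        with mm:
            return regex.search(mm) is not None
```

Because the search runs on undecoded bytes, this version also skips the UTF-8 decoding step that the text-mode functions perform.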


Using timeit for each of these functions shows that this is the case:

10 loops, best of 3: 639 msec per loop       # grep()
10 loops, best of 3: 78.7 msec per loop      # grep_1()
10 loops, best of 3: 19.4 msec per loop      # grep_2()
100 loops, best of 3: 5.32 msec per loop     # grep_3()

The file was /usr/share/dict/words, containing approximately 480,000 lines, and the search pattern was zymurgies, which occurs near the end of the file. For comparison, when the pattern is near the start of the file, e.g. abaciscus, the times are:

10 loops, best of 3: 62.6 msec per loop       # grep()
1000 loops, best of 3: 1.6 msec per loop      # grep_1()
100 loops, best of 3: 14.2 msec per loop      # grep_2()
10000 loops, best of 3: 37.2 usec per loop    # grep_3()

which again shows that the mmap version is fastest.
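The answer does not show how timeit was invoked. One possible way to collect comparable numbers from inside Python is sketched below; best_time is a hypothetical helper, and its defaults (10 loops, best of 3) merely mirror the wording of the output above.

```python
import timeit


def best_time(func, args, number=10, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs."""
    # Each entry in `totals` is the total time for `number` calls.
    totals = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(totals) / number


# Example (assumes grep_3 from above is defined and the word list exists):
# print(best_time(grep_3, ("zymurgies", "/usr/share/dict/words")))
```

Taking the minimum rather than the mean follows the usual timeit convention, since slower runs mostly reflect interference from other processes.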


Now comparing the grep command with the Python mmap version:

$ time grep -q zymurgies /usr/share/dict/words

real    0m0.010s
user    0m0.007s
sys     0m0.003s

$ time python x.py grep_3    # uses mmap

real    0m0.023s
user    0m0.019s
sys     0m0.004s

That is not too bad considering the advantages that grep has.
