Text processing - Python vs Perl performance


Question

Here are my Perl and Python scripts for some simple text processing of about 21 log files, each about 300 KB to 1 MB (maximum), repeated 5 times (a total of 125 files, because each log is repeated 5 times).

Python code (modified to use compiled regexes and re.I)

#!/usr/bin/python

import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for line in fileinput.input():
    fn = fileinput.filename()
    currline = line.rstrip()

    mprev = exists_re.search(currline)

    if(mprev):
        xlogtime = mprev.group(1)

    mcurr = location_re.search(currline)

    if(mcurr):
        print fn, xlogtime, mcurr.group(1)
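
The script above uses the Python 2 print statement. As a minimal sketch, the same logic could be written for Python 3 roughly as follows. The function name `summarize` and its `filename` parameter are hypothetical, introduced only so the logic can be run without log files. Starting `xlogtime` as an empty string is an assumption that mirrors what the Perl script prints before the first "already exists" line.

```python
import re

# Same two patterns as the original script, compiled once at import time.
exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)


def summarize(lines, filename):
    """Yield 'filename logtime location' for every AwbLocation line.

    `summarize` and `filename` are hypothetical names, not part of the
    original script.
    """
    # Assumption: empty until the first "already exists" line is seen,
    # which is what the Perl version effectively prints in that case.
    xlogtime = ''
    for line in lines:
        currline = line.rstrip()

        mprev = exists_re.search(currline)
        if mprev:
            xlogtime = mprev.group(1)

        mcurr = location_re.search(currline)
        if mcurr:
            yield f'{filename} {xlogtime} {mcurr.group(1)}'
```

To process real files, each path could be opened in turn and the file object passed as `lines`, with the path as `filename`.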

Perl code

#!/usr/bin/perl

while (<>) {
    chomp;

    if (m/^(.*?) INFO.*Such a record already exists/i) {
        $xlogtime = $1;
    }

    if (m/^AwbLocation (.*?) insert into/i) {
        print "$ARGV $xlogtime $1\n";
    }
}

On my PC, both scripts generate exactly the same result file of 10,790 lines. Here are the timings with Cygwin's Perl and Python implementations.

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* > summarypy.log

real    0m8.185s
user    0m8.018s
sys     0m0.092s

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* > summarypl.log

real    0m1.481s
user    0m1.294s
sys     0m0.124s

Originally, this simple text processing took 10.2 seconds in Python and only 1.9 seconds in Perl.

(UPDATE) With the compiled-regex version of the Python script, it now takes 8.2 seconds in Python and 1.5 seconds in Perl. Perl is still much faster.

Is there any way to improve the speed of the Python script, or is it simply the case that Perl will be the faster choice for simple text processing?

By the way, this was not the only test I did for simple text processing. However I wrote the source code, Perl always won by a large margin. Not once did Python perform better for simple m/regex/ match-and-print work.

Please do not suggest using C, C++, Assembly, other flavours of Python, etc.

I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using modules). Boy, I wish I could use Python for all my tasks because of its readability, but I am not willing to give up this much speed.

So, please suggest how the code can be improved to give results comparable with Perl.

Update: 2012-10-18

As other users suggested, Perl has its place and Python has its place.

So, for this question, one can safely conclude that for a simple regex match on each line of hundreds or thousands of text files, writing the results to a file (or printing them to the screen), Perl will always, always WIN in performance. It is as simple as that.

Please note that when I say Perl wins in performance, only standard Perl and standard Python are being compared: no resorting to obscure modules (obscure for a normal user like me), and no calling C, C++, or assembly libraries from Python or Perl. We don't have time to learn all these extra steps and installations for a simple text matching job.

So, Perl rocks for text processing and regex.

Python has its own places to rock.

Update 2013-05-29: An excellent article that does a similar comparison is here. Perl again wins for simple text matching. For more details, read the article.

Recommended answer

This is exactly the sort of thing Perl was designed to do, so it doesn't surprise me that it's faster.

One easy optimization in your Python code would be to precompile those regexes, so they aren't getting recompiled each time.

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')

Then in your loop:

mprev = exists_re.search(currline)

mcurr = location_re.search(currline)

That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re functions in a loop without compiling the pattern first is bad practice in Python.
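
To get a rough feel for how much precompiling matters, the two styles can be timed side by side. This is only a sketch: `PATTERN` is taken from the question, but `LINE` is a made-up sample log line and the helper names `uncompiled` and `precompiled` are hypothetical. Note that CPython's re module keeps a cache of recently compiled patterns, so the difference measured here probably comes mostly from the per-call cache lookup, not from a full recompilation.

```python
import re
import timeit

PATTERN = r'^AwbLocation (.*?) insert into'
# Made-up sample line, only for illustration; real log lines will differ.
LINE = 'AwbLocation 123-45678901 insert into awb_location values (1)'

# Compiled once, outside any loop.
location_re = re.compile(PATTERN, re.I)


def uncompiled(line):
    """Pass the pattern string to the module-level function on every call."""
    m = re.search(PATTERN, line, re.I)
    return m.group(1) if m else None


def precompiled(line):
    """Reuse the pattern object that was compiled once above."""
    m = location_re.search(line)
    return m.group(1) if m else None


if __name__ == '__main__':
    for fn in (uncompiled, precompiled):
        seconds = timeit.timeit(lambda: fn(LINE), number=100_000)
        print(f'{fn.__name__}: {seconds:.3f}s')
```

Both helpers return the same captured text. Only the timings differ, and those depend on the machine and the Python version.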
