Python vs. Java gzip performance


Question

I've written a small program that, in part, reads in a file and parses
it. Sometimes, the file is gzipped. The code that I use to get the
file object is like so:

    if filename.endswith(".gz"):
        file = GzipFile(filename)
    else:
        file = open(filename)

Then I parse the contents of the file in the usual way (for line in
file:...)

The equivalent Java code goes like this:

    if (isZipped(aFile)) {
        input = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(aFile))));
    } else {
        input = new BufferedReader(new FileReader(aFile));
    }

Then I parse the contents similarly to the Python version (while
nextLine = input.readLine...)

The Java version of this code is roughly 2x-3x faster than the Python
version. I can get around this problem by replacing the Python
GzipFile object with an os.popen call to gzcat, but then I sacrifice
portability. Is there something that can be improved in the Python
version?

Thanks -- Bill.

Recommended answer

Bill wrote:
> The Java version of this code is roughly 2x-3x faster than the Python
> version. I can get around this problem by replacing the Python
> GzipFile object with an os.popen call to gzcat, but then I sacrifice
> portability. Is there something that can be improved in the Python
> version?




Don't use readline/readlines. Instead, read in larger chunks, and break
them into lines yourself. For example, if you think the entire file should
fit into memory, read it at once.

If that helps, try editing gzip.py to incorporate that approach.

Regards,
Martin
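
Martin's suggestion could look roughly like the sketch below. It is written for Python 3 and assumes the decompressed content fits in memory and is UTF-8 text; the helper name `read_lines_bulk` and the sample file are illustrative and do not come from the thread.

```python
import gzip
import os
import tempfile


def read_lines_bulk(filename):
    """Return all lines of a file, decompressing it if the name ends in .gz.

    The content is fetched with a single read() call and split in memory,
    rather than calling readline() once per line.
    """
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "rb") as f:
        data = f.read()  # one large read; assumes the content fits in memory
    # Assumption: the file is UTF-8 text.
    return data.decode("utf-8").splitlines()


# Small self-check against a throwaway gzip file (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt.gz")
with gzip.open(sample_path, "wb") as f:
    f.write(b"This is a test file\n" * 1000)

lines = read_lines_bulk(sample_path)
print(len(lines), repr(lines[0]))
```

For files too large to hold in memory, the same idea applies with fixed-size `read(n)` calls, carrying any partial last line over to the next chunk.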


Caleb Hattingh replied:

I tried this:

    from timeit import Timer

    # Try readlines
    print(Timer('import gzip; lines = gzip.GzipFile("gztest.txt.gz").readlines(); '
                '[i + "1" for i in lines]').timeit(200))
    # Try file object - uses buffering?
    print(Timer('import gzip; '
                '[i + "1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200))

Produces:

    3.90938591957
    3.98982691765

There doesn't seem to be much difference, probably because the test file
easily fits into memory, and so disk buffering has no effect. The file
"gztest.txt.gz" is a gzipped file with 1000 lines, each being "This is
a test file".
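
The benchmark above compares `readlines()` with iteration over the file object, but it does not time the single large read suggested earlier in the thread. The sketch below times all three on a generated file. It targets Python 3, the temporary file path and the repeat count of 20 are arbitrary choices, and no particular timings are claimed.

```python
import gzip
import os
import tempfile
import timeit

# Build a test file like the one described: 1000 identical lines.
path = os.path.join(tempfile.mkdtemp(), "gztest.txt.gz")
with gzip.open(path, "wb") as f:
    f.write(b"This is a test file\n" * 1000)


def via_readlines():
    with gzip.GzipFile(path) as f:
        return [line + b"1" for line in f.readlines()]


def via_iteration():
    with gzip.GzipFile(path) as f:
        return [line + b"1" for line in f]


def via_bulk_read():
    with gzip.GzipFile(path) as f:
        # One read() call, then split in memory; True keeps the line endings
        # so the result matches the other two variants.
        return [line + b"1" for line in f.read().splitlines(True)]


results = {}
for func in (via_readlines, via_iteration, via_bulk_read):
    results[func.__name__] = timeit.timeit(func, number=20)
    print(func.__name__, results[func.__name__])
```

A 1000-line file is small enough that fixed overhead may dominate, so a larger input is likely needed before any difference between the variants becomes visible.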

