Unpickle a data structure vs. build by calling readlines()


Question

I have a use case where I need to build a list from the lines in a file. This operation will be performed potentially 100s of times on a distributed network. I've been using the obvious solution of:

with open("file.txt") as f:
    ds = f.readlines()

I just had the thought that perhaps I would be better off creating this list once, pickling it into a file and then using that file to unpickle the data on each node.
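
Concretely, the idea would be something like the sketch below (the file name "file.list.pickle" and the helper names are just placeholders for illustration):

import pickle

def build_and_pickle():
    # Run once, up front: build the list and pickle it to disk.
    with open("file.txt") as f:
        ds = f.readlines()
    with open("file.list.pickle", "wb") as out:
        pickle.dump(ds, out, protocol=pickle.HIGHEST_PROTOCOL)

def load_on_node():
    # Run on each node: unpickle the prebuilt list instead of re-reading the text file.
    with open("file.list.pickle", "rb") as f:
        return pickle.load(f)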

Would there be any performance increase if I did this?

Answer

Would there be any performance increase if I did this?

Test it and see!

# Prefer the C implementation of pickle on Python 2; Python 3's pickle is already accelerated.
try:
    import cPickle as pickle
except ImportError:
    import pickle
import timeit

def lines():
    with open('lotsalines.txt') as f:
        return f.readlines()

def pickles():
    with open('lotsalines.pickle', 'rb') as f:
        return pickle.load(f)

# Build the list once, then time writing it out with the highest pickle protocol.
ds = lines()
with open('lotsalines.pickle', 'wb') as f:
    t = timeit.timeit(lambda: pickle.dump(ds, f, protocol=-1), number=1)
print('pickle.dump: {}'.format(t))

# Time reading the data back both ways.
print('readlines:   {}'.format(timeit.timeit(lines, number=10)))
print('pickle.load: {}'.format(timeit.timeit(pickles, number=10)))

My 'lotsalines.txt' file is just that source duplicated until it's 655360 lines long, or 15532032 bytes.
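
A test file like that can be generated with a small loop along these lines ('source.txt' here is a stand-in for whatever seed file you duplicate):

# Duplicate a small seed file until the output reaches the target line count.
with open('source.txt') as src:
    chunk = src.readlines()

target = 655360
with open('lotsalines.txt', 'w') as out:
    written = 0
    while written < target:
        out.writelines(chunk)
        written += len(chunk)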

Apple Python 2.7.2:

readlines:   0.640027999878
pickle.load: 2.67698192596

And the pickle file is 19464748 bytes.

Python.org 3.3.0:

readlines:   1.5357899703085423
pickle.load: 1.5975534357130527

And it's 20906546 bytes.

So, Python 3 has sped up pickle quite a bit over Python 2, at least if you use pickle protocol 3, but it's still nowhere near as fast as a simple readlines. (And readlines has gotten a lot slower in 3.x, as well as being deprecated.)

But really, if you've got performance concerns, you should consider whether you need the list in the first place. A quick test shows that building a list of this size is almost half the cost of the readlines (timing list(range(655360)) in 3.x, list(xrange(655360)) in 2.x). And it uses a ton of memory (which is probably actually why it's slow, too). If you don't actually need the list—and usually you don't—just iterate over the file, getting lines as you need them.
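
For instance, if each line only needs to be handled once, a plain loop over the file object keeps memory flat (process_line is a hypothetical placeholder for whatever per-line work you actually do):

def process_line(line):
    # Hypothetical placeholder for the real per-line work.
    pass

# A file object is itself an iterator over lines, so nothing is ever
# held in memory beyond the current line (plus a small read buffer).
with open('lotsalines.txt') as f:
    for line in f:
        process_line(line)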
