Reading multiple Python pickled data at once, buffering and newlines?


Problem description

To give you some background:

I have a large file f, several Gigs in size. It contains consecutive pickles of different objects that were generated by running

for obj in objs: cPickle.dump(obj, f)

I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data is indeed newline-delimited, one could use readlines, but I am not sure whether that is true.

Another option that I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines() and loads(). But I fear that a pickled object may contain the "\n" character, which would throw off this file-reading scheme. Is my fear unfounded?
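
For reference, here is a minimal sketch (not from the original post) of the write side of that newline-separated scheme, assuming the default ASCII protocol 0; the path and sample objects are made up. Note that a protocol-0 pickle itself spans several '\n'-terminated lines, so a plain readlines()-plus-loads()-per-line approach would not work on its own; the reader in the answer below copes with this by accumulating lines until it sees the '.\n' terminator.

import cPickle

# Made-up sample data and path, purely for illustration.
objs = [{'a': 1}, [1, 2, 3], 'hello']
with open('/tmp/pickle', 'wb') as out:
    for obj in objs:
        out.write(cPickle.dumps(obj))   # protocol 0 (ASCII); ends with the STOP opcode '.'
        out.write('\n')                 # explicit record separator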

One option is to pickle everything out as one huge list of objects, but that would require more memory than I can afford. The setup can be sped up by multi-threading, but I do not want to go there before I get the buffering working properly. What's the "best practice" for situations like this?

I can also read raw bytes into a buffer and invoke loads on that, but then I need to know how many bytes of the buffer were consumed by loads so that I can discard the consumed head.
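
One way to sidestep the "how many bytes did loads consume" problem (a sketch, not part of the original question): hand the buffered file object straight to cPickle.load(), which reads exactly one pickle per call and leaves the file position just past it.

import cPickle

# Read back consecutive pickles written by `for obj in objs: cPickle.dump(obj, f)`.
# The file object's own buffering does the block reads for us.
with open('/tmp/pickle', 'rb') as f:    # placeholder path
    while True:
        try:
            obj = cPickle.load(f)       # consumes exactly one pickle
        except EOFError:                # raised once the stream is exhausted
            break
        print obj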

Recommended answer

file.readlines() returns a list of the entire contents of the file. You'll want to read a few lines at a time. I think this naive code should unpickle your data:

import pickle

infile = open('/tmp/pickle', 'rb')
buf = []
while True:
    line = infile.readline()
    if not line:                        # EOF
        break
    buf.append(line)
    # With the newline-separated scheme described above, the last line of each
    # record ends with the protocol-0 STOP opcode '.' plus the '\n' separator.
    if line.endswith('.\n'):
        print 'Decoding', buf
        print pickle.loads(''.join(buf))
        buf = []

If you have any control over the program that generates the pickles, I'd pick one of:

  1. Use the shelve module.
  2. Write the length (in bytes) of each pickle before writing it to the file so that you know exactly how many bytes to read in each time (see the sketch after this list).
  3. Same as above, but write the list of integers to a separate file so that you can use those values as an index into the file holding the pickles.
  4. Pickle a list of K objects at a time. Write the length of that pickle in bytes. Write the pickle. Repeat.
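
A rough sketch of the length-prefix idea behind options 2 and 4 (not from the original answer; the 4-byte struct header is just one possible format):

import cPickle
import struct

def dump_with_length(objs, outfile):
    # Prefix each pickle with its size as a 4-byte big-endian unsigned int.
    for obj in objs:
        data = cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL)
        outfile.write(struct.pack('>I', len(data)))
        outfile.write(data)

def load_with_length(infile):
    # The header says exactly how many bytes the next pickle occupies,
    # so we never have to guess where one record ends.
    while True:
        header = infile.read(4)
        if len(header) < 4:             # clean end of file
            break
        (length,) = struct.unpack('>I', header)
        yield cPickle.loads(infile.read(length))

Option 4 is the same idea, with a pickled list of K objects behind each length header.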

By the way, I suspect that the file's built-in buffering should get you 99% of the performance gains you're looking for.
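
If you want to lean on that, the buffer size can be passed directly to open(); a one-line sketch (the 1 MB figure is arbitrary):

infile = open('/tmp/pickle', 'rb', 1024 * 1024)   # third argument is the buffer size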

If you're convinced that I/O is blocking you, have you thought about trying mmap() and letting the OS handle paging in blocks at a time?

#!/usr/bin/env python

import mmap
import cPickle

fname = '/tmp/pickle'
infile = open(fname, 'rb')
m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
start = 0
while True:
    # Each record ends with the STOP opcode '.' plus the '\n' separator;
    # find() returns -1 (so end == 1) once there are no more records.
    end = m.find('.\n', start + 1) + 2
    if end == 1:
        break
    print cPickle.loads(m[start:end])
    start = end
