Reading multiple Python pickled data at once, buffering and newlines?
Question
To give you some background: I have a large file f, several Gigs in size. It contains consecutive pickles of different objects that were generated by running
for obj in objs: cPickle.dump(obj, f)
I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data were newline-delimited one could use readlines, but I am not sure whether that is true.
Another option that I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines() and loads(). But I fear that a pickled object may contain the "\n" character, which would throw off this file-reading scheme. Is my fear unfounded?
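That fear is well founded. Protocol-0 pickles terminate every opcode argument with "\n", so even a pickled integer spans more than one "line", and the binary protocols can produce a 0x0a byte anywhere in the stream. A minimal sketch demonstrating both (the values are illustrative):

import cPickle

# Protocol 0 ends each opcode argument with '\n', so even an
# integer pickle contains an embedded newline.
print repr(cPickle.dumps(42, 0))      # 'I42\n.'

# Binary protocols can emit the byte 0x0a ('\n') anywhere:
# 10 pickles to BININT1 followed by the raw byte 0x0a.
print '\n' in cPickle.dumps(10, 2)    # True

So readlines() would hand back fragments of pickles, not whole ones, unless you add your own framing.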
One option is to pickle everything out as one huge list of objects, but that would require more memory than I can afford. The setup could be sped up by multi-threading, but I do not want to go there before I get the buffering working properly. What is the "best practice" for situations like this?
I can also read raw bytes into a buffer and invoke loads on that, but then I need to know how many bytes of that buffer were consumed by loads so that I can throw the head away.
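One way to learn how many bytes a load consumed is to wrap the buffer in a file-like object and check its position afterwards; load() reads exactly one pickle and stops at the STOP opcode. A minimal sketch (the two concatenated pickles are illustrative):

import cPickle
from cStringIO import StringIO

buf = cPickle.dumps('spam') + cPickle.dumps('eggs')  # two back-to-back pickles

sio = StringIO(buf)
obj = cPickle.load(sio)       # reads exactly one pickle, then stops
consumed = sio.tell()         # number of bytes this load ate
buf = buf[consumed:]          # throw the head away
print obj, consumed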
Solution
file.readlines() returns a list of the entire contents of the file. You'll want to read a few lines at a time. I think this naive code should unpickle your data:
import pickle

infile = open('/tmp/pickle', 'rb')
buf = []
while True:
    line = infile.readline()
    if not line:
        break
    buf.append(line)
    # Heuristic: treat a line ending in '.\n' (the STOP opcode plus a
    # newline) as the end of one pickled object.
    if line.endswith('.\n'):
        print 'Decoding', buf
        print pickle.loads(''.join(buf))
        buf = []
If you have any control over the program that generates the pickles, I'd pick one of:
- Use the shelve module.
- Print the length (in bytes) of each pickle before writing it to the file so that you know exactly how many bytes to read in each time (see the sketch after this list).
- Same as above, but write the list of integers to a separate file so that you can use those values as an index into the file holding the pickles.
- Pickle a list of K objects at a time. Write the length of that pickle in bytes. Write the pickle. Repeat.
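For the two length-prefix options, a fixed-size header written with struct keeps the framing unambiguous. A minimal sketch of the second option (the file name and the 4-byte little-endian header are assumptions, not anything prescribed by pickle):

import struct
import cPickle

# Write: prefix each pickle with its length as a 4-byte unsigned int.
with open('/tmp/framed', 'wb') as out:
    for obj in ['spam', 'eggs', 42]:
        data = cPickle.dumps(obj, 2)
        out.write(struct.pack('<I', len(data)))
        out.write(data)

# Read: the header says exactly how many bytes the next pickle needs,
# so reads can be batched however the buffer size dictates.
with open('/tmp/framed', 'rb') as f:
    while True:
        header = f.read(4)
        if not header:
            break
        (size,) = struct.unpack('<I', header)
        print cPickle.loads(f.read(size))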
By the way, I suspect that the file object's built-in buffering should get you 99% of the performance gains you're looking for.
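In fact the simplest reader may already be enough to test that claim: cPickle.load() pulls one object per call straight from the file, and the file object's buffering batches the underlying reads. A minimal sketch (a clean end of stream raises EOFError):

import cPickle

infile = open('/tmp/pickle', 'rb')   # buffered by default
while True:
    try:
        obj = cPickle.load(infile)   # one pickled object per call
    except EOFError:                 # end of the pickle stream
        break
    print obj
infile.close()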
If you're convinced that I/O is blocking you, have you thought about trying mmap() and letting the OS handle paging in blocks at a time?
#!/usr/bin/env python
import mmap
import cPickle

fname = '/tmp/pickle'
infile = open(fname, 'rb')
# Map the whole file read-only; the OS pages it in as needed.
m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)

start = 0
while True:
    # Find the next STOP-opcode-plus-newline; find() returns -1 when
    # there are no more, which makes end == 1 after the +2 adjustment.
    end = m.find('.\n', start + 1) + 2
    if end == 1:
        break
    print cPickle.loads(m[start:end])
    start = end