Python: compress and save/load large data from/into memory


Question

I have a huge dictionary with numpy arrays as values which consumes almost all RAM. There is no possibility to pickle or compress it entirely. I've checked some solutions that read/write in chunks using zlib, but they work with files, StringIO, etc., whereas I want to read/write from/into RAM.

Here is the closest example to what I want, but it only has the writing part. How can I read the object back after saving it this way, given that the chunks were written one after another and the compressed chunks of course have different lengths?

import zlib


class ZlibWrapper():
    # chunksize is used to save memory, otherwise huge object will be copied
    def __init__(self, filename, chunksize=268435456): # 256 MB
        self.filename = filename
        self.chunksize = chunksize

    def save(self, data):
        """Saves a compressed object to disk"""
        mdata = memoryview(data)
        with open(self.filename, 'wb') as f:
            for i in range(0, len(mdata), self.chunksize):
                mychunk = zlib.compress(bytes(mdata[i:i+self.chunksize]))
                f.write(mychunk)

    def load(self):

        # ???

        return data
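
For context (this sketch is not part of the original post), the missing load step could look roughly like the following: each zlib.compress() call in save() writes a complete zlib stream, so the file is a concatenation of such streams, and zlib.decompressobj() can walk them one by one, starting a fresh decompressor whenever the current stream reports eof (the bytes already read from the next stream are exposed as unused_data). The function name load_compressed is illustrative, and the result is reassembled in memory, so this only shows how to split the streams, not how to avoid the RAM cost:

import zlib

def load_compressed(filename, readsize=1048576):
    """Illustrative counterpart to ZlibWrapper.save(): the file contains
    several complete zlib streams written back to back."""
    chunks = []
    decomp = zlib.decompressobj()
    with open(filename, 'rb') as f:
        while True:
            buf = f.read(readsize)  # read the file in 1 MB pieces
            if not buf:
                break
            while buf:
                chunks.append(decomp.decompress(buf))
                if decomp.eof:                     # current zlib stream finished
                    buf = decomp.unused_data       # bytes already read from the next stream
                    decomp = zlib.decompressobj()  # start a fresh decompressor for it
                else:
                    buf = b''
    return b''.join(chunks)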

Uncompressed objects would unfortunately be too huge to be sent over the network, and zipping them externally would create additional complications.

Pickle unfortunately starts to consume RAM and the system hangs.

Following the discussion with Charles Duffy, here is my attempt at serialization (it does not work at the moment - it does not even compress the strings):

import zlib
import json

import numpy as np


mydict = {"a":np.array([1,2,3]),"b":np.array([4,5,6]),"c":np.array([0,0,0])}


# write to compressed stream ---------------------

def string_stream_serialization(dic):
    for key, val in dic.items():
        #key_encoded = key.encode("utf-8")  # is not json serializable
        yield json.dumps([key,val.tolist()])


output = ""
compressor = zlib.compressobj()
decompressor = zlib.decompressobj()

stream = string_stream_serialization(mydict)

with open("outfile.compressed", "wb") as f:
    for s in stream:
        if not s:
            f.write(compressor.flush())
            break
        f.write(compressor.compress(s.encode('utf-8'))) # .encode('utf-8') converts to bytes


# read from compressed stream: --------------------

def read_in_chunks(file_object, chunk_size=1024): # I set another chunk size intentionally
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


reconstructed = {}

with open("outfile.compressed", "rb") as f:
    for s in read_in_chunks(f):
        data = decompressor.decompress(decompressor.unconsumed_tail + s)
        while data:
            arr = json.loads(data.decode("utf-8"))
            reconstructed[arr[0]] = np.array(arr[1])
            data = decompressor.decompress(decompressor.unconsumed_tail)


print(reconstructed)

Answer

Your first focus should be on having a sane way to serialize and deserialize your data. We have several constraints about your data, given in the question itself or in comments on it:

  • Your data consists of a dictionary with a very large number of key/value pairs
  • All keys are unicode strings
  • All values are numpy arrays which are individually short enough to easily fit in memory at any given time (or even to allow multiple copies of any single value), although in aggregate the storage required becomes extremely large.

This suggests a fairly simple implementation:

import io
import struct

import numpy

def serialize(f, content):
    for k,v in content.items():
        # write length of key, followed by key as string
        k_bstr = k.encode('utf-8')
        f.write(struct.pack('L', len(k_bstr)))
        f.write(k_bstr)
        # write length of value, followed by value in numpy.save format
        memfile = io.BytesIO()
        numpy.save(memfile, v)
        f.write(struct.pack('L', memfile.tell()))
        f.write(memfile.getvalue())

def deserialize(f):
    retval = {}
    while True:
        content = f.read(struct.calcsize('L'))
        if not content: break
        k_len = struct.unpack('L', content)[0]
        k_bstr = f.read(k_len)
        k = k_bstr.decode('utf-8')
        v_len = struct.unpack('L', f.read(struct.calcsize('L')))[0]
        v_bytes = io.BytesIO(f.read(v_len))
        v = numpy.load(v_bytes)
        retval[k] = v
    return retval

As a simple test:

test_file = io.BytesIO()
serialize(test_file, {
    "First Key": numpy.array([123,234,345]),
    "Second Key": numpy.array([321,432,543]),
})

test_file.seek(0)
print(deserialize(test_file))

...so, we've got that -- now, how do we add compression? Easily.

import gzip

with gzip.open('filename.gz', 'wb') as gzip_file:
    serialize(gzip_file, your_data)

...or, on the decompression side:

with gzip.open('filename.gz', 'rb') as gzip_file:
    your_data = deserialize(gzip_file)

This works because the gzip library already streams data out as it's requested, rather than compressing it or decompressing it all at once. There's no need to do windowing and chunking yourself -- just leave it to the lower layer.
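
As a quick illustration (not part of the original answer), here is a possible round trip that uses the serialize() and deserialize() functions above; the file name mydict.gz and the sample data are made up for this example:

import gzip

import numpy

# build a small sample dictionary of numpy arrays (made up for this example)
sample = {"key_%d" % i: numpy.arange(1000) for i in range(100)}

# serialize() compresses on the fly as it writes through the gzip file object
with gzip.open('mydict.gz', 'wb') as gzip_file:
    serialize(gzip_file, sample)

# deserialize() decompresses on the fly as it reads
with gzip.open('mydict.gz', 'rb') as gzip_file:
    restored = deserialize(gzip_file)

# the reconstructed dictionary matches the original
assert restored.keys() == sample.keys()
assert all(numpy.array_equal(restored[k], sample[k]) for k in sample)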

