Deserialization of large numpy arrays using pickle is an order of magnitude slower than using numpy


Question


I am deserializing large numpy arrays (500MB in this example) and I find the results vary by orders of magnitude between approaches. Below are the 3 approaches I've timed.


I'm receiving the data from the multiprocessing.shared_memory package, so the data comes to me as a memoryview object. But in these simple examples, I just pre-create a byte array to run the test.
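(For context, a minimal sketch of the shared-memory setup described above, with illustrative names and sizes: wrapping the `shared_memory` buffer in an `ndarray` view involves no copy at all, since the array reads the shared segment directly.)

```python
from multiprocessing import shared_memory
import numpy as np

# Create a small shared segment and view it as a numpy array (zero-copy).
shm = shared_memory.SharedMemory(create=True, size=1024)
view = np.ndarray(shape=(1024,), dtype=np.uint8, buffer=shm.buf)
view[:] = 7              # writes land directly in the shared segment
total = int(view.sum())  # 1024 * 7
del view                 # drop the exported buffer before closing the segment
shm.close()
shm.unlink()
```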


I wonder if there are any mistakes in these approaches, or if there are other techniques I didn't try. Deserialization in Python is a real pickle of a problem if you want to move data fast and not lock the GIL just for the IO. A good explanation as to why these approaches vary so much would also be a good answer.

""" Deserialization speed test """
import numpy as np
import pickle
import time
import io


sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8)  # 500 MB data
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)

result = None

print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))

Results:

Deserialize using pickle...
Time: 0.2509949207 sec
Deserialize from bytes...
Time: 0.0204288960 sec
Deserialize using numpy load from BytesIO...
Time: 28.9850852489 sec


The second option is the fastest, but notably less elegant because I need to explicitly serialize the shape and dtype information.
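One way around the elegance problem is to pack the shape and dtype into a small header in front of the raw bytes. This is a hypothetical sketch (the `pack`/`unpack` helpers are not from the question), just to show that the side channel can live inside the buffer itself:

```python
import struct
import numpy as np

def pack(arr: np.ndarray) -> bytes:
    """Prefix tobytes() output with a header encoding dtype and shape."""
    dt = arr.dtype.str.encode()
    header = struct.pack('<B', len(dt)) + dt
    header += struct.pack('<B', arr.ndim) + struct.pack(f'<{arr.ndim}q', *arr.shape)
    return header + arr.tobytes()

def unpack(buf: bytes) -> np.ndarray:
    """Rebuild the array as a zero-copy view over the byte buffer."""
    n = buf[0]
    dt = np.dtype(buf[1:1 + n].decode())
    ndim = buf[1 + n]
    off = 2 + n
    shape = struct.unpack_from(f'<{ndim}q', buf, off)
    off += 8 * ndim
    return np.ndarray(shape=shape, dtype=dt, buffer=buf, offset=off)

a = np.arange(12, dtype=np.int64).reshape(3, 4)
b = unpack(pack(a))
assert b.shape == (3, 4) and (a == b).all()
```

The reconstructed array is a read-only view over the buffer, so the deserialization cost stays at the same near-zero level as option two.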

Answer


I found your question useful. I was looking for the best numpy serialization approach myself and confirmed that np.load() was the best of your three, except that it was beaten by pyarrow in my add-on test below. Arrow is now a very popular data-serialization framework for distributed compute (e.g. Spark).

""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
import pyarrow as pa


sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8)  # 500 MB data
pa_buf = pa.serialize(sample).to_buffer()  # note: pa.serialize was deprecated in pyarrow 2.0

serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)

result = None

print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))

print('Deserialize pyarrow')
t0 = time.time()
restored_data = pa.deserialize(pa_buf)
print('Time: {:.10f} sec'.format(time.time() - t0))


Results from i3.2xlarge on Databricks Runtime 8.3ML Python 3.8, Numpy 1.19.2, Pyarrow 1.0.1

Deserialize using pickle...
Time: 0.4069395065 sec
Deserialize from bytes...
Time: 0.0281322002 sec
Deserialize using numpy load from BytesIO...
Time: 0.3059172630 sec
Deserialize pyarrow
Time: 0.0031735897 sec


Your BytesIO result was about 100x slower than mine; I don't know why.
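For completeness, plain pickle can also come close to zero-copy on Python 3.8+ via protocol 5 out-of-band buffers (this is not from the original answer, just a sketch of the technique): dumps() hands the array payload over as PickleBuffer objects instead of copying it into the pickle stream, and loads() reconstructs from those buffers.

```python
import pickle
import numpy as np

sample = np.ones(1_000_000, dtype=np.uint8)

# Collect the large payloads out-of-band instead of embedding them.
buffers = []
payload = pickle.dumps(sample, protocol=5, buffer_callback=buffers.append)

# Reconstruct by supplying the same buffers back to loads().
restored = pickle.loads(payload, buffers=buffers)
assert np.array_equal(sample, restored)
```

The pickle stream itself stays tiny; the heavy bytes travel separately, which is exactly what you want when the receiver can map them from shared memory.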

