Most efficient way to store list of integers
Question
I have recently been doing a project in which one of the aims is to use as little memory as possible to store a series of files using Python 3. Almost all of the files take up very little space, apart from one list of integers that is roughly 333,000 integers long and has integers up to about 8000 in size.
I'm currently using pickle to store the list, which takes up around 7 MB, but I feel like there must be a more memory-efficient way to do this.
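For reference, the pickle baseline is easy to measure. The sketch below uses a synthetic stand-in for the list described in the question (~333,000 random integers up to 8000); the file path is illustrative:

```python
import os
import pickle
import random
import tempfile

# Synthetic stand-in for the list described in the question:
# ~333,000 integers, each up to about 8000.
data = [random.randint(0, 8000) for _ in range(333_000)]

path = os.path.join(tempfile.gettempdir(), "ints.pickle")
with open(path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

size = os.path.getsize(path)
print(f"pickle size: {size / 1024:.0f} KiB")

# Round-trip check: the stored list reads back unchanged.
with open(path, "rb") as f:
    assert pickle.load(f) == data
```

The exact size depends on the pickle protocol and the value distribution, so measure with your own data rather than relying on these numbers.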
I have tried storing it as a text file and as a CSV, but both of these used in excess of 10 MB of space.
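Since every value fits comfortably in 16 bits (the maximum is about 8000), the standard-library array module can store the list at exactly two bytes per element, with no extra dependencies. A minimal sketch, with an illustrative file name:

```python
import os
import random
import tempfile
from array import array

values = [random.randint(0, 8000) for _ in range(333_000)]

# 'h' = signed 16-bit integers: 2 bytes per element, ~650 KiB total.
arr = array("h", values)

path = os.path.join(tempfile.gettempdir(), "ints.bin")
with open(path, "wb") as f:
    arr.tofile(f)

print(os.path.getsize(path))  # 666000 bytes: 2 * 333,000

# Reading it back requires knowing the typecode and element count.
restored = array("h")
with open(path, "rb") as f:
    restored.fromfile(f, len(values))
assert list(restored) == values
```

This cuts the 7 MB pickle down by an order of magnitude before any compression is applied.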
Solution
Here is a small demo using the Pandas module:
import numpy as np
import pandas as pd
import feather
# let's generate an array of 1M int64 elements...
df = pd.DataFrame({'num_col':np.random.randint(0, 10**9, 10**6)}, dtype=np.int64)
df.info()
%timeit -n 1 -r 1 df.to_pickle('d:/temp/a.pickle')
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a.h5', 'df_key', complib='blosc', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_blosc.h5', 'df_key', complib='blosc', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_zlib.h5', 'df_key', complib='zlib', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_bzip2.h5', 'df_key', complib='bzip2', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_lzo.h5', 'df_key', complib='lzo', complevel=5)
%timeit -n 1 -r 1 feather.write_dataframe(df, 'd:/temp/a.feather')
DataFrame info:
In [56]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
num_col 1000000 non-null int64
dtypes: int64(1)
memory usage: 7.6 MB
Results (speed):
In [49]: %timeit -n 1 -r 1 df.to_pickle('d:/temp/a.pickle')
1 loop, best of 1: 16.2 ms per loop
In [50]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a.h5', 'df_key', complib='blosc', complevel=5)
1 loop, best of 1: 39.7 ms per loop
In [51]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_blosc.h5', 'df_key', complib='blosc', complevel=5)
1 loop, best of 1: 40.6 ms per loop
In [52]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_zlib.h5', 'df_key', complib='zlib', complevel=5)
1 loop, best of 1: 213 ms per loop
In [53]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_bzip2.h5', 'df_key', complib='bzip2', complevel=5)
1 loop, best of 1: 1.09 s per loop
In [54]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_lzo.h5', 'df_key', complib='lzo', complevel=5)
1 loop, best of 1: 32.1 ms per loop
In [55]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'd:/temp/a.feather')
1 loop, best of 1: 3.49 ms per loop
Results (size):
{ temp } » ls -lh a* /d/temp
-rw-r--r-- 1 Max None 7.7M Sep 20 23:15 a.feather
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a.h5
-rw-r--r-- 1 Max None 7.7M Sep 20 23:15 a.pickle
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a_blosc.h5
-rw-r--r-- 1 Max None 4.0M Sep 20 23:15 a_bzip2.h5
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a_lzo.h5
-rw-r--r-- 1 Max None 3.9M Sep 20 23:15 a_zlib.h5
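If pulling in Pandas, PyTables, and Feather is not an option, the same size trade-off can be explored with the standard library alone: pack the integers into raw 16-bit bytes and compare compressors. A sketch using synthetic data (ratios will differ for real, less random data):

```python
import bz2
import lzma
import random
import zlib
from array import array

values = [random.randint(0, 8000) for _ in range(333_000)]
raw = array("h", values).tobytes()  # 2 bytes per value

# Compare stdlib compressors on the packed bytes.
for name, compress in [("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    packed = compress(raw)
    print(f"{name}: {len(raw)} -> {len(packed)} bytes")

# Round trip: decompressing recovers the original values.
assert array("h", zlib.decompress(zlib.compress(raw))).tolist() == values
```

Because values up to 8000 only use 13 of the 16 bits, even a general-purpose compressor finds some slack; data with repeated or clustered values compresses far better still.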
Conclusion: look at HDF5 (with blosc or lzo compression) if you need both speed and a reasonable size, or at the Feather format if you only care about speed: here it is roughly 4 times faster than pickle.