Most efficient way to store list of integers


Problem description

I have recently been working on a project in which one of the aims is to use as little memory as possible to store a series of files using Python 3. Almost all of the files take up very little space, apart from one list of integers that is roughly 333,000 integers long and contains values up to about 8000.

I'm currently using pickle to store the list, which takes up around 7 MB, but I feel like there must be a more memory-efficient way to do this.

I have tried storing it as a text file and as CSV, but both of these used in excess of 10 MB of space.
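Since none of the values exceeds about 8000, a compact fixed-width integer type is an obvious first step. A minimal sketch of this idea using NumPy (the `values` array below is a random stand-in for the real list, which is not shown in the question):

```python
import numpy as np

# Hypothetical stand-in for the real data: ~333,000 integers, none above ~8000.
values = np.random.randint(0, 8001, size=333_000)

# Values up to 8000 fit in int16 (max 32767), i.e. 2 bytes per element,
# so the raw array is about 666 KB versus ~7 MB for the pickled list.
compact = values.astype(np.int16)
compact.tofile('values.int16')

# Reading back requires knowing the dtype that was used when writing.
restored = np.fromfile('values.int16', dtype=np.int16)
assert np.array_equal(compact, restored)
```

The trade-off is that the raw byte dump carries no metadata, so the dtype and length must be known (or stored separately) to read the file back.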

Recommended answer

Here is a small demo, which uses the Pandas module:

# NB: run this in an IPython/Jupyter session (it uses %timeit magics).
# to_hdf requires the PyTables package; feather requires feather-format.
import numpy as np
import pandas as pd
import feather

# let's generate an array of 1M int64 elements...
df = pd.DataFrame({'num_col':np.random.randint(0, 10**9, 10**6)}, dtype=np.int64)
df.info()

%timeit -n 1 -r 1 df.to_pickle('d:/temp/a.pickle')

%timeit -n 1 -r 1 df.to_hdf('d:/temp/a.h5', 'df_key', complib='blosc', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_blosc.h5', 'df_key', complib='blosc', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_zlib.h5', 'df_key', complib='zlib', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_bzip2.h5', 'df_key', complib='bzip2', complevel=5)
%timeit -n 1 -r 1 df.to_hdf('d:/temp/a_lzo.h5', 'df_key', complib='lzo', complevel=5)

%timeit -n 1 -r 1 feather.write_dataframe(df, 'd:/temp/a.feather')

DataFrame info:

In [56]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
num_col    1000000 non-null int64
dtypes: int64(1)
memory usage: 7.6 MB

Results (speed):

In [49]: %timeit -n 1 -r 1 df.to_pickle('d:/temp/a.pickle')
1 loop, best of 1: 16.2 ms per loop

In [50]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a.h5', 'df_key', complib='blosc', complevel=5)
1 loop, best of 1: 39.7 ms per loop

In [51]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_blosc.h5', 'df_key', complib='blosc', complevel=5)
1 loop, best of 1: 40.6 ms per loop

In [52]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_zlib.h5', 'df_key', complib='zlib', complevel=5)
1 loop, best of 1: 213 ms per loop

In [53]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_bzip2.h5', 'df_key', complib='bzip2', complevel=5)
1 loop, best of 1: 1.09 s per loop

In [54]: %timeit -n 1 -r 1 df.to_hdf('d:/temp/a_lzo.h5', 'df_key', complib='lzo', complevel=5)
1 loop, best of 1: 32.1 ms per loop

In [55]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'd:/temp/a.feather')
1 loop, best of 1: 3.49 ms per loop

Results (size):

{ temp }  » ls -lh a*                                                                                         /d/temp
-rw-r--r-- 1 Max None 7.7M Sep 20 23:15 a.feather
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a.h5
-rw-r--r-- 1 Max None 7.7M Sep 20 23:15 a.pickle
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a_blosc.h5
-rw-r--r-- 1 Max None 4.0M Sep 20 23:15 a_bzip2.h5
-rw-r--r-- 1 Max None 4.1M Sep 20 23:15 a_lzo.h5
-rw-r--r-- 1 Max None 3.9M Sep 20 23:15 a_zlib.h5

Conclusion: look at HDF5 (plus blosc or lzo compression) if you need both speed and a reasonable file size, or at the Feather format if you only care about speed - it's about 4 times faster than pickle!
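For the original question's small integers, the same compression idea can also be applied without HDF5 at all. A sketch using only NumPy and the standard-library zlib, assuming (as above) that the values fit in int16 and using random data as a stand-in for the real list:

```python
import zlib
import numpy as np

# Hypothetical data shaped like the question's list: 333,000 ints up to 8000.
values = np.random.randint(0, 8001, size=333_000).astype(np.int16)

# Raw int16 bytes (~666 KB), then DEFLATE-compressed, mirroring what the
# HDF5 'zlib' complib does but with no dependency beyond NumPy itself.
raw = values.tobytes()
packed = zlib.compress(raw, 5)

# Round trip: decompress and rebuild the array with the same dtype.
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int16)
assert np.array_equal(values, restored)
```

How much zlib saves depends heavily on how regular the real data is; uniformly random values compress far worse than sorted or repetitive ones.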
