最快的保存和加载选项的numpy的阵列 [英] Fastest save and load options for a numpy array

查看:472
本文介绍了最快的保存和加载选项的numpy的阵列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个生成二维 numpy的 阵列脚本 s的 DTYPE =浮动的顺序形状(1E3,1e6个电子)。现在,我使用 np.save np.load 与阵列进行IO操作。但是,这些功能需要几秒钟为每个阵列。是否有用于保存和加载整个阵列更快的方法(即,没有对他们的内容的假设和减少它们)?我接受,只要数据完全保留保存之前的阵列 S转换为另一种类型。

I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right now I'm using np.save and np.load to perform IO operations with the arrays. However, these functions take several seconds for each array. Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)? I'm open to converting the arrays to another type before saving as long as the data are retained exactly.

推荐答案

有关非常大的阵列,我听说过几种解决方案,他们大多是在I / O懒惰:

For really big arrays, I've heard about several solutions, and they mostly on being lazy on the I/O :


  • NumPy.memmap ,地图大阵列二进制形式

    • 优点:

      • 不超过numpy的依赖其他

      • 透明更换 ndarray (任何类接受ndarray接受 MEMMAP

      • NumPy.memmap, maps big arrays to binary form
        • Pros :
          • No dependency other than Numpy
          • Transparent replacement of ndarray (Any class accepting ndarray accepts memmap )

          • 您的阵列的数据块被限制在2.5G

          • 仍numpy的吞吐量限制

          使用Python绑定HDF5,一个bigdata就绪文件格式,如 PyTables h5py

          Use Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py


          • 优点:

            • 格式支持COM pression,索引和其他超级不错的功能

            • 显然,最终PB级的大文件格式


            • 有一个分层格式的学习曲线?

            • 必须确定你的性能需求(见下文)

            Python的酸洗系统(退赛,提到Pythonicity而不是速度)

            Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)


            • 优点:

              • 这是Python的! (哈哈)

              • 支持各种对象


              • 可能比别人慢(因为针对任何对象不是数组)

              NumPy.memmap 文档:

              创建内存映射到存储在磁盘上的二进制文件的数组。

              Create a memory-map to an array stored in a binary file on disk.

              内存映射文件被用于访问磁盘上的大文件的小片段,而不读取整个文件到内存

              Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory

              该MEMMAP对象可以在任何地方使用的ndarray被接受。鉴于任何MEMMAP FP isinstance(FP,numpy.ndarray)返回True。

              The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp , isinstance(fp, numpy.ndarray) returns True.

              h5py DOC

              让您的数字存储大量的数据,并轻松操作从numpy的数据。例如,您可以切成存储在磁盘上多TB数据集,好像他们是真正的numpy的阵列。成千上万的数据集可以存储在一个单一的文件,分类和标记但是你想。

              Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

              该格式支持以各种方式的数据的玉米pression(更多比特加载相同的I / O读),但是这意味着数据变得不那么容易单独进行查询,但你的情况(单纯装载/转储数组)可能是有效的

              The format supports compression of data in various ways (more bits loaded for same I/O read), but this means that the data becomes less easy to query individually, but in your case (purely loading / dumping arrays) it might be efficient

              这篇关于最快的保存和加载选项的numpy的阵列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆