Fastest save and load options for a numpy array


Question


I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right now I'm using np.save and np.load to perform IO operations with the arrays. However, these functions take several seconds for each array. Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)? I'm open to converting the arrays to another type before saving as long as the data are retained exactly.
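For reference, the `np.save`/`np.load` round trip described in the question looks roughly like this (the shape is scaled down from `(1e3, 1e6)` so it fits in memory; the filename is arbitrary):

```python
import numpy as np

# Scaled-down stand-in for the (1e3, 1e6) arrays from the question
# (the real shape would need ~8 GB per array at float64).
arr = np.random.rand(100, 1000)

np.save("arr.npy", arr)        # writes the array in NumPy's .npy format
loaded = np.load("arr.npy")    # reads it back into memory
```

The `.npy` format stores the dtype and shape in a header, so the round trip is exact for float data.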

Answer


For really big arrays, I've heard about several solutions, and they mostly rely on being lazy with the I/O:

  • NumPy.memmap, maps big arrays to binary form
    • Pros:
      • No dependency other than Numpy
      • Transparent replacement of ndarray (any class accepting an ndarray accepts a memmap)
    • Cons:
      • Chunks of your array are limited to 2.5G
      • Still limited by Numpy throughput
  • Use Python bindings for HDF5, a bigdata-ready file format, such as PyTables or h5py
    • Pros:
      • The format supports compression, indexing, and other super nice features
      • Apparently the ultimate PetaByte-large file format
    • Cons:
      • Learning curve of a hierarchical format?
      • Have to define what your performance needs are (see below)
  • Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)
    • Pros:
      • It's Pythonic! (lol)
      • Supports all sorts of objects
    • Cons:
      • Probably slower than the others (since it targets arbitrary objects, not arrays)
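The pickling option above can be sketched in a few lines (the filename is arbitrary; for large arrays, `pickle.HIGHEST_PROTOCOL` matters because older protocols serialize far less efficiently):

```python
import pickle
import numpy as np

arr = np.random.rand(100, 1000)

# pickle handles arbitrary Python objects, arrays included,
# which is exactly why it is not specialized (or fast) for them.
with open("arr.pkl", "wb") as f:
    pickle.dump(arr, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("arr.pkl", "rb") as f:
    restored = pickle.load(f)
```
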

From the NumPy.memmap documentation:

  Create a memory-map to an array stored in a binary file on disk.

  Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.

  The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp, isinstance(fp, numpy.ndarray) returns True.
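A minimal memmap round trip might look like the following (filename and shape are illustrative; note that unlike `.npy`, a raw memmap file has no header, so you must record the dtype and shape yourself):

```python
import numpy as np

shape = (100, 1000)  # scaled-down stand-in for (1e3, 1e6)

# Write: create a memory-mapped file and fill it like a normal ndarray.
fp = np.memmap("arr.dat", dtype=np.float64, mode="w+", shape=shape)
fp[:] = np.random.rand(*shape)
fp.flush()  # push any buffered changes to disk

# Read: map the same file back; only the segments you touch are paged in.
back = np.memmap("arr.dat", dtype=np.float64, mode="r", shape=shape)
```

As the quoted documentation says, the result passes `isinstance(back, np.ndarray)`, so it can be dropped into code that expects a plain array.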


HDF5 arrays

From the h5py documentation:


              Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

The format supports compressing the data in various ways (more bits loaded for the same I/O read), which makes the data less easy to query individually; but in your case (purely loading/dumping arrays) it can be efficient.
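A sketch of storing and reloading an array with h5py, including the compression mentioned above (the filename and dataset name `my_array` are arbitrary; `gzip` is the portable built-in filter, and h5py switches the dataset to chunked storage automatically when compression is requested):

```python
import numpy as np
import h5py

arr = np.random.rand(100, 1000)

with h5py.File("arrays.h5", "w") as f:
    # One file can hold thousands of named datasets like this one.
    f.create_dataset("my_array", data=arr, compression="gzip")

with h5py.File("arrays.h5", "r") as f:
    restored = f["my_array"][:]  # slicing reads (and decompresses) the data
```

Slicing with `[:]` loads everything, but the same syntax lets you read just a segment (e.g. `f["my_array"][0:10]`) without touching the rest of the file.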

