Compress numpy arrays efficiently


Question

I tried various data compression methods when saving some numpy arrays to disk.

These 1D arrays contain data sampled at a certain sampling rate (it could be sound recorded with a microphone, or any other measurement with any sensor): the data is essentially continuous (in a mathematical sense; of course, after sampling it is now discrete data).

I tried HDF5 (h5py):

f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)

but this is quite slow, and the compression ratio is not the best we can expect.

I also tried

numpy.savez_compressed()

but once again, it may not be the best compression algorithm for the kind of data described above.

What would you choose to get a better compression ratio on a numpy array containing such data?

(I thought about things like lossless FLAC (initially designed for audio), but is there an easy way to apply such an algorithm to numpy data?)

Recommended answer

  1. Noise is incompressible. Thus, any part of your data that is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16, the remaining 24-16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.
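To make the entropy remark concrete, here is a minimal sketch that estimates the entropy of two made-up noise sources from their histograms and turns each into an upper bound on the lossless ratio for 24-bit samples. The helper `empirical_entropy_bits` and both noise models are assumptions chosen for illustration, not something from the original answer, and a histogram estimate ignores any correlation between samples.

```python
import numpy as np

def empirical_entropy_bits(values):
    """Shannon entropy, in bits per sample, of the observed value histogram."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical noise models, chosen only for illustration.
uniform_noise = rng.integers(0, 256, size=n)  # uniform over 8 bits
gaussian_noise = np.rint(rng.normal(0, 16, size=n)).astype(np.int64)  # peaked

h_uniform = empirical_entropy_bits(uniform_noise)
h_gaussian = empirical_entropy_bits(gaussian_noise)

bits_per_sample = 24
# If the noise alone costs h bits per sample, the lossless ratio
# cannot exceed bits_per_sample / h, however compressible the signal is.
max_ratio_uniform = bits_per_sample / h_uniform
max_ratio_gaussian = bits_per_sample / h_gaussian

print(f"uniform:  {h_uniform:.2f} bits/sample, ratio <= {max_ratio_uniform:.2f}:1")
print(f"gaussian: {h_gaussian:.2f} bits/sample, ratio <= {max_ratio_gaussian:.2f}:1")
```

The uniform case comes out close to 8 bits per sample, which reproduces the 3:1 bound above; the peaked Gaussian noise has lower entropy and therefore allows a somewhat higher bound.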

Compressing data is based on modelling it (partly to remove redundancy, but also partly so that you can separate the signal from the noise and discard the noise). For example, if you know your data is bandwidth-limited to 10 MHz and you're sampling at 200 MHz, you can do an FFT, zero out the high frequencies, and store only the coefficients for the low frequencies (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
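The FFT idea can be sketched roughly as follows, using the 200 MHz / 10 MHz figures from the paragraph above. The test signal (two tones placed exactly on FFT bins) and all variable names are assumptions made for the illustration; on real data, anything above the cutoff is discarded, so the scheme is lossy.

```python
import numpy as np

fs = 200e6         # sampling rate in Hz (figure from the text)
bandwidth = 10e6   # known signal bandwidth in Hz (figure from the text)
n = 4096           # number of samples in this hypothetical record

# Hypothetical band-limited test signal: two tones on exact FFT bins
# (bins 60 and 150, roughly 2.9 MHz and 7.3 MHz), so there is no leakage.
k = np.arange(n)
x = np.sin(2 * np.pi * 60 * k / n) + 0.5 * np.sin(2 * np.pi * 150 * k / n)

spectrum = np.fft.rfft(x)                # bins covering 0 .. fs/2
freqs = np.fft.rfftfreq(n, d=1 / fs)

# Keep only the bins at or below the bandwidth; this is all that gets stored.
keep = int(np.searchsorted(freqs, bandwidth, side="right"))
stored = spectrum[:keep]

# Reconstruction: pad the stored coefficients with zeros and invert.
padded = np.zeros_like(spectrum)
padded[:keep] = stored
x_rec = np.fft.irfft(padded, n=n)

# Each complex coefficient holds two real numbers.
ratio = n / (2 * keep)
max_err = float(np.max(np.abs(x - x_rec)))
print(f"kept {keep} of {len(spectrum)} bins, ratio about {ratio:.1f}:1")
print(f"max reconstruction error: {max_err:.2e}")
```

Because the test signal really is band-limited, the reconstruction error here is at the level of floating-point rounding. The stored coefficients are floats, so a further quantization step would be needed before comparing against integer samples on disk.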

A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc.). Denoising could be the same step as bandwidth limiting, or a nonlinear filter such as a running median. Bandwidth limiting can be implemented with an FIR/IIR filter. Delta compression is just y[n] = x[n] - x[n-1].
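The last two stages of that pipeline (delta, then a gzip-style compressor) might look like the sketch below. It uses `zlib` from the standard library as a stand-in for gzip, skips the denoising and band-limiting stages, and assumes a non-empty 1D `int32` array; the function names and the test signal are made up for the example.

```python
import zlib
import numpy as np

def delta_compress(x, level=9):
    """Delta-encode a non-empty 1D int32 array, then deflate it with zlib."""
    x = np.asarray(x, dtype=np.int32)
    deltas = np.empty_like(x)
    deltas[0] = x[0]               # first sample is stored as-is
    deltas[1:] = x[1:] - x[:-1]    # y[n] = x[n] - x[n-1]
    return zlib.compress(deltas.tobytes(), level)

def delta_decompress(blob):
    """Invert delta_compress: inflate, then take the running sum."""
    deltas = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
    return np.cumsum(deltas, dtype=np.int32)

# Hypothetical smooth 24-bit signal with a little noise in the low bits.
rng = np.random.default_rng(0)
n = 100_000
signal = np.sin(2 * np.pi * np.arange(n) / n) * (1 << 23)
x = (signal + rng.normal(0, 4, size=n)).astype(np.int32)

blob_plain = zlib.compress(x.tobytes(), 9)
blob_delta = delta_compress(x)
restored = delta_decompress(blob_delta)

print("raw bytes:   ", x.nbytes)
print("zlib only:   ", len(blob_plain))
print("delta + zlib:", len(blob_delta))
print("lossless:    ", bool(np.array_equal(restored, x)))
```

This stage is lossless: the running sum undoes the differences exactly. How much the delta step helps depends on how smooth the data is relative to its noise, so it is worth comparing both sizes on your own arrays.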

Edit: an illustration:

import os.path
import subprocess

import numpy as np

# Create 1M data points of a 24-bit sine wave with 8 bits of Gaussian noise (ENOB=16)
N = 1000000
data = (np.sin(2 * np.pi * np.linspace(0, N, N) / 100) * (1 << 23)
        + np.random.randn(N) * (1 << 7)).astype(np.int32)

np.save('data.npy', data)
print(os.path.getsize('data.npy'))
# 4000080 uncompressed size (as reported in the original run)

# Requires the xz command-line tool; it replaces data.npy with data.npy.xz
subprocess.call('xz -9 data.npy', shell=True)
print(os.path.getsize('data.npy.xz'))
# 1484192 compressed size (as reported in the original run)
# 11.87 bits per sample, ~8 bits of that is noise

# Drop the 8 noisy low bits; floor division keeps the int32 dtype
data_quantized = data // (1 << 8)
np.save('data_quantized.npy', data_quantized)
subprocess.call('xz -9 data_quantized.npy', shell=True)
print(os.path.getsize('data_quantized.npy.xz'))
# 318380 (as reported in the original run)
# still have 16 bits of signal, but it only takes 2.55 bits per sample to store it
