Interactive large plot with ~20 million sample points and gigabytes of data


Question

I have got a problem (with my RAM) here: it's not able to hold the data I want to plot. I do have sufficient HD space. Is there any solution to avoid that "shadowing" of my data-set?

Concretely, I deal with digital signal processing and I have to use a high sample rate. My framework (GNU Radio) saves the values in binary (to avoid using too much disk space). I unpack it. Afterwards I need to plot, and I need the plot to be zoomable and interactive. And that is an issue.

Is there any optimization potential here, or other software/programming languages (like R or so) which can handle larger data-sets? Actually I want much more data in my plots. But I have no experience with other software. GNUplot fails with an approach similar to the following. I don't know R (yet).

import matplotlib.pyplot as plt
import numpy
import struct

"""
plots a cfile

cfile - IEEE single-precision (4-byte) floats, IQ pairs, binary
txt - index,in-phase,quadrature in plaintext

note: directly plotting with numpy results in shadowed functions
"""

# unpacking the cfile dataset
def unpack_set(input_filename, output_filename):
    index = 0   # index of the samples
    output_file = open(output_filename, 'w')    # text output, so 'w' not 'wb'

    with open(input_filename, "rb") as f:

        byte = f.read(4)    # read 1. column of the vector

        while byte != b"":  # read() returns bytes, so compare against b""
            # struct.unpack returns a tuple; take its single element
            floati = struct.unpack('f', byte)[0]    # value of the 1. column
            byte = f.read(4)                        # read 2. column of the vector
            floatq = struct.unpack('f', byte)[0]    # value of the 2. column
            byte = f.read(4)            # next row of the vector, 1. column
            # delimiter format for matplotlib
            lines = ["%d," % index, fmt(floati), ",", fmt(floatq), "\n"]
            output_file.writelines(lines)
            index = index + 1
    output_file.close()     # close() has to be called, not just referenced
    return output_filename

# reformats output (precision configuration here);
# named fmt so it does not shadow the built-in format()
def fmt(value):
    return "%.8f" % value

# start
def main():

    # specify path
    unpacked_file = unpack_set("test01.cfile", "test01.txt")
    # plt.plotfile was removed from recent matplotlib releases,
    # so load the text file and plot index vs. in-phase manually
    data = numpy.loadtxt(unpacked_file, delimiter=',')
    plt.plot(data[:, 0], data[:, 1])

    # optional
    # plt.axes([0, 0.5, 0, 100000]) # for 100k samples
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")

    plt.show()

if __name__ == "__main__":
    main()
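As an aside, the per-sample `struct` loop above can be replaced by a single `numpy.fromfile` call, since a cfile is just raw interleaved float32 I/Q pairs, which is exactly `complex64` on disk. A sketch (the demo file name is made up):

```python
import numpy as np

def load_cfile(path):
    # a cfile is raw interleaved float32 I/Q, i.e. complex64 on disk
    data = np.fromfile(path, dtype=np.complex64)
    return data.real, data.imag   # in-phase, quadrature

# round-trip demo with a tiny synthetic file
np.arange(8, dtype=np.float32).tofile("demo.cfile")
i, q = load_cfile("demo.cfile")
```

This reads the whole file in one call instead of millions of 4-byte reads, and skips the text round-trip entirely.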

Something like plt.swap_on_disk() could cache the stuff on my SSD ;)

Answer

So your data isn't that big, and the fact that you're having trouble plotting it points to issues with the tools. Matplotlib has lots of options and the output is fine, but it's a huge memory hog and it fundamentally assumes your data is small. But there are other options out there.

So as an example, I generated a 20M data-point file 'bigdata.bin' using the following:

#!/usr/bin/env python
import numpy

npts = 20000000
filename = 'bigdata.bin'

def main():
    data = (numpy.random.uniform(0, 1, (npts, 3))).astype(numpy.float32)
    data[:, 2] = 0.1*data[:, 2] + numpy.exp(-((data[:, 1]-0.5)**2.)/(0.25**2))
    # scipy.io.numpyio has since been removed from scipy;
    # numpy's own tofile writes the same raw binary
    data.tofile(filename)

if __name__ == "__main__":
    main()

This generates a file of size ~229MB, which isn't all that big; but you've expressed that you'd like to go to even larger files, so you'll hit memory limits eventually.

Let's concentrate on non-interactive plots first. The first thing to realize is that vector plots with a glyph at each point are going to be a disaster: for each of the 20 M points, most of which overlap anyway, trying to render little crosses or circles or whatever generates huge files and takes tonnes of time. This, I think, is what is sinking matplotlib by default.

Gnuplot has no trouble dealing with this:

gnuplot> set term png
gnuplot> set output 'foo.png'
gnuplot> plot 'bigdata.bin' binary format="%3float32" using 2:3 with dots

And even Matplotlib can be made to behave with some caution (choosing a raster back end, and using pixels to mark points):

#!/usr/bin/env python
import numpy
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

datatype=[('index',numpy.float32), ('floati',numpy.float32), 
        ('floatq',numpy.float32)]
filename='bigdata.bin'

def main():
    data = numpy.memmap(filename, datatype, 'r') 
    plt.plot(data['floati'],data['floatq'],'r,')
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")
    plt.savefig('foo2.png')

if __name__ == "__main__":
    main()  

Now, if you want interactive, you're going to have to bin the data to plot, and zoom in on the fly. I don't know of any python tools that will help you do this offhand.
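Rolling a simple version by hand is not hard, though. A minimal min/max binning sketch (names are my own, not from any library): keep each bin's minimum and maximum so narrow spikes survive the downsampling, and re-bin the visible slice on every zoom.

```python
import numpy as np

def minmax_decimate(y, n_bins):
    # Reduce y to 2*n_bins points, keeping each bin's min and max so
    # spikes survive; call again on the visible slice when zooming.
    usable = (len(y) // n_bins) * n_bins    # drop the ragged tail
    chunks = y[:usable].reshape(n_bins, -1)
    out = np.empty(2 * n_bins, dtype=y.dtype)
    out[0::2] = chunks.min(axis=1)
    out[1::2] = chunks.max(axis=1)
    return out

# 20 M samples collapse to 4000 plottable points
y = np.random.randn(20_000_000).astype(np.float32)
small = minmax_decimate(y, 2000)
```

The reduced array is small enough to hand to any plotting backend at interactive rates.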

On the other hand, plotting-big-data is a pretty common task, and there are tools that are up for the job. Paraview is my personal favourite, and VisIt is another one. They both are mainly for 3D data, but Paraview in particular does 2d as well, and is very interactive (and even has a Python scripting interface). The only trick will be to write the data into a file format that Paraview can easily read.
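One easy such format is plain CSV, which Paraview's table reader understands. A minimal sketch (the column names and file names are illustrative, not required by Paraview):

```python
import numpy as np

def write_paraview_csv(data, path):
    # Paraview's CSV reader takes comma-separated columns with a
    # header row naming each column
    np.savetxt(path, data, fmt='%.8f', delimiter=',',
               header='index,floati,floatq', comments='')

# tiny demo -- in practice, memmap bigdata.bin and pass it (in chunks)
demo = np.arange(9, dtype=np.float32).reshape(3, 3)
write_paraview_csv(demo, 'demo.csv')
```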
