Interactive large plot with ~20 million sample points and gigabytes of data


Problem description

I have got a problem (with my RAM) here: it's not able to hold the data I want to plot. I do have sufficient HD space. Is there any solution to avoid that "shadowing" of my data-set?

Concretely I deal with Digital Signal Processing and I have to use a high sample-rate. My framework (GNU Radio) saves the values (to avoid using too much disk space) in binary. I unpack it. Afterwards I need to plot. I need the plot zoomable, and interactive. And that is an issue.

Is there any optimization potential to this, or another software/programming language (like R or so) which can handle larger data-sets? Actually I want much more data in my plots. But I have no experience with other software. GNUplot fails, with a similar approach to the following. I don't know R (yet).

import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
import struct

"""
plots a cfile

cfile - IEEE single-precision (4-byte) floats, IQ pairs, binary
txt - index,in-phase,quadrature in plaintext

note: directly plotting with numpy results in shadowed functions
"""

# unpacking the cfile dataset
def unpack_set(input_filename, output_filename):
    index = 0   # index of the samples
    output_filename = open(output_filename, 'wb')

    with open(input_filename, "rb") as f:

        byte = f.read(4)    # read 1. column of the vector

        while byte != b"":                       # loop until EOF (read() returns b"" at end)
            floati = struct.unpack('f', byte)[0]    # in-phase value from 1st column
            byte = f.read(4)                        # read 2nd column
            floatq = struct.unpack('f', byte)[0]    # quadrature value from 2nd column
            byte = f.read(4)                        # next row, read 1st column
            # delimiter format for matplotlib
            lines = ["%d," % index, format(floati), ",",  format(floatq), "\n"]
            output_filename.writelines(lines)
            index = index + 1
    output_filename.close()
    return output_filename.name

# reformats output (precision configuration here)
# (note: this shadows the built-in format(); harmless in this script)
def format(value):
    return "%.8f" % value

# start
def main():

    # specify path
    unpacked_file = unpack_set("test01.cfile", "test01.txt")
    # pass file reference to matplotlib
    fname = str(unpacked_file)
    # plt.plotfile was removed in Matplotlib 3.5; on newer versions,
    # load with numpy.loadtxt and call plt.plot instead
    plt.plotfile(fname, cols=(0,1)) # index vs. in-phase

    # optional
    # plt.axes([0, 0.5, 0, 100000]) # for 100k samples
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")

    plt.show()

if __name__ == "__main__":
    main()
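For what it's worth, the byte-by-byte struct loop above can also be done in a single vectorized pass with numpy, which is much faster for millions of samples. A minimal sketch assuming the same interleaved I/Q float32 layout (the function name is mine, not from the question):

```python
import numpy as np

def unpack_set_np(input_filename, output_filename):
    # Interpret the whole file as interleaved IEEE single-precision
    # floats: I, Q, I, Q, ...
    iq = np.fromfile(input_filename, dtype=np.float32)
    i, q = iq[0::2], iq[1::2]
    index = np.arange(len(i))
    # Write "index,in-phase,quadrature" plaintext rows, matching the
    # loop version's output format
    np.savetxt(output_filename, np.column_stack((index, i, q)),
               fmt=("%d", "%.8f", "%.8f"), delimiter=",")
    return output_filename
```

This keeps the whole array in memory once instead of Python-level iteration, so the unpacking step stops being the bottleneck.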

Something like plt.swap_on_disk() could cache the stuff on my SSD ;)

Answer

So your data isn't that big, and the fact that you're having trouble plotting it points to issues with the tools. Matplotlib has lots of options and the output is fine, but it's a huge memory hog and it fundamentally assumes your data is small. But there are other options out there.

So as an example, I generated a 20M data-point file 'bigdata.bin' using the following:

#!/usr/bin/env python
import numpy

npts = 20000000
filename = 'bigdata.bin'

def main():
    data = (numpy.random.uniform(0, 1, (npts, 3))).astype(numpy.float32)
    data[:, 2] = 0.1*data[:, 2] + numpy.exp(-((data[:, 1]-0.5)**2.)/(0.25**2))
    # scipy.io.numpyio.fwrite has been removed from SciPy;
    # ndarray.tofile writes the same raw binary
    data.tofile(filename)

if __name__ == "__main__":
    main()

This generates a file of size ~229MB, which isn't all that big; but you've expressed that you'd like to go to even larger files, so you'll hit memory limits eventually.

Let's concentrate on non-interactive plots first. The first thing to realize is that vector plots with glyphs at each point are going to be a disaster -- for each of the 20 M points, most of which are going to overlap anyway, trying to render little crosses or circles or something will generate huge files and take tonnes of time. This, I think, is what is sinking matplotlib by default.

Gnuplot has no trouble dealing with this:

gnuplot> set term png
gnuplot> set output 'foo.png'
gnuplot> plot 'bigdata.bin' binary format="%3float32" using 2:3 with dots

And even Matplotlib can be made to behave with some caution (choosing a raster back end, and using pixels to mark points):

#!/usr/bin/env python
import numpy
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

datatype=[('index',numpy.float32), ('floati',numpy.float32), 
        ('floatq',numpy.float32)]
filename='bigdata.bin'

def main():
    data = numpy.memmap(filename, datatype, 'r') 
    plt.plot(data['floati'],data['floatq'],'r,')
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")
    plt.savefig('foo2.png')

if __name__ == "__main__":
    main()  

Now, if you want interactive, you're going to have to bin the data to plot, and zoom in on the fly. I don't know of any python tools that will help you do this offhand.
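One way to do that binning yourself is min/max decimation: reduce each bin of samples to its extrema, so narrow spikes survive the downsampling and can be recomputed over the visible slice on every zoom. A rough sketch (my own illustration, not a tool from the answer):

```python
import numpy as np

def minmax_decimate(y, nbins):
    """Reduce y to 2*nbins points, keeping each bin's min and max
    so narrow spikes stay visible after downsampling."""
    n = (len(y) // nbins) * nbins       # drop the ragged tail
    chunks = y[:n].reshape(nbins, -1)   # one row per bin
    out = np.empty(2 * nbins, dtype=y.dtype)
    out[0::2] = chunks.min(axis=1)      # bin minima at even slots
    out[1::2] = chunks.max(axis=1)      # bin maxima at odd slots
    return out
```

On a zoom event you would slice the memmapped array to the visible x-range and re-decimate just that slice, so only ~2×(screen width) points ever reach the plotting backend.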

On the other hand, plotting big data is a pretty common task, and there are tools that are up for the job. Paraview is my personal favourite, and VisIt is another one. They are both mainly for 3D data, but Paraview in particular does 2D as well, and is very interactive (and even has a Python scripting interface). The only trick will be to write the data into a file format that Paraview can easily read.
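One simple route for that last step is plain CSV with a header row, which Paraview's CSV reader accepts. A sketch assuming the three-column float32 layout of 'bigdata.bin' from above (function and column names are mine):

```python
import numpy as np

def bin_to_csv(binfile, csvfile):
    # Assumes rows of three float32 values, as in 'bigdata.bin' above
    data = np.fromfile(binfile, dtype=np.float32).reshape(-1, 3)
    # comments="" stops numpy from prefixing the header with '#',
    # which would confuse a CSV reader
    np.savetxt(csvfile, data, delimiter=",", fmt="%.8f",
               header="index,x,y", comments="")
```

For files too big to round-trip through text, Paraview also reads binary formats such as VTK, but CSV is the quickest thing to try first.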
