Charting massive amounts of data

Problem description

We are currently using ZedGraph to draw a line chart of some data. The input data comes from a file of arbitrary size, so we do not know the maximum number of data points in advance. However, by opening the file and reading the header, we can find out how many data points the file contains.

The file format is essentially [time (double), value (double)]. However, the entries are not uniform along the time axis. There may be no points at all between, say, t = 0 sec and t = 10 sec, but there might be 100K entries between t = 10 sec and t = 11 sec, and so on.
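To make the record layout concrete, here is a minimal sketch of streaming such records in fixed-size chunks, so memory use stays bounded regardless of file size. It is written in Python for brevity. The exact byte layout is an assumption (two little-endian 8-byte doubles per record), and `iter_records` and `chunk_records` are hypothetical names; the header is assumed to have been skipped already.

```python
import io
import struct

# Assumption: each record is two little-endian 8-byte doubles (time, value).
RECORD = struct.Struct("<dd")

def iter_records(stream, chunk_records=65536):
    """Yield (time, value) pairs from a binary stream, one chunk at a time."""
    chunk_bytes = RECORD.size * chunk_records
    while True:
        chunk = stream.read(chunk_bytes)
        # Ignore a trailing partial record, if the file ends mid-record.
        usable = len(chunk) - len(chunk) % RECORD.size
        if usable == 0:
            break
        for t, v in RECORD.iter_unpack(chunk[:usable]):
            yield t, v
```

Only one chunk is held in memory at a time, which matters on a 32-bit process.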

As an example, our test dataset file is ~2.6 GB and has 324M points. We'd like to show the entire graph to the user and let her navigate through the chart. However, loading 324M points into ZedGraph is not only impossible (we're on a 32-bit machine), but also not useful, since there is no point in having that many points on the screen.

Using the FilteredPointList feature of ZedGraph also appears to be out of the question, since that requires loading the entire data first and then filtering it.

So, unless we're missing something, it appears that our only solution is to somehow decimate the data. However, as we keep working on it, we're running into a lot of problems:

1- How do we decimate data that is not arriving uniformly in time?

2- Since the entire data can't be loaded into memory, any algorithm needs to work on the disk and so needs to be designed carefully.

3- How do we handle zooming in and out, especially when the data is not uniform on the x-axis?

If the data were uniform, upon initial load of the graph we could Seek() through the file by a predefined number of entries, pick every Nth sample, and feed it to ZedGraph. However, since the data is not uniform, we have to be more intelligent in choosing the samples to display, and we can't come up with any intelligent algorithm that would not have to read the entire file.
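For reference, the uniform-data approach described above might look roughly like this. It is a sketch in Python rather than .NET, though the same pattern maps onto `FileStream.Seek`. The record layout (two little-endian doubles), the `header_size` parameter, and the function name `sample_every_nth` are all assumptions made for illustration; the real header layout is not given here.

```python
import io
import struct

# Assumed layout: (time, value) as little-endian 8-byte doubles.
RECORD = struct.Struct("<dd")

def sample_every_nth(stream, header_size, count, step):
    """Read records 0, step, 2*step, ... by seeking, skipping everything else."""
    points = []
    for index in range(0, count, step):
        stream.seek(header_size + index * RECORD.size)
        data = stream.read(RECORD.size)
        if len(data) < RECORD.size:
            break  # file is shorter than the header claimed
        points.append(RECORD.unpack(data))
    return points
```

With 324M records and a target of about 2,000 plotted points, `step` would be around 162,000. Because the stride is by record index and not by time, dense stretches of the file get proportionally more samples and sparse stretches may get none, which is exactly the limitation described above.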

I apologize that the question does not have razor-sharp specificity, but I hope I have explained the nature and scope of our problem.

We're on Windows 32-bit, .NET 4.0.

Recommended answer

I've needed this before, and it's not easy to do. I ended up writing my own graph component because of this requirement. It turned out better in the end because I put in all the features we needed.

Basically, you need to get the range of the data (the min and max possible or needed index values), subdivide it into segments (let's say 100 segments), and then determine a value for each segment by some algorithm (average value, median value, etc.). Then you plot based on those 100 summarized elements. This is much faster than trying to plot millions of points :-).
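As a rough illustration of that idea, the sketch below summarizes a stream of (time, value) pairs into a fixed number of time segments in a single pass, keeping only a few numbers per segment in memory. It is written in Python for brevity. The function name `summarize` is hypothetical, and tracking the min and max alongside the mean is this sketch's choice, not something specified above. A median would need more than running totals, so it is left out.

```python
def summarize(points, t_min, t_max, segments=100):
    """Single pass over (time, value) pairs; one summary tuple per time segment.

    Returns a list of (segment_center_time, mean, low, high); the last three
    are None for segments that received no points.
    """
    if t_max <= t_min:
        raise ValueError("t_max must be greater than t_min")
    width = (t_max - t_min) / segments
    count = [0] * segments
    total = [0.0] * segments
    low = [None] * segments
    high = [None] * segments
    for t, v in points:
        if t < t_min or t > t_max:
            continue  # outside the visible range
        # A point exactly at t_max belongs to the last segment.
        i = min(int((t - t_min) / width), segments - 1)
        count[i] += 1
        total[i] += v
        low[i] = v if low[i] is None else min(low[i], v)
        high[i] = v if high[i] is None else max(high[i], v)
    result = []
    for i in range(segments):
        mean = total[i] / count[i] if count[i] else None
        center = t_min + (i + 0.5) * width
        result.append((center, mean, low[i], high[i]))
    return result
```

Because segments are defined by time and not by record index, non-uniform spacing is handled naturally. Zooming can reuse the same function with a narrower `t_min` and `t_max`.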

So what I am saying is similar to what you are saying. You mention that you do not want to plot every Xth element because there might be a long stretch of time (index values on the x-axis) between elements. What I am saying is that for each subdivision of the data, you determine the best value and take that as the data point. My method is index-value based, so in your example of no data between the 0 sec and 10 sec index values I would still put data points there; they would just all have the same value. The point is to summarize the data before you plot it. Think through your summarization algorithm carefully: there are lots of ways to do it, so choose the one that works for your application. You might get away with not writing your own graph component and just writing the data summarization algorithm.
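One way to read "they would just all have the same value" is to carry the last known value forward across empty segments. The Python sketch below does that for a list of per-segment values in which `None` marks a segment with no data. The function name and the choice of carry-forward (as opposed to, say, interpolation) are assumptions, since the exact fill rule is not stated above.

```python
def fill_empty_segments(values):
    """Replace None entries with the most recent non-empty value.

    Leading gaps (before any data) take the first non-empty value instead.
    """
    # Seed with the first real value so leading gaps are filled too.
    last = next((v for v in values if v is not None), None)
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```

If every segment is empty, the list is returned unchanged as all `None`, and the caller has nothing to plot.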
