Charting massive amounts of data

Question

We are currently using ZedGraph to draw a line chart of some data. The input data comes from a file of arbitrary size, so we do not know the maximum number of data points in advance. However, by opening the file and reading the header, we can find out how many data points the file contains.

The file format is essentially [time (double), value (double)]. However, the entries are not uniformly spaced along the time axis. There may be no points at all between, say, t = 0 sec and t = 10 sec, but there might be 100K entries between t = 10 sec and t = 11 sec, and so on.

As an example, our test dataset file is ~2.6 GB and has 324M points. We'd like to show the entire graph to the user and let her navigate through the chart. However, loading 324M points into ZedGraph is not only impossible (we're on a 32-bit machine) but also not useful, since there is no point in having so many points on the screen.

Using the FilteredPointList feature of ZedGraph also appears to be out of the question, since that requires loading the entire data set first and then filtering it.

So, unless we're missing something, it appears that our only solution is to somehow decimate the data. However, as we keep working on it, we're running into a lot of problems:

1- How do we decimate data that is not arriving uniformly in time?

2- Since the entire data set can't be loaded into memory, any algorithm needs to work on the disk and so needs to be designed carefully.

3- How do we handle zooming in and out, especially when the data is not uniform on the x-axis?

If the data were uniform, upon initial load of the graph we could Seek() through the file by a predefined number of entries, pick every Nth sample, and feed those to ZedGraph. However, since the data is not uniform, we have to be more intelligent in choosing the samples to display, and we can't come up with any intelligent algorithm that would not have to read the entire file.
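For reference, the uniform-case idea described above (seek, then take every Nth record) could look roughly like the following. This is only a sketch in Python for brevity; the same logic maps onto a .NET `FileStream`. The record layout (two little-endian doubles), the `header_size` parameter, and the function name are assumptions, since the question does not specify the header format.

```python
import io
import struct

# Assumed layout: each record is a little-endian (time, value) pair of doubles.
RECORD = struct.Struct("<dd")

def sample_every_nth(f, count, n, header_size=0):
    """Read every n-th record by seeking, without reading the records in between."""
    points = []
    for i in range(0, count, n):
        f.seek(header_size + i * RECORD.size)
        points.append(RECORD.unpack(f.read(RECORD.size)))
    return points

# Usage with an in-memory stand-in for the file: 10 records, keep every 3rd.
data = b"".join(RECORD.pack(float(t), t * 2.0) for t in range(10))
sampled = sample_every_nth(io.BytesIO(data), count=10, n=3)
print(sampled)
```

As the question notes, this picks samples by position in the file, so with non-uniform timestamps a dense burst and a long gap are treated the same way.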

I apologize that the question does not have razor-sharp specificity, but I hope I have explained the nature and scope of our problem.

We're on 32-bit Windows, .NET 4.0.

Recommended answer

I've needed this before, and it's not easy to do. I ended up writing my own graph component because of this requirement. It turned out better in the end, because I put in all the features we needed.

Basically, you need to get the range of the data (the min and max possible/needed index values), subdivide it into segments (let's say 100 segments), and then determine a value for each segment by some algorithm (average value, median value, etc.). Then you plot based on those 100 summarized elements. This is much faster than trying to plot millions of points :-).
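A minimal sketch of that segment-and-summarize step, written in Python for brevity, might look like this. It consumes the points as a stream, so only the per-segment accumulators stay in memory. The function name and the choice to keep min and max alongside the mean are assumptions for illustration (the answer itself only mentions average or median); it also assumes `t_max > t_min`.

```python
import math
from typing import Iterable, List, Optional, Tuple

Summary = Tuple[float, float, float, float]  # (segment center time, min, max, mean)

def summarize(points: Iterable[Tuple[float, float]],
              t_min: float, t_max: float,
              segments: int = 100) -> List[Optional[Summary]]:
    """Bucket (time, value) pairs into equal-width time segments.

    Returns one entry per segment, or None for a segment with no points.
    """
    width = (t_max - t_min) / segments
    count = [0] * segments
    total = [0.0] * segments
    low = [math.inf] * segments
    high = [-math.inf] * segments

    for t, v in points:
        if t < t_min or t > t_max:
            continue  # outside the requested range
        # Clamp so that t == t_max falls into the last segment.
        i = min(int((t - t_min) / width), segments - 1)
        count[i] += 1
        total[i] += v
        low[i] = min(low[i], v)
        high[i] = max(high[i], v)

    result: List[Optional[Summary]] = []
    for i in range(segments):
        if count[i] == 0:
            result.append(None)
        else:
            center = t_min + (i + 0.5) * width
            result.append((center, low[i], high[i], total[i] / count[i]))
    return result

# Usage: one point at t = 0, then a burst between t = 10 and t = 11.
pts = [(0.0, 1.0), (10.1, 2.0), (10.2, 4.0), (10.9, 6.0), (11.0, 3.0)]
summary = summarize(pts, t_min=0.0, t_max=11.0, segments=11)
print(summary[0], summary[10])
```

Because the segments are defined in time rather than by record position, a burst of 100K entries and a 10-second gap each collapse to a fixed number of plotted points. Note that this sketch still makes one full pass over the points it is given.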

So what I am saying is similar to what you are saying. You mention that you do not want to plot every X elements because there might be a long stretch of time (index values on the x-axis) between elements. What I am saying is that, for each subdivision of the data, you determine what the best value is and take that as the data point. My method is based on index values, so in your example with no data between the 0 sec and 10 sec index values, I would still put data points there; they would just all have the same value.
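One way to read "they would just all have the same value" is to carry the last known value forward across empty segments. The sketch below (Python, for brevity) takes that interpretation; the function name, the use of `None` to mark an empty segment, and the `default` used before the first data point are assumptions, not something the answer specifies.

```python
from typing import List, Optional

def fill_gaps(values: List[Optional[float]], default: float = 0.0) -> List[float]:
    """Replace empty segments (None) with the most recent non-empty value."""
    filled = []
    last = default  # used only for empty segments before the first data point
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Usage: data at segment 0, nothing for the next three, then data again.
print(fill_gaps([1.0, None, None, None, 3.75]))
```

Whether a flat line across a gap is the right picture depends on the application; leaving the gap unplotted is an equally plausible choice.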

The point is to summarize the data before you plot it. Think through your summarization algorithm carefully; there are lots of ways to do it, so choose the one that works for your application.

You might get away with not writing your own graph component and just writing the data summarization algorithm.
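For zooming, the same summarization can be rerun on just the visible time window. If the records happen to be stored in ascending time order (an assumption; the question does not say so), the window's start and end positions can be found by binary search directly on disk, which takes roughly 30 seeks for 324M records. A hedged sketch in Python, with an assumed record layout of two little-endian doubles and a hypothetical `header_size` parameter:

```python
import io
import struct

# Assumed layout: each record is a little-endian (time, value) pair of doubles.
RECORD = struct.Struct("<dd")

def first_index_at_or_after(f, count, t, header_size=0):
    """Binary search on disk for the first record whose time is >= t.

    Only valid if records are sorted by time (an assumption).
    """
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(header_size + mid * RECORD.size)
        time, _ = RECORD.unpack(f.read(RECORD.size))
        if time < t:
            lo = mid + 1
        else:
            hi = mid
    return lo

# Usage with an in-memory stand-in for the file.
times = [0.0, 10.1, 10.2, 10.9, 11.0, 25.0]
f = io.BytesIO(b"".join(RECORD.pack(t, 1.0) for t in times))
start = first_index_at_or_after(f, len(times), 10.0)
end = first_index_at_or_after(f, len(times), 11.5)
print(start, end)  # records start..end-1 fall inside [10.0, 11.5)
```

Only the records between `start` and `end` then need to be read and summarized for the zoomed view.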
