Dealing with very large datasets & just-in-time loading

Problem description

I have a .NET application written in C# (.NET 4.0). In this application, we have to read a large dataset from a file and display the contents in a grid-like structure. So, to accomplish this, I placed a DataGridView on the form. It has 3 columns, and all column data comes from the file. Initially, the file had about 600,000 records, corresponding to 600,000 lines in the DataGridView.

I quickly found out that the DataGridView collapses with such a large dataset, so I had to switch to Virtual Mode. To accomplish this, I first read the file completely into 3 different arrays (corresponding to the 3 columns), and then, when the CellValueNeeded event fires, I supply the correct values from the arrays.

However, there can be a huge (HUGE!) number of records in this file, as we quickly found out. When the record count is very large, reading all the data into an array or a List<>, etc., appears not to be feasible. We quickly ran into memory allocation errors (OutOfMemoryException).

We got stuck there, but then realized: why read all the data into arrays first? Why not read the file on demand as the CellValueNeeded event fires? So that's what we do now: we open the file but do not read anything, and as CellValueNeeded events fire, we first Seek() to the correct position in the file, and then read the corresponding data.
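The on-demand approach above can be sketched as follows. This is a minimal illustration, not the poster's actual code: it assumes fixed-width binary records of three Int32 fields (the post does not describe the real file format), and the class and method names are invented for the example.

```csharp
// Sketch of on-demand reads with Seek(), assuming fixed-width binary
// records of three Int32 fields (12 bytes per row). The record layout
// is an assumption; the original post does not describe the file format.
using System;
using System.IO;

public class RecordFile : IDisposable
{
    private const int FieldsPerRecord = 3;
    private const int RecordSize = FieldsPerRecord * sizeof(int); // 12 bytes
    private readonly FileStream stream;
    private readonly BinaryReader reader;

    public RecordFile(string path)
    {
        stream = new FileStream(path, FileMode.Open, FileAccess.Read);
        reader = new BinaryReader(stream);
    }

    public long RowCount => stream.Length / RecordSize;

    // Called from the CellValueNeeded handler: seek straight to the cell.
    public int ReadCell(long row, int column)
    {
        stream.Seek(row * RecordSize + column * sizeof(int), SeekOrigin.Begin);
        return reader.ReadInt32();
    }

    public void Dispose()
    {
        reader.Dispose();
        stream.Dispose();
    }
}
```

Note that every cell costs one seek plus one small read here, which is exactly why this feels sluggish: scrolling a screenful of rows issues dozens of tiny disk operations.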

This is the best we could come up with, but, first of all, this is quite slow, which makes the application sluggish and not user-friendly. Second, we can't help but think that there must be a better way to accomplish this. For example, some binary editors (like HxD) are blindingly fast for any file size, so I'd like to know how this can be achieved.
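One technique fast hex editors commonly rely on is memory-mapped I/O: the file is mapped into the process's address space and the OS pages data in only as it is touched, so "opening" a multi-gigabyte file is nearly instant. A sketch using .NET's MemoryMappedFile, under the same hypothetical fixed-width record layout as above:

```csharp
// Sketch of random access via a memory-mapped file. The OS pages in only
// the regions actually read, so file size barely matters. The 12-byte
// record layout is an illustrative assumption, as before.
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedRecords
{
    private const int RecordSize = 12;          // 3 Int32 fields (assumed)

    public static int ReadCell(string path, long row, int column)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor())
        {
            // Direct random access at any offset, no explicit Seek/Read.
            return accessor.ReadInt32(row * RecordSize + column * sizeof(int));
        }
    }
}
```

In real use you would create the mapping once when the file is opened and keep the view accessor alive for the lifetime of the grid, rather than re-mapping per cell as this compressed example does.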

Oh, and to add to our problems, in virtual mode of the DataGridView, when we set the RowCount to the available number of rows in the file (say 16,000,000), it takes a while for the DataGridView to even initialize itself. Any comments on this 'problem' would be appreciated as well.

Thanks

Recommended answer

If you can't fit your entire data set in memory, then you need a buffering scheme. Rather than reading just the amount of data needed to fill the DataGridView in response to CellValueNeeded, your application should anticipate the user's actions and read ahead. So, for example, when the program first starts up, it should read the first 10,000 records (or maybe only 1,000, or perhaps 100,000, whatever is reasonable in your case). Then, CellValueNeeded requests can be filled immediately from memory.
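The buffering scheme can be sketched as a chunk cache that refills from disk only on a miss; everything else is served from memory. The chunk size and the loadChunk callback (which would wrap the actual file reads) are illustrative assumptions, not code from the answer:

```csharp
// Sketch of a chunked row cache: keep one contiguous block of rows in
// memory and go to disk only when a request falls outside it. Chunk
// boundaries are aligned so forward scrolling mostly hits the cache.
using System;

public class RowCache
{
    private readonly int chunkSize;
    private readonly Func<long, int, int[][]> loadChunk; // (firstRow, count) -> rows
    private int[][] chunk = new int[0][];
    private long chunkStart = -1;

    public RowCache(int chunkSize, Func<long, int, int[][]> loadChunk)
    {
        this.chunkSize = chunkSize;
        this.loadChunk = loadChunk;
    }

    public int GetCell(long row, int column)
    {
        if (row < chunkStart || row >= chunkStart + chunk.Length)
        {
            // Cache miss: load the aligned chunk containing this row
            // (and the rows after it, anticipating forward scrolling).
            chunkStart = (row / chunkSize) * chunkSize;
            chunk = loadChunk(chunkStart, chunkSize);
        }
        return chunk[(int)(row - chunkStart)][column];
    }
}
```

The CellValueNeeded handler then becomes a single call to GetCell, and only one scroll in every `chunkSize` rows pays the disk cost.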

As the user moves through the grid, your program stays, as much as possible, one step ahead of the user. There might be short pauses if the user jumps ahead of you (say, jumps from the front straight to the end) and you have to go out to disk in order to fulfill the request.

That buffering is usually best accomplished by a separate thread, although synchronization can sometimes be an issue if the thread is reading ahead in anticipation of the user's next action, and then the user does something completely unexpected like jump to the start of the list.
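One hedged sketch of that threading arrangement: a lock guards the shared cache, the background task speculatively loads rows, and an unexpected jump simply falls back to a synchronous load rather than waiting on the prefetcher. All names here are invented for the example.

```csharp
// Sketch of background read-ahead with a lock-guarded cache. loadRow
// stands in for the actual per-record disk read (an assumption).
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class Prefetcher
{
    private readonly object gate = new object();
    private readonly Dictionary<long, int[]> rows = new Dictionary<long, int[]>();
    private readonly Func<long, int[]> loadRow; // reads one record from disk

    public Prefetcher(Func<long, int[]> loadRow) { this.loadRow = loadRow; }

    // Foreground path: return the row, loading synchronously on a miss,
    // so an unexpected jump never waits for the background thread.
    public int[] GetRow(long row)
    {
        lock (gate)
        {
            if (rows.TryGetValue(row, out var cached)) return cached;
        }
        var loaded = loadRow(row);          // disk I/O outside the lock
        lock (gate) { rows[row] = loaded; }
        return loaded;
    }

    // Background path: speculatively load the rows the user is likely
    // to hit next (e.g. the next screenfuls in scroll direction).
    public Task PrefetchAsync(long firstRow, int count)
    {
        return Task.Run(() =>
        {
            for (long r = firstRow; r < firstRow + count; r++)
                GetRow(r); // same code path; the lock keeps the cache consistent
        });
    }
}
```

A real implementation would also evict old chunks to bound memory and cancel a stale prefetch when the user jumps, but the lock-around-the-shared-cache shape is the core of it.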

16 million records isn't really all that many records to keep in memory, unless the records are very large. Or if you don't have much memory on your server. Certainly, 16 million is nowhere near the maximum size of a List<T>, unless T is a value type (structure). How many gigabytes of data are you talking about here?
