Efficiently read data from a structured file in C/C++


Problem description

I have a file with the following layout:

[Image: diagram of the file layout]

The file consists of 2 parts: header and data.

The data part is divided into equally sized pages. Each page holds data for one specific metric, and a single metric may need multiple pages, which are not necessarily consecutive. Each page consists of a page header and a page body. The page header has a field called "Next page", which is the index of the next page holding data for the same metric. The page body holds the actual data. All pages have the same fixed size: 20 bytes for the header and 800 bytes for the body (if there is less than 800 bytes of data, the remainder is filled with zeros).
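The page geometry described above can be sketched in C++ as follows. Only the 20-byte and 800-byte sizes come from the question. The position and width of the "Next page" field, the 0-based page indexing, and the names `pageOffset` and `readNextPage` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sizes stated in the question.
constexpr std::size_t kPageHeaderSize = 20;
constexpr std::size_t kPageBodySize   = 800;
constexpr std::size_t kPageSize       = kPageHeaderSize + kPageBodySize;  // 820 bytes

// ASSUMPTION: "Next page" is a 32-bit unsigned integer stored in the first
// four bytes of the page header, in host byte order. The question does not
// specify the header layout.
constexpr std::size_t kNextPageFieldOffset = 0;

// Byte offset of a page within the file, given where the data part starts.
// ASSUMPTION: page indices are 0-based and count from the start of the data part.
inline std::uint64_t pageOffset(std::uint64_t dataStart, std::uint64_t pageIndex) {
    return dataStart + pageIndex * kPageSize;
}

// Reads the "Next page" field out of a raw page buffer of kPageSize bytes.
// memcpy avoids unaligned access and strict-aliasing problems.
inline std::uint32_t readNextPage(const unsigned char* page) {
    assert(page != nullptr);
    std::uint32_t next;
    std::memcpy(&next, page + kNextPageFieldOffset, sizeof next);
    return next;
}
```

Because every page has the same size, locating a page is a single multiplication, so no index structure beyond the page number is needed.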

The header part consists of 20,000 elements, each holding information about one specific metric (point 1 -> point 20000). Each element has a field called "first page", which is the index of the first page holding data for that metric.
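Taken together, the "first page" and "Next page" fields form a singly linked list of pages per metric. A minimal sketch of walking such a chain is shown below. It assumes the "Next page" values have already been loaded into a vector, and it uses a hypothetical end-of-chain sentinel `kNoPage`, since the question does not say how the last page of a metric is marked.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// ASSUMPTION: hypothetical sentinel meaning "no further page".
constexpr std::uint32_t kNoPage = 0xFFFFFFFFu;

// Returns the page indices of one metric in chain order.
// nextPage[i] is the "Next page" field of page i.
std::vector<std::uint32_t> collectChain(std::uint32_t firstPage,
                                        const std::vector<std::uint32_t>& nextPage) {
    std::vector<std::uint32_t> chain;
    std::uint32_t page = firstPage;
    while (page != kNoPage) {
        if (page >= nextPage.size())
            throw std::out_of_range("page index outside the data part");
        // A chain longer than the number of pages can only mean a cycle.
        if (chain.size() >= nextPage.size())
            throw std::runtime_error("cycle in page chain");
        chain.push_back(page);
        page = nextPage[page];
    }
    assert(chain.size() <= nextPage.size());
    return chain;
}
```

For example, with `nextPage = {2, kNoPage, 4, kNoPage, 1}`, a metric whose first page is 0 occupies pages 0, 2, 4 and 1, in that order.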

The file can be up to 10 GB.

Requirement: re-order the data in the file in the shortest possible time, so that the pages holding data for a single metric are consecutive and the metrics run from metric 1 to metric 20000 in alphabetical order (the header part must be updated accordingly).
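One way to read this requirement is that the new layout can be planned from the index fields alone, before any page body is moved. The sketch below works under several assumptions: each metric has a name to sort by (the `Metric` struct and its fields are hypothetical), `kNoPage` is a made-up end-of-chain sentinel, and the "Next page" values are already in memory. It sorts the metrics, assigns consecutive new page indices, and records which old page goes to which new position.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

constexpr std::uint32_t kNoPage = 0xFFFFFFFFu;  // ASSUMPTION: hypothetical sentinel

struct Metric {                 // hypothetical in-memory form of a header element
    std::string   name;         // ASSUMPTION: the key used for alphabetical order
    std::uint32_t firstPage;    // index of the metric's first page
};

struct Plan {
    std::vector<std::uint32_t> copyOrder;    // copyOrder[newIndex] = old page index
    std::vector<std::uint32_t> newNextPage;  // "Next page" value for each new index
};

// Sorts the metrics by name, rewrites each firstPage to its new index,
// and returns the order in which the old pages must be written out.
Plan planReorder(std::vector<Metric>& metrics,
                 const std::vector<std::uint32_t>& nextPage) {
    std::sort(metrics.begin(), metrics.end(),
              [](const Metric& a, const Metric& b) { return a.name < b.name; });
    Plan plan;
    for (Metric& m : metrics) {
        std::uint32_t page = m.firstPage;
        if (page != kNoPage)
            m.firstPage = static_cast<std::uint32_t>(plan.copyOrder.size());
        while (page != kNoPage) {
            if (page >= nextPage.size())
                throw std::out_of_range("page index outside the data part");
            if (plan.copyOrder.size() >= nextPage.size())
                throw std::runtime_error("cycle or shared page in chains");
            plan.copyOrder.push_back(page);
            page = nextPage[page];
            // Pages of one metric are now adjacent, so "next" is simply +1.
            const auto newIndex = static_cast<std::uint32_t>(plan.copyOrder.size() - 1);
            plan.newNextPage.push_back(page == kNoPage ? kNoPage : newIndex + 1);
        }
    }
    assert(plan.copyOrder.size() == plan.newNextPage.size());
    return plan;
}
```

With the plan in hand, the output file can be written strictly sequentially, and only the reads from the input file remain scattered.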

The obvious approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially the reading.

Is there a more efficient way?

Recommended answer

I first read the header part, then sorted the metrics in alphabetical order. For each metric in the sorted list, I read all of its data from the input file and wrote it to the output file. To remove the bottleneck in the reading step, I used memory mapping. With memory mapping, the execution time for a 5 GB input file was 5 to 6 times shorter than without it. This solves my problem for now. However, I will also consider @utnapistim's suggestions.
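The answer does not show its code, so the following is only a sketch of what the memory-mapping step might look like on a POSIX system, not the author's implementation. `MappedFile` and `writePagesInOrder` are hypothetical names. The sketch assumes a 64-bit process (so that a multi-gigabyte file fits in the address space) and a precomputed list of old page indices in output order. It copies whole pages as they are and does not rewrite the "Next page" fields or the file header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

constexpr std::size_t kPageSize = 820;  // 20-byte header + 800-byte body

// Read-only mapping of a whole file; unmapped automatically on destruction.
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed");
        struct stat st;
        if (::fstat(fd, &st) != 0) {
            ::close(fd);
            throw std::runtime_error("fstat failed");
        }
        size_ = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
        data_ = static_cast<const unsigned char*>(p);
    }
    ~MappedFile() { ::munmap(const_cast<unsigned char*>(data_), size_); }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const unsigned char* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const unsigned char* data_ = nullptr;
    std::size_t size_ = 0;
};

// Writes the pages listed in copyOrder (old page indices) to `out`, one after
// another. dataStart is the byte offset where the data part begins.
void writePagesInOrder(const MappedFile& in, std::size_t dataStart,
                       const std::vector<std::uint32_t>& copyOrder, std::FILE* out) {
    for (const std::uint32_t oldIndex : copyOrder) {
        const std::size_t offset =
            dataStart + static_cast<std::size_t>(oldIndex) * kPageSize;
        if (offset + kPageSize > in.size())
            throw std::out_of_range("page lies beyond the end of the file");
        if (std::fwrite(in.data() + offset, 1, kPageSize, out) != kPageSize)
            throw std::runtime_error("write failed");
    }
}
```

With a mapping, each page access is a plain memory read and the operating system pages the file in on demand, which avoids a seek-and-read system call per 820-byte page. On Windows the equivalent would be built on `CreateFileMapping` and `MapViewOfFile` instead of `mmap`.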
