Writing/reading large vectors of data to binary file in C++

Question

I have a C++ program that computes populations within a given radius by reading gridded population data from an ASCII file into a large 8640x3432-element vector of doubles. Reading the ASCII data into the vector takes ~30 seconds (looping over each column and each row), while the rest of the program only takes a few seconds. I was asked to speed up this process by writing the population data to a binary file, which would supposedly read in faster.

The ASCII data file has a few header rows that give some data specs, such as the number of columns and rows, followed by population data for each grid cell, formatted as 3432 rows of 8640 numbers separated by spaces. The population values come in mixed formats: just 0, a decimal value (0.000685648), or a value in scientific notation (2.687768e-05).

I found a few examples of reading/writing structs containing vectors to binary, and tried to implement something similar, but am running into problems. When I both write and read the vector to/from the binary file in the same program run, it seems to work and gives me all the correct values, but then it ends with either "Segmentation fault: 11" or a memory allocation error saying that a "pointer being freed was not allocated". If I instead just read the data from a previously written binary file (without rewriting it in the same run), I get the header variables just fine but hit a segfault before getting any vector data.

Any advice on what I might have done wrong, or on a better way to do this, would be greatly appreciated! I am compiling and running on a Mac, and I don't have Boost or other non-standard libraries at present. (Note: I am extremely new at coding and am having to learn by jumping in the deep end, so I may be missing a lot of basic concepts and terminology. Sorry!)

Here is the code I came up with:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>

using namespace std;

//Define struct for population file data and initialize one struct variable for reading in ascii (A) and one for reading in binary (B)
struct popFileData
{
    int nRows, nCol;
    vector< vector<double> > popCount; //this will end up having 3432x8640 elements
} popDataA, popDataB;

int main() {

    string gridFname = "sample";

    double dum;
    vector<double> tempVector;

    //open ascii population grid file to stream
    ifstream gridFile;
    gridFile.open(gridFname + ".asc");

    int i = 0, j = 0;

    if (gridFile.is_open())
    {
        //read in header data from file
        string fileLine;
        gridFile >> fileLine >> popDataA.nCol;
        gridFile >> fileLine >> popDataA.nRows;

        popDataA.popCount.clear();

        //read in vector data, point-by-point
        for (i = 0; i < popDataA.nRows; i++)
        {
            tempVector.clear();

            for (j = 0; j<popDataA.nCol; j++)
            {
                gridFile >> dum;
                tempVector.push_back(dum);
            }
            popDataA.popCount.push_back(tempVector);
        }
        //close ascii grid file
        gridFile.close();
    }
    else
    {
        cout << "Population file read failed!" << endl;
    }

    //create/open binary file
    ofstream ofs(gridFname + ".bin", ios::trunc | ios::binary);
    if (ofs.is_open())
    {
        //write struct to binary file then close binary file
        ofs.write((char *)&popDataA, sizeof(popDataA));
        ofs.close();
    }
    else cout << "error writing to binary file" << endl;

    //read data from binary file into popDataB struct
    ifstream ifs(gridFname + ".bin", ios::binary);
    if (ifs.is_open())
    {
        ifs.read((char *)&popDataB, sizeof(popDataB));
        ifs.close();
    }
    else cout << "error reading from binary file" << endl;

    //compare results of reading in from the ascii file and reading in from the binary file
    cout << "File Header Values:\n";
    cout << "Columns (ascii vs binary): " << popDataA.nCol << " vs. " << popDataB.nCol << endl;
    cout << "Rows (ascii vs binary):" << popDataA.nRows << " vs." << popDataB.nRows << endl;

    cout << "Spot Check Vector Values: " << endl;
    cout << "Index 0,0: " << popDataA.popCount[0][0] << " vs. " << popDataB.popCount[0][0] << endl;
    cout << "Index 3431,8639: " << popDataA.popCount[3431][8639] << " vs. " << popDataB.popCount[3431][8639] << endl;
    cout << "Index 1600,4320: " << popDataA.popCount[1600][4320] << " vs. " << popDataB.popCount[1600][4320] << endl;

    return 0;
}

Here is the output when I both write and read the binary file in the same run:

File Header Values:
Columns (ascii vs binary): 8640 vs. 8640
Rows (ascii vs binary):3432 vs.3432
Spot Check Vector Values: 
Index 0,0: 0 vs. 0
Index 3431,8639: 0 vs. 0
Index 1600,4320: 25.2184 vs. 25.2184
a.out(11402,0x7fff77c25310) malloc: *** error for object 0x7fde9821c000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

And here is the output I get if I just try to read from the pre-existing binary file:

File Header Values:
Columns (binary): 8640
Rows (binary):3432
Spot Check Vector Values: 
Segmentation fault: 11

Thanks in advance for your help!

Recommended answer

When you write popDataA to the file, you are writing the binary representation of the struct itself: the two ints plus the vector-of-vectors object. That vector object is really quite small, consisting of a pointer to the actual data (itself a series of vectors, in this case) and some size information. None of the population values are written.

When it's read back in to popDataB, it kind of works! But only because the raw pointer that was in popDataA is now in popDataB, and it points to the same data in memory. Things go wrong at the end, because when the memory for the vectors is freed, the code tries to free the data referenced by popDataA twice (once for popDataA, and once again for popDataB). This also explains the segfault in the read-only run: the pointer stored in the file refers to memory of a process that no longer exists, so the first access to popDataB.popCount touches an invalid address.
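
To make the size argument concrete, here is a minimal sketch (the helper names bytesWrittenByRawDump and bytesOfGridData are illustrative, not from the original post). It shows that sizeof on the struct is a compile-time constant, so a raw write of sizeof(popDataA) bytes copies the same few bytes whether the grid is empty or full:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same layout as the struct in the question.
struct popFileData {
    int nRows, nCol;
    std::vector<std::vector<double>> popCount;
};

// Number of bytes that ofs.write((char *)&p, sizeof(p)) would copy.
// sizeof is evaluated at compile time, so it cannot depend on the contents.
std::size_t bytesWrittenByRawDump(const popFileData &p) {
    return sizeof(p);
}

// Number of bytes of actual grid values owned (indirectly) by the struct.
std::size_t bytesOfGridData(const popFileData &p) {
    std::size_t total = 0;
    for (const auto &row : p.popCount) {
        total += row.size() * sizeof(double);
    }
    return total;
}
```

For the full 3432 x 8640 grid, bytesOfGridData would report roughly 237 MB (assuming 8-byte doubles), while bytesWrittenByRawDump stays at a few dozen bytes on typical platforms.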

The short version is, it's not a reasonable thing to write a vector to a file in this fashion.

So what to do? The best approach is to first decide on your data representation. Like the ASCII format, it will specify what value gets written where, and it will include information about the matrix size, so that you know how large a vector you need to allocate when reading the values in.

In semi-pseudo code, writing will look something like:

int nrow=...;
int ncol=...;
ofs.write((char *)&nrow,sizeof(nrow));
ofs.write((char *)&ncol,sizeof(ncol));
for (int i=0;i<nrow;++i) {
    for (int j=0;j<ncol;++j) {
        double val=data[i][j];
        ofs.write((char *)&val,sizeof(val));
    }
}

and reading is the reverse:

ifs.read((char *)&nrow,sizeof(nrow));
ifs.read((char *)&ncol,sizeof(ncol));
// allocate data-structure of size nrow x ncol
// ...
for (int i=0;i<nrow;++i) {
    for (int j=0;j<ncol;++j) {
        double val;
        ifs.read((char *)&val,sizeof(val));
        data[i][j]=val;
    }
}
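
The two loops above can be turned into a small self-contained sketch. This is one possible implementation under stated assumptions, not the answer author's code: the function names writeGrid and readGrid and the Grid alias are hypothetical, the dimensions are stored as 32-bit integers, the grid is assumed rectangular, and each row is written in a single call through row.data() instead of one value at a time. The file is only portable between machines with the same byte order and double format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

using Grid = std::vector<std::vector<double>>;  // hypothetical alias

// Write the dimensions first, then each row's values as one contiguous chunk.
bool writeGrid(const std::string &path, const Grid &grid) {
    std::ofstream ofs(path, std::ios::binary | std::ios::trunc);
    if (!ofs) return false;
    const std::int32_t nrow = static_cast<std::int32_t>(grid.size());
    const std::int32_t ncol =
        grid.empty() ? 0 : static_cast<std::int32_t>(grid[0].size());
    ofs.write(reinterpret_cast<const char *>(&nrow), sizeof(nrow));
    ofs.write(reinterpret_cast<const char *>(&ncol), sizeof(ncol));
    for (const auto &row : grid) {
        assert(row.size() == static_cast<std::size_t>(ncol));  // rectangular grid assumed
        ofs.write(reinterpret_cast<const char *>(row.data()),
                  static_cast<std::streamsize>(row.size() * sizeof(double)));
    }
    ofs.flush();  // surface any buffered write error before reporting success
    return static_cast<bool>(ofs);
}

// Read the dimensions, allocate nrow x ncol, then fill each row in one call.
bool readGrid(const std::string &path, Grid &grid) {
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) return false;
    std::int32_t nrow = 0, ncol = 0;
    ifs.read(reinterpret_cast<char *>(&nrow), sizeof(nrow));
    ifs.read(reinterpret_cast<char *>(&ncol), sizeof(ncol));
    if (!ifs || nrow < 0 || ncol < 0) return false;
    grid.assign(static_cast<std::size_t>(nrow),
                std::vector<double>(static_cast<std::size_t>(ncol)));
    for (auto &row : grid) {
        ifs.read(reinterpret_cast<char *>(row.data()),
                 static_cast<std::streamsize>(row.size() * sizeof(double)));
    }
    return static_cast<bool>(ifs);  // false if the file was shorter than promised
}
```

With the struct from the question, a call such as writeGrid(gridFname + ".bin", popDataA.popCount) followed later by readGrid(gridFname + ".bin", popDataB.popCount) would replace the two raw struct dumps. Because the doubles are copied bit for bit, the values read back compare equal to the ones written.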

All that said though, you should consider not writing things into a binary file like this. These sorts of ad hoc binary formats tend to live on long past their anticipated utility, and tend to suffer from:

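  • Lack of documentation
  • Lack of extensibility
  • Format changes without versioning information
  • Issues when using saved data on different machines, including endianness problems, different default sizes for integers, etc.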
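
If you do stay with a hand-rolled format, some of these problems can be softened by a self-describing header. The following is only a sketch under assumptions: the magic string "POPG", the version number, the byte-order tag, and all function names are invented for illustration and are not part of any existing format. It uses fixed-width integers so the header does not depend on the platform's default int size, and the reader rejects files with an unexpected magic, byte order, or version instead of misreading them.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical header constants; the values are illustrative only.
const char kMagic[4] = {'P', 'O', 'P', 'G'};
const std::uint32_t kEndianTag = 0x01020304u;  // reads back differently on the other byte order
const std::uint32_t kVersion = 1;

void writeU32(std::ostream &os, std::uint32_t v) {
    os.write(reinterpret_cast<const char *>(&v), sizeof(v));
}

bool readU32(std::istream &is, std::uint32_t &v) {
    is.read(reinterpret_cast<char *>(&v), sizeof(v));
    return static_cast<bool>(is);
}

// Header layout: magic, byte-order tag, version, row count, column count.
void writeHeader(std::ostream &os, std::uint32_t nRows, std::uint32_t nCols) {
    os.write(kMagic, sizeof(kMagic));
    writeU32(os, kEndianTag);
    writeU32(os, kVersion);
    writeU32(os, nRows);
    writeU32(os, nCols);
}

// Returns false for a foreign file, a byte-order mismatch, or an unknown version.
bool readHeader(std::istream &is, std::uint32_t &nRows, std::uint32_t &nCols) {
    char magic[4] = {};
    is.read(magic, sizeof(magic));
    if (!is || std::memcmp(magic, kMagic, sizeof(kMagic)) != 0) return false;
    std::uint32_t tag = 0, version = 0;
    if (!readU32(is, tag) || tag != kEndianTag) return false;
    if (!readU32(is, version) || version != kVersion) return false;
    return readU32(is, nRows) && readU32(is, nCols);
}
```

The functions take std::ostream and std::istream, so they work with an std::ofstream or std::ifstream opened in binary mode as well as with an in-memory std::stringstream. This does not address documentation or extensibility, which is part of why an established library is still the safer choice.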

Instead, I would strongly recommend using a third-party library. For scientific data, HDF5 and netCDF-4 are good choices which address all of the above issues for you, and come with tools that can inspect the data without knowing anything about your particular program.

Lighter-weight options include the Boost serialization library and Google's protocol buffers, but these address only some of the issues listed above.
