Read in large CSV file performance issue in C++


Question

I need to read in many big CSV files (ranging from a few MB to hundreds of MB) to process in C++. At first I open each file with fstream, use getline to read each line, and use the following function to split each row:

template <class ContainerT>
void split(ContainerT& tokens, const std::string& str,
           const std::string& delimiters = " ", bool trimEmpty = false)
{
    std::string::size_type pos, lastPos = 0, length = str.length();

    using value_type = typename ContainerT::value_type;
    using size_type  = typename ContainerT::size_type;

    while (lastPos < length + 1)
    {
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == std::string::npos)
        {
            pos = length;
        }

        if (pos != lastPos || !trimEmpty)
            tokens.push_back(value_type(str.data() + lastPos,
                                        (size_type)pos - lastPos));

        lastPos = pos + 1;
    }
}

I tried boost::split, boost::tokenizer and boost::sprint and found the above gives the best performance so far. After that, I considered reading the whole file into memory to process rather than keeping the file open, and I used the following function to read in the whole file:

void ReadinFile(string const& filename, stringstream& result)
{
    ifstream ifs(filename, ios::binary | ios::ate);
    ifstream::pos_type pos = ifs.tellg();

    //result.resize(pos);
    char* buf = new char[pos];
    ifs.seekg(0, ios::beg);
    ifs.read(buf, pos);
    result.write(buf, pos);
    delete[] buf;
}

Both functions were copied from somewhere on the net. However, I find that there is not much difference in performance between keeping the file open and reading in the whole file. The performance figures are as follows:

Process 2100 files with boost::split (without read in whole file) 832 sec
Process 2100 files with custom split (without read in whole file) 311 sec
Process 2100 files with custom split (read in whole file) 342 sec

Below please find the sample content of one type of file. I have 6 types to handle, but all are similar.

a1,1,1,3.5,5,1,1,1,0,0,6,0,155,21,142,22,49,1,9,1,0,0,0,0,0,0,0
a1,10,2,5,5,1,1,2,0,0,12,0,50,18,106,33,100,29,45,9,8,0,1,1,0,0,0
a1,19,3,5,5,1,1,3,0,0,18,0,12,12,52,40,82,49,63,41,23,16,8,2,0,0,0
a1,28,4,5.5,5,1,1,4,0,0,24,0,2,3,17,16,53,53,63,62,43,44,18,22,4,0,4
a1,37,5,3,5,1,1,5,0,0,6,0,157,22,129,18,57,11,6,0,0,0,0,0,0,0,0
a1,46,6,4.5,5,1,1,6,0,0,12,0,41,19,121,31,90,34,37,15,6,4,0,2,0,0,0
a1,55,7,5.5,5,1,1,7,0,0,18,0,10,9,52,36,86,43,67,38,31,15,5,7,1,0,1
a1,64,8,5.5,5,1,1,8,0,0,24,0,0,3,18,23,44,55,72,57,55,43,8,19,1,2,3
a1,73,9,3.5,5,1,1,9,1,0,6,0,149,17,145,21,51,8,8,1,0,0,0,0,0,0,0
a1,82,10,4.5,5,1,1,10,1,0,12,0,47,17,115,35,96,36,32,10,8,3,1,0,0,0,0

My questions are:

1. Why does reading in the whole file perform worse than not reading in the whole file?

2. Is there any better string-split function?

3. The ReadinFile function needs to read into a buffer and then write to a stringstream for processing; is there any way to avoid this, i.e. read directly into the stringstream?

4. I need to use getline to parse each line (on '\n') and use split to tokenize each row; is there any function similar to getline that works on a string, e.g. a getline_str, so that I can read into a string directly?

5. How about reading the whole file into one string, splitting the whole string into a vector on '\n', and then splitting each string in the vector on ',' to process? Would this perform better? And what is the limit (max size) of a string?

6. Or should I define a struct like this (based on the format)

struct MyStruct {
  string Item1;
  int It2_3[2];
  float It4;
  int ItRemain[23];
};

and read directly into a vector? How would I do this?

Thanks a lot.

Regards,

林志峰

Answer

Whenever you have to care about performance, it's good to try alternatives and measure their performance. Below is some help implementing one of the options you ask about in your question....

Given each structure you want to read, such as your example...

struct MyStruct {
  string Item1;
  int It2_3[2];
  float It4;
  int ItRemain[23];
};

...you can read and parse the fields using fscanf. Unfortunately, it's a C library function that doesn't support std::string, so you'll need to create a character-array buffer for each string field and then copy from there to your structure's field. All up, something like:

char Item1[4096];
MyStruct m;
std::vector<MyStruct> myStructs;
FILE* stream = fopen(filename, "r");
assert(stream);
// note: "%s" only stops at whitespace, not at commas, so a scanset like
// " %4095[^,]" is needed to read the comma-separated text field
while (fscanf(stream, " %4095[^,],%d,%d,%f,%d,%d,%d,%d...",
              Item1, &m.It2_3[0], &m.It2_3[1], &m.It4,
              &m.ItRemain[0], &m.ItRemain[1], &m.ItRemain[2], ...) == 27)
{
    myStructs.push_back(m);
    myStructs.back().Item1 = Item1;  // fix the std::strings
}
fclose(stream);

(Just put the right number of %ds in the format string and complete the other ItRemain indices.)

Separately, I'm reluctant to recommend it as it's more advanced programming you may struggle with, but memory-mapping the file and writing your own parsing has a good chance of being several times faster than the fscanf approach above (but again, you won't know until it's measured on your hardware). If you're a scientist trying to do something serious, maybe pair with a professional programmer to get this done for you.

