very fast text file processing (C++)


Problem description



I wrote an application which processes data on the GPU. The code works well, but reading the input file (~3GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the line-by-line processing is slow.)

I read a line with getline(), copy line 1 to a vector and line 2 to another vector, and skip lines 3 and 4, and so on for the rest of the 11 million lines.

I tried several approaches to read the file as fast as possible:

The fastest method I found is using boost::iostreams::stream.

Others were:

  • Reading the file as gzip, to minimize IO, but this is slower than reading it directly.
  • Copying the file to RAM with read(filepointer, chararray, length) and processing it with a loop to split the lines (also slower than boost).

Any suggestions on how to make it run faster?

void readfastq(char *filename, int SRlength, uint32_t blocksize){
    _filelength = 0; //total datasets (each 4 lines)
    _SRlength = SRlength; //length of the 2nd line (the sequence)
    _blocksize = blocksize;

    boost::iostreams::stream<boost::iostreams::file_source> ins(filename);
    in = ins; //'in' is a class-member stream used by readNextBlock()

    readNextBlock();
}


void readNextBlock() {
    timeval start, end;
    gettimeofday(&start, 0);

    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    _seqs.clear(); //reset the containers for the next block
    _phred.clear();
    _names.clear();
    _filelength = 0;

    //read only a part of the file, i.e. the first 4 million lines
    while (std::getline(in, name) && _filelength < _blocksize) {
        std::getline(in, seqtemp);
        std::getline(in, garbage);
        std::getline(in, phredtemp);

        if (seqtemp.size() != _SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);

            for (int k = 0; k < _SRlength; k++) {
                //handle special letters
                if (seqtemp[k] == 'A') ...
                else {
                    _seqs.push_back(5);
                }
            }
            _filelength++;
        }
    }
}

EDIT:

The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2

Because of some pointer problems, I changed the function readfastq so that it reads the whole file. So if you call readfastq, the blocksize (in lines) must be bigger than the number of lines to read.

SOLUTION:

I found a solution which got the file read-in time from 60 seconds down to 16 seconds. I removed the inner loop which handles the special characters and do this on the GPU instead. This decreases the read-in time and increases the GPU running time only minimally.

Thanks for your suggestions.

void readfastq(char *filename, int SRlength) {
    _filelength = 0;
    _SRlength = SRlength;

    size_t bytes_expected;

    FILE *fp;
    fp = fopen(filename, "r");

    fseek(fp, 0L, SEEK_END); //go to the end of the file
    bytes_expected = ftell(fp); //get the file size
    fseek(fp, 0L, SEEK_SET); //go to the beginning of the file

    fclose(fp);

    if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for the sequence lines (roughly half the file)
        err(EX_OSERR, "data malloc");


    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    boost::iostreams::stream<boost::iostreams::file_source> file(filename);


    while (std::getline(file, name)) {
        std::getline(file, seqtemp);
        std::getline(file, garbage);
        std::getline(file, phredtemp);

        if (seqtemp.size() != SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);

            strncpy(&(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do it on the GPU

            _filelength++;
        }
    }
}
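
For reference, the per-base conversion that was moved to the GPU could look like the following CUDA C++ kernel sketch; the kernel name and the letter-to-code mapping are illustrative assumptions, with only the code 5 for unknown letters taken from the original CPU loop:

__global__ void encode_bases(const char *seq, unsigned char *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    switch (seq[i]) { //map each base letter to a small integer code
    case 'A': out[i] = 0; break;
    case 'C': out[i] = 1; break;
    case 'G': out[i] = 2; break;
    case 'T': out[i] = 3; break;
    default:  out[i] = 5; break; //unknown letter, as in the original CPU loop
    }
}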

Answer:

First, instead of reading the file into memory, you may work with file mappings. You just have to build your program as 64-bit so that 3GB fits into the virtual address space (for a 32-bit application only 2GB is accessible in user mode). Alternatively, you may map and process your file in parts.
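
A minimal POSIX sketch of such a mapping (the helper name map_whole_file and the reduced error handling are illustrative; on Windows the equivalent would be CreateFileMapping/MapViewOfFile):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const char *map_whole_file(const char *filename, size_t &len) {
    int fd = open(filename, O_RDONLY);
    if (fd == -1)
        return 0;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return 0;
    }
    len = (size_t) st.st_size;

    //read-only private mapping: the kernel pages the file in on demand
    //instead of copying 3GB up front
    void *p = mmap(0, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); //the mapping stays valid after the descriptor is closed
    return (p == MAP_FAILED) ? 0 : (const char *) p;
}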

Next, it sounds to me that your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hurts performance very seriously. If this is the case, either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reuse/clear vectors they do not free their memory.
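
A minimal sketch of that advice (the function name and the reserve() figure are illustrative, sized for the ~11-million-line file; the point is that the reused std::string and the pre-reserved vector keep the hot loop free of heap allocations in the common case):

#include <fstream>
#include <string>
#include <vector>

void read_names_only(const char *filename, std::vector<std::string> &names) {
    std::ifstream in(filename);
    std::string line;       //declared once and reused every iteration
    names.reserve(3000000); //~one record per 4 lines of an 11-million-line file

    while (std::getline(in, line)) { //getline overwrites, keeping the string's capacity
        names.push_back(line);       //still a copy; see the string critique below
        std::getline(in, line);      //skip line 2
        std::getline(in, line);      //skip line 3
        std::getline(in, line);      //skip line 4
    }
}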

Post your code (or a part of it) for more suggestions.

EDIT:

It seems that all your bottlenecks are related to string management.

  • std::getline(in, seqtemp); reading into a std::string involves dynamic memory allocation.
  • _names.push_back(name); This is even worse. First, the std::string is placed into the vector by value, which means the string is copied, so another dynamic allocation/deallocation happens. Moreover, when the vector is eventually reallocated internally, all the contained strings are copied again, with all the consequences.

I recommend using neither the standard formatted file I/O functions (stdio/STL) nor std::string. To achieve better performance you should work with pointers into the text (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines) yourself.

Like in this code:

#include <cstddef>
#include <cstring> //memchr

class MemoryMappedFileParser
{
    const char* m_sz; //current read position in the mapped file
    size_t m_Len;     //bytes remaining from m_sz

public:

    //construct over an already-mapped buffer
    MemoryMappedFileParser(const char* data, size_t len)
        : m_sz(data), m_Len(len) {}

    //non-owning string: a pointer and a length into the mapping, no copy
    struct String {
        const char* m_sz;
        size_t m_Len;
    };

    bool getline(String& out)
    {
        out.m_sz = m_sz;

        const char* sz = (const char*) memchr(m_sz, '\n', m_Len);
        if (sz)
        {
            size_t len = sz - m_sz;

            m_sz = sz + 1;
            m_Len -= (len + 1);

            out.m_Len = len;

            // for Windows-format text files remove the '\r' as well
            if (len && '\r' == out.m_sz[len-1])
                out.m_Len--;
        } else
        {
            //last line without a trailing newline
            out.m_Len = m_Len;

            if (!m_Len)
                return false;

            m_Len = 0;
        }

        return true;
    }

};
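
A hypothetical usage sketch combining the parser with the map_whole_file helper from above (the constructor taking the mapped buffer and its length, as well as the file name, are illustrative assumptions):

#include <cstdio>

int main() {
    size_t len = 0;
    const char *data = map_whole_file("reads.fastq", len);
    if (!data)
        return 1;

    MemoryMappedFileParser parser(data, len);
    MemoryMappedFileParser::String line;
    size_t lines = 0;

    while (parser.getline(line)) //each String only points into the mapping
        ++lines;

    printf("%zu lines\n", lines);
    return 0;
}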
