Parse stream of characters


Question

Say I have a file that looks something like this:

*SP "<something>"
*VER "<something>"

*NAME_MAP
*1 abc
*2 def
...
...

*D_NET *1 <some_value>
*CONN
<whatever>
<whatever>
*CAP
*1:1 *2:2 <whatever_value>
*1:3 *2:4 <whatever_value>
*RES
<whatever>
<whatever>

Let me describe the file before I describe my problem. The file starts with some header lines. The NAME_MAP section maps each ID to a name. That ID is then used everywhere later in the file in place of the corresponding name.

The D_NET section has three subsections: CONN, CAP, and RES.

I need to gather some data from this file, all of it related to D_NET. First, from the line

*D_NET *1 <some_value>

I need the mapping of *1, which in this case would be abc.

Second, I need the CAP subsection of the D_NET section. Whatever is in the CAP subsection, I need it.

In the end, my data would look like a hash:

*1 -> *1, *2 (the raw IDs, shown only to illustrate)
abc -> abc, def (this is what I want)

I hope that is clear so far.

Since the file is huge (multiple GB), I figured that the best way to read it is to map it into memory. I did that using mmap, like this:

char* data = (char*)mmap(0, file.st_size, PROT_READ, MAP_PRIVATE, fileno(file), 0);

So the data that mmap points to is just a stream of characters. Now I need to extract the data described above from it.

To solve this, I think I could first use a tokenizer (boost/tokenizer?) to split on the newline character and then parse those lines to get the desired data. Does that sound reasonable? If not, what would you suggest instead?

How would you suggest doing it? I am open to any fast algorithm.

Recommended answer

I got curious about the performance gain you hope to get from mmap, so I put together two tests that read files from my media library (treating them as text files): one using the getline approach and one using mmap. The input was:

files: 2012
lines: 135371784
bytes: 33501265769 (31 GiB)

First, a helper class used in both tests to read the list of files:

filelist.hpp

#pragma once

#include <fstream>
#include <iterator>
#include <string>
#include <utility>   // std::move
#include <vector>

class Filelist {
    std::vector<std::string> m_strings;
public:
    Filelist(const std::string& file) :
        m_strings()
    {
        std::ifstream is(file);
        for(std::string line; std::getline(is, line);) {
            m_strings.emplace_back(std::move(line));
        }
        /*
        std::copy(
            std::istream_iterator<std::string>(is),
            std::istream_iterator<std::string>(),
            std::back_inserter(m_strings)
        );
        */
    }

    operator std::vector<std::string> () { return m_strings; }
};

getline.cpp

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <iomanip>
#include "filelist.hpp"

int main(int argc, char* argv[]) {
    std::vector<std::string> args(argv+1, argv+argc);
    if(args.size()==0) {
        Filelist tmp("all_files");
        args = tmp;
    }

    unsigned long long total_lines=0;
    unsigned long long total_bytes=0;

    for(const auto& file : args) {
        std::ifstream is(file);
        if(is) {
            unsigned long long lco=0;
            unsigned long long bco=0;
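            bool is_good=false;
            for(std::string line; std::getline(is, line); lco+=is_good) {
                // good() is false when getline hit EOF before finding '\n',
                // so only newline-terminated lines (and their '\n' byte) are counted
                is_good = is.good();
                bco += line.size() + is_good;
                // parse here
            }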
            std::cout << std::setw(15) << lco << " " << file << "\n";
            total_lines += lco;
            total_bytes += bco;
        }
    }
    std::cout << "files processed: " << args.size() << "\n";
    std::cout << "lines processed: " << total_lines << "\n";
    std::cout << "bytes processed: " << total_bytes << "\n";
}

getline results:

files processed: 2012
lines processed: 135371784
bytes processed: 33501265769

real    2m6.096s
user    0m23.586s
sys     0m20.560s

mmap.cpp

#include <iostream>
#include <fstream>
#include <iomanip>
#include <stdexcept>   // std::runtime_error
#include <string>
#include <utility>     // std::move
#include <vector>
#include "filelist.hpp"

#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

class File {
    int m_fileno;
public:
    File(const std::string& filename) :
        // O_CLOEXEC is a flag, so it must be OR-ed into the flags argument, not passed as the mode
        m_fileno(open(filename.c_str(), O_RDONLY | O_CLOEXEC))
    {
        if(m_fileno==-1)
            throw std::runtime_error("could not open file");
    }
    File(const File&) = delete;
    File(File&& other) :
        m_fileno(other.m_fileno)
    {
        other.m_fileno = -1;
    }
    File& operator=(const File&) = delete;
    File& operator=(File&& other) {
        if(m_fileno!=-1) close(m_fileno);
        m_fileno = other.m_fileno;
        other.m_fileno = -1;
        return *this;
    }
    ~File() {
        if(m_fileno!=-1) close(m_fileno);
    }
    operator int () { return m_fileno; }
};

class Mmap {
    File m_file;
    struct stat m_statbuf;
    char* m_data;
    const char* m_end;

    struct stat pstat(int fd) {
        struct stat rv;
        if(fstat(fd, &rv)==-1)
            throw std::runtime_error("stat failed");
        return rv;
    }
public:
    Mmap(const Mmap&) = delete;
    Mmap(Mmap&& other) :
        m_file(std::move(other.m_file)),
        m_statbuf(std::move(other.m_statbuf)),
        m_data(other.m_data),
        m_end(other.m_end)
    {
        other.m_data = nullptr;
    }
    Mmap& operator=(const Mmap&) = delete;
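    Mmap& operator=(Mmap&& other) {
        if(this != &other) {
            // release the current mapping before taking over other's
            if(m_data!=nullptr) munmap(m_data, m_statbuf.st_size);
            m_file = std::move(other.m_file);
            m_statbuf = other.m_statbuf;
            m_data = other.m_data;
            m_end = other.m_end;
            other.m_data = nullptr;
        }
        return *this;
    }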

    Mmap(const std::string& filename) :
        m_file(filename),
        m_statbuf(pstat(m_file)),
        m_data(reinterpret_cast<char*>(mmap(0, m_statbuf.st_size, PROT_READ, MAP_PRIVATE, m_file, 0))),
        m_end(nullptr)
    {
        if(m_data==MAP_FAILED)
            throw std::runtime_error("mmap failed");
        m_end = m_data+size();
    }
    ~Mmap() {
        if(m_data!=nullptr)
            munmap(m_data, m_statbuf.st_size);
    }

    inline size_t size() const { return m_statbuf.st_size; }
    operator const char* () { return m_data; }

    inline const char* cbegin() const { return m_data; }
    inline const char* cend() const { return m_end; }
    inline const char* begin() const { return cbegin(); }
    inline const char* end() const { return cend(); }
};

int main(int argc, char* argv[]) {
    std::vector<std::string> args(argv+1, argv+argc);
    if(args.size()==0) {
        Filelist tmp("all_files");
        args = tmp;
    }

    unsigned long long total_lines=0;
    unsigned long long total_bytes=0;

    for(const auto& file : args) {
        try {
            unsigned long long lco=0;
            unsigned long long bco=0;
            Mmap im(file);
            for(auto ch : im) {
                if(ch=='\n') ++lco;
                ++bco;
            }
            std::cout << std::setw(15) << lco << " " << file << "\n";
            total_lines += lco;
            total_bytes += bco;
        } catch(const std::exception& ex) {
            std::clog << "Exception: " << file << " " << ex.what() << "\n";
        }
    }
    std::cout << "files processed: " << args.size() << "\n";
    std::cout << "lines processed: " << total_lines << "\n";
    std::cout << "bytes processed: " << total_bytes << "\n";
}

mmap results:

files processed: 2012
lines processed: 135371784
bytes processed: 33501265769

real    2m8.289s
user    0m51.862s
sys     0m12.335s

I ran the tests right after each other, like this:

% ./mmap
% time ./getline
% time ./mmap

... and they got very similar results. If I were in your shoes, I'd go for the simple getline solution first and get the parsing logic for that mapping in place. If that later feels too slow, try mmap, if you can find a way to make it more effective than I did.
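To make the "parse here" step concrete, here is a rough sketch of what the getline-based logic could look like for the file layout in the question. It is not a full parser: `parse_nets` and `NetCaps` are names made up for this example, and it assumes that NAME_MAP comes before any *D_NET line, that every section keyword sits at the start of its own line, and that only the `*id:pin` tokens on a CAP line matter.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// net name -> names referenced in that net's *CAP subsection, in first-seen order
using NetCaps = std::unordered_map<std::string, std::vector<std::string>>;

NetCaps parse_nets(std::istream& in) {
    enum class Section { None, NameMap, Net, Cap };
    Section sec = Section::None;
    std::unordered_map<std::string, std::string> names;  // "*1" -> "abc"
    NetCaps result;
    std::vector<std::string>* current = nullptr;          // CAP list of the open *D_NET

    // Record the name behind a token such as "*2:4"; unknown tokens are ignored.
    auto add_cap_token = [&](const std::string& tok) {
        auto it = names.find(tok.substr(0, tok.find(':')));
        if (it == names.end()) return;
        if (std::find(current->begin(), current->end(), it->second) == current->end())
            current->push_back(it->second);
    };

    for (std::string line; std::getline(in, line);) {
        std::istringstream ls(line);
        std::string first;
        if (!(ls >> first)) continue;  // blank line

        if (first == "*NAME_MAP") {
            sec = Section::NameMap;
        } else if (first == "*D_NET") {
            std::string id;
            ls >> id;
            auto it = names.find(id);
            // Fall back to the raw ID if NAME_MAP has no entry for it.
            current = &result[it != names.end() ? it->second : id];
            sec = Section::Net;
        } else if (first == "*CAP") {
            sec = current ? Section::Cap : Section::None;
        } else if (first == "*CONN" || first == "*RES") {
            sec = current ? Section::Net : Section::None;
        } else if (sec == Section::NameMap) {
            std::string name;
            if (ls >> name) names[first] = name;  // e.g. "*1 abc"
        } else if (sec == Section::Cap) {
            add_cap_token(first);
            for (std::string tok; ls >> tok;) add_cap_token(tok);
        }
    }
    return result;
}
```

Passing a `std::ifstream` opened on the sample file from the question gives a single entry, abc -> abc, def. The duplicate check is a linear search, which should be fine for short CAP lists; a net with many distinct names might want a set alongside the vector.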

Disclaimer: I don't have much experience with mmap, so perhaps I've used it wrong and missed the performance it can deliver when parsing text files.

Update: I concatenated all the files into one 31 GiB file and ran the tests again. The result was a bit surprising, and I feel that I'm missing something.

getline results:

files processed: 1
lines processed: 135371784
bytes processed: 33501265769

real    2m1.104s
user    0m22.274s
sys     0m19.860s

mmap results:

files processed: 1
lines processed: 135371784
bytes processed: 33501265769

real    2m22.500s
user    0m50.183s
sys     0m13.124s
