Parse stream of characters
Question
Say I have a file something like this:
*SP "<something>"
*VER "<something>"
*NAME_MAP
*1 abc
*2 def
...
...
*D_NET *1 <some_value>
*CONN
<whatever>
<whatever>
*CAP
*1:1 *2:2 <whatever_value>
*1:3 *2:4 <whatever_value>
*RES
<whatever>
<whatever>
Let me describe the file before I start describing my problem. The file starts with some header notes. The NAME_MAP section holds the mapping between each name and the id given to it. That id is then used everywhere later in place of the corresponding name.
The D_NET section has 3 subsections: CONN, CAP and RES.
I need to gather some data from this file. The data I need is related to D_NET. I need
*D_NET *1 <some_value>
the mapping of *1 from this line, which in this case would be abc.
The second thing I need is the CAP section of the D_NET section. Whatever is in the CAP section, I need it.
Finally, my data would look like a hash:
*1 -> *1, *2 (in this case, just to make you understand) abc -> abc, def (this is what I want)
Hope I am clear so far.
Since the file is huge (multiple GiB), I have figured that the best way to read the file is by mapping it into memory. I did that using mmap, like this:
char* data = (char*)mmap(0, file.st_size, PROT_READ, MAP_PRIVATE, fileno(file), 0);
So the data pointed to by mmap is just a stream of characters. Now I need to get the above-mentioned data out of it.
To solve this, I think I could first use some tokenizer (boost/tokenizer?) to split on newline characters and then parse those lines to get the desired data. Does anyone agree with that? If not, what else would you suggest?
How would you suggest doing it? I am open to any fast algorithm.
Answer
I got curious about the performance gain you hope for by using mmap, so I put together two tests, reading files from my media library (treating them as text files). One uses the getline approach and one uses mmap. The input was:
files: 2012
lines: 135371784
bytes: 33501265769 (31 GiB)
First, a helper class used in both tests to read the list of files:
filelist.hpp
#pragma once

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

class Filelist {
    std::vector<std::string> m_strings;
public:
    Filelist(const std::string& file) :
        m_strings()
    {
        std::ifstream is(file);
        for(std::string line; std::getline(is, line);) {
            m_strings.emplace_back(std::move(line));
        }
        /*
        std::copy(
            std::istream_iterator<std::string>(is),
            std::istream_iterator<std::string>(),
            std::back_inserter(m_strings)
        );
        */
    }
    operator std::vector<std::string> () { return m_strings; }
};
getline.cpp
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <iomanip>
#include "filelist.hpp"
int main(int argc, char* argv[]) {
    std::vector<std::string> args(argv+1, argv+argc);
    if(args.size()==0) {
        Filelist tmp("all_files");
        args = tmp;
    }
    unsigned long long total_lines=0;
    unsigned long long total_bytes=0;
    for(const auto& file : args) {
        std::ifstream is(file);
        if(is) {
            unsigned long long lco=0;
            unsigned long long bco=0;
            bool is_good=false;
            for(std::string line; std::getline(is, line); lco+=is_good) {
                is_good = is.good();
                bco += line.size() + is_good;
                // parse here
            }
            std::cout << std::setw(15) << lco << " " << file << "\n";
            total_lines += lco;
            total_bytes += bco;
        }
    }
    std::cout << "files processed: " << args.size() << "\n";
    std::cout << "lines processed: " << total_lines << "\n";
    std::cout << "bytes processed: " << total_bytes << "\n";
}
getline results:
files processed: 2012
lines processed: 135371784
bytes processed: 33501265769
real 2m6.096s
user 0m23.586s
sys 0m20.560s
mmap.cpp
#include <iostream>
#include <fstream>
#include <vector>
#include <iomanip>
#include <stdexcept>
#include "filelist.hpp"
#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
class File {
    int m_fileno;
public:
    File(const std::string& filename) :
        m_fileno(open(filename.c_str(), O_RDONLY | O_CLOEXEC))
    {
        if(m_fileno==-1)
            throw std::runtime_error("could not open file");
    }
    File(const File&) = delete;
    File(File&& other) :
        m_fileno(other.m_fileno)
    {
        other.m_fileno = -1;
    }
    File& operator=(const File&) = delete;
    File& operator=(File&& other) {
        if(m_fileno!=-1) close(m_fileno);
        m_fileno = other.m_fileno;
        other.m_fileno = -1;
        return *this;
    }
    ~File() {
        if(m_fileno!=-1) close(m_fileno);
    }
    operator int () { return m_fileno; }
};
class Mmap {
    File m_file;
    struct stat m_statbuf;
    char* m_data;
    const char* m_end;
    struct stat pstat(int fd) {
        struct stat rv;
        if(fstat(fd, &rv)==-1)
            throw std::runtime_error("stat failed");
        return rv;
    }
public:
    Mmap(const Mmap&) = delete;
    Mmap(Mmap&& other) :
        m_file(std::move(other.m_file)),
        m_statbuf(std::move(other.m_statbuf)),
        m_data(other.m_data),
        m_end(other.m_end)
    {
        other.m_data = nullptr;
    }
    Mmap& operator=(const Mmap&) = delete;
    Mmap& operator=(Mmap&& other) {
        if(m_data!=nullptr)
            munmap(m_data, m_statbuf.st_size);
        m_file = std::move(other.m_file);
        m_statbuf = std::move(other.m_statbuf);
        m_data = other.m_data;
        m_end = other.m_end;
        other.m_data = nullptr;
        return *this;
    }
    Mmap(const std::string& filename) :
        m_file(filename),
        m_statbuf(pstat(m_file)),
        m_data(reinterpret_cast<char*>(mmap(0, m_statbuf.st_size, PROT_READ, MAP_PRIVATE, m_file, 0))),
        m_end(nullptr)
    {
        if(m_data==MAP_FAILED)
            throw std::runtime_error("mmap failed");
        m_end = m_data + size();
    }
    ~Mmap() {
        if(m_data!=nullptr)
            munmap(m_data, m_statbuf.st_size);
    }
    inline size_t size() const { return m_statbuf.st_size; }
    operator const char* () { return m_data; }
    inline const char* cbegin() const { return m_data; }
    inline const char* cend() const { return m_end; }
    inline const char* begin() const { return cbegin(); }
    inline const char* end() const { return cend(); }
};
int main(int argc, char* argv[]) {
    std::vector<std::string> args(argv+1, argv+argc);
    if(args.size()==0) {
        Filelist tmp("all_files");
        args = tmp;
    }
    unsigned long long total_lines=0;
    unsigned long long total_bytes=0;
    for(const auto& file : args) {
        try {
            unsigned long long lco=0;
            unsigned long long bco=0;
            Mmap im(file);
            for(auto ch : im) {
                if(ch=='\n') ++lco;
                ++bco;
            }
            std::cout << std::setw(15) << lco << " " << file << "\n";
            total_lines += lco;
            total_bytes += bco;
        } catch(const std::exception& ex) {
            std::clog << "Exception: " << file << " " << ex.what() << "\n";
        }
    }
    std::cout << "files processed: " << args.size() << "\n";
    std::cout << "lines processed: " << total_lines << "\n";
    std::cout << "bytes processed: " << total_bytes << "\n";
}
mmap results:
files processed: 2012
lines processed: 135371784
bytes processed: 33501265769
real 2m8.289s
user 0m51.862s
sys 0m12.335s
I ran the tests right after each other, like this:
% ./mmap
% time ./getline
% time ./mmap
... and they got very similar results. If I were in your shoes, I'd go for the simple getline solution first and try to get the logic in place with that mapping you've got going. If that later feels too slow, go for mmap if you can find some way to make it more effective than I did.
Disclaimer: I don't have much experience with mmap, so perhaps I've used it wrong to get the performance it can deliver parsing through text files.
Update: I concatenated all the files into one 31 GiB file and ran the tests again. The result was a bit surprising and I feel that I'm missing something.
getline results:
files processed: 1
lines processed: 135371784
bytes processed: 33501265769
real 2m1.104s
user 0m22.274s
sys 0m19.860s
mmap results:
files processed: 1
lines processed: 135371784
bytes processed: 33501265769
real 2m22.500s
user 0m50.183s
sys 0m13.124s