boost::tokenizer vs boost::split

Views: 83 | Published: 2016/8/12 17:23:21 | Tags: c++, boost


Question

I am trying to split a C++ string on every '^' character into a vector of tokens. I have always used boost::split, but I am now writing performance-critical code and would like to know whether boost::split or boost::tokenizer gives better performance.

For example:

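string message = "A^B^C^D";

// Option 1: boost::split copies each token into a vector.
vector<string> tokens;
boost::split(tokens, message, boost::is_any_of("^"));

// Option 2: boost::tokenizer produces tokens lazily while iterating.
boost::char_separator<char> sep("^");
boost::tokenizer<boost::char_separator<char> > tok(message, sep);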

Which one would give better performance, and why?

Thanks.

Recommended answer

The best choice depends on a few factors. If you only need to scan the tokens once, then boost::tokenizer is a good choice for both runtime and space performance (those vectors of tokens can take up a lot of space, depending on the input data).

If you are going to scan the tokens often, or need a vector with efficient random access, then using boost::split to fill a vector may be the better option.

For example, with your "A^B^C^...^Z" input string, where each token is 1 byte long, the boost::split/vector<string> method will consume at least 2*N-1 bytes (the size of an input with N tokens and N-1 separators). Given the way strings are stored in most STL implementations, you can expect it to take more than 8x that count. Storing these strings in a vector is costly in terms of both memory and time.
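The arithmetic behind that estimate can be sketched as follows. Both function names are invented for this illustration, and the second one is only a lower bound: it counts one std::string object per token and ignores heap blocks, spare vector capacity, and allocator overhead.

```cpp
#include <cstddef>
#include <string>

// Bytes in an input made of n one-byte tokens joined by '^':
// n token bytes plus (n - 1) separator bytes.
constexpr std::size_t input_bytes(std::size_t n) {
    return n == 0 ? 0 : 2 * n - 1;
}

// Lower bound on what a std::vector<std::string> holding n tokens occupies:
// one std::string object per token, nothing else counted.
constexpr std::size_t vector_of_strings_min_bytes(std::size_t n) {
    return n * sizeof(std::string);
}
```

The value of sizeof(std::string) is implementation-defined; on common 64-bit standard libraries it is typically 24 or 32 bytes, so each 1-byte token (about 2 bytes of input) costs an order of magnitude more once it is stored as its own string.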

I ran a quick test on my machine, and a similar pattern with 10 million tokens looked like this:

boost::split = 2.5s and ~620MB
boost::tokenizer = 0.9s and 0MB

If you are just doing a one-time scan of the tokens, then the tokenizer is clearly better. But if you are shredding the input into a structure that you want to reuse during the lifetime of your application, then a vector of tokens may be preferable.

If you want to go the vector route, then I would recommend using not a vector<string> but a vector of pairs of string::const_iterators. Just shred the input into iterator pairs and keep your big string of tokens around for them to refer to. For example:

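using namespace std;
// Each pair holds [begin, end) iterators into s, so s must outlive tokens and stay unmodified.
vector<pair<string::const_iterator,string::const_iterator> > tokens;
boost::split(tokens, s, boost::is_any_of("^"));
for(auto beg=tokens.begin(); beg!=tokens.end();++beg){
   // Build a std::string only at the point of use.
   cout << string(beg->first,beg->second) << endl;
}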

This improved version takes 1.6s and 390MB on the same server and test. Best of all, the memory overhead of this vector is linear in the number of tokens and does not depend in any way on the length of the tokens, whereas a std::vector<string> stores a copy of each token.
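To make the storage layout concrete, here is a minimal standard-library-only sketch of the same idea. It is a hand-written loop, not boost::split, and the names TokenRange, split_ranges, and token_at are invented for the example. Each token costs two iterators regardless of its length, and a std::string is built only when a token is actually read.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string::const_iterator, std::string::const_iterator> TokenRange;

// Splits `s` on `delim` into [begin, end) iterator pairs that point into `s`.
// Empty tokens are kept. `s` must outlive the result and must not be modified
// while the result is in use, otherwise the stored iterators are invalidated.
std::vector<TokenRange> split_ranges(const std::string& s, char delim) {
    std::vector<TokenRange> out;
    std::string::const_iterator start = s.begin();
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
        if (*it == delim) {
            out.push_back(TokenRange(start, it));
            start = it + 1;
        }
    }
    out.push_back(TokenRange(start, s.end()));  // final token (may be empty)
    return out;
}

// Materializes token `i` as a std::string only when it is needed.
std::string token_at(const std::vector<TokenRange>& tokens, std::size_t i) {
    return std::string(tokens[i].first, tokens[i].second);
}
```

Because the ranges point into the original string, this only works when that string is a named object that stays alive; passing a temporary to split_ranges would leave every stored iterator dangling.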
