快速字符串分割与多个分隔符 [英] Fast string splitting with multiple delimiters
问题描述
我这里考察在计算器上一段时间才能找到好的算法来与多个分隔符字符串分割成矢量<字符串>
。我也发现了一些方法:
I investigated some time here on StackOverflow to find good algorithms to split strings with multiple delimiters into a vector< string >
. I also found some methods:
升压方式:
boost::split(vector, string, boost::is_any_of(" \t"));
在函数getline
方法:
std::stringstream ss(string);
std::string item;
while(std::getline(ss, item, ' ')) {
vector.push_back(item);
}
升压的记号化方式:
the tokenize way of Boost:
char_separator<char> sep(" \t");
tokenizer<char_separator<char>> tokens(string, sep);
BOOST_FOREACH(string t, tokens)
{
vector.push_back(t);
}
和凉爽的STL方式:
istringstream iss(string);
copy(istream_iterator<string>(iss),
istream_iterator<string>(),
back_inserter<vector<string> >(vector));
和Shadow2531的方法(见链接的主题)。
and the method of Shadow2531 (see the linked topic).
他们大多来自<a href=\"http://stackoverflow.com/questions/236129/how-to-split-a-string\">这个话题。但遗憾的是没有解决我的问题:
Most of them came from this topic. But they unfortunately don't solve my problem:
-
Boost的分裂是易于使用,但与大数据(在最好的情况下约1.5 * 10 ^ 6单个元素)和大约10分隔符,我使用它是可怕的慢。
Boost's split is easy to use but with the big data (about 1.5*10^6 single elements in best cases) and about 10 delimiters I am using it's horrific slow.
的函数getline
,STL和Shadow2531的方法有,我只能用一个单个字符作为分隔符的问题。我需要多几个。
The getline
, STL and Shadow2531's method have the problem that I can only use one single char as delimiter. I need a few more.
Boost的记号化,即使是在速度方面更可怕。它用了11秒10的分隔符来分割字符串转换成1.5 * 10 ^ 6个元素。
Boost's tokenize is even more horrific in the aspect of speed. It took 11 seconds with 10 delimiters to split a string into 1.5*10^6 elements.
所以我不知道该怎么做:我想有多个分隔符真快字符串分割算法。
So I don't know what to do: I want to have a really fast string splitting algorithm with multiple delimiters.
时Boost的分裂的最大还是有办法做到这一点的更快
Is Boost's split the maximum or is there a way to do it faster?
推荐答案
两件事情浮现在脑海中:
Two things come to mind:
- 使用字符串的意见,而不是串
作为分割结果,节省了大量的
分配。 - 如果你知道你是唯一
将与字符合作(以
在[0,255]范围内),请尝试使用
比特集测试成员,而不是找到
ING进入分隔符
字符。
- Use string views instead of strings as the split result, saves a lot of allocations.
- If you know you're only
going to be working with chars (in
the [0,255] range), try using a
bitset to test membership instead of
find
ing into the delimiter characters.
下面是一个快速的尝试在应用这些想法:
Here's a quick attempt at applying these ideas:
#include <vector>
#include <bitset>
#include <iostream>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <boost/timer.hpp>
using namespace std;
size_t const N = 10000000;
template<typename C>
void test_custom(string const& s, char const* d, C& ret)
{
C output;
bitset<255> delims;
while( *d )
{
unsigned char code = *d++;
delims[code] = true;
}
typedef string::const_iterator iter;
iter beg;
bool in_token = false;
for( string::const_iterator it = s.begin(), end = s.end();
it != end; ++it )
{
if( delims[*it] )
{
if( in_token )
{
output.push_back(typename C::value_type(beg, it));
in_token = false;
}
}
else if( !in_token )
{
beg = it;
in_token = true;
}
}
if( in_token )
output.push_back(typename C::value_type(beg, s.end()));
output.swap(ret);
}
template<typename C>
void test_strpbrk(string const& s, char const* delims, C& ret)
{
C output;
char const* p = s.c_str();
char const* q = strpbrk(p+1, delims);
for( ; q != NULL; q = strpbrk(p, delims) )
{
output.push_back(typename C::value_type(p, q));
p = q + 1;
}
output.swap(ret);
}
template<typename C>
void test_boost(string const& s, char const* delims)
{
C output;
boost::split(output, s, boost::is_any_of(delims));
}
int main()
{
// Generate random text
string text(N, ' ');
for( size_t i = 0; i != N; ++i )
text[i] = (i % 2 == 0)?('a'+(i/2)%26):((i/2)%2?' ':'\t');
char const* delims = " \t[],-'/\\!\"§$%&=()<>?";
// Output strings
boost::timer timer;
test_boost<vector<string> >(text, delims);
cout << "Time: " << timer.elapsed() << endl;
// Output string views
typedef string::const_iterator iter;
typedef boost::iterator_range<iter> string_view;
timer.restart();
test_boost<vector<string_view> >(text, delims);
cout << "Time: " << timer.elapsed() << endl;
// Custom split
timer.restart();
vector<string> vs;
test_custom(text, delims, vs);
cout << "Time: " << timer.elapsed() << endl;
// Custom split
timer.restart();
vector<string_view> vsv;
test_custom(text, delims, vsv);
cout << "Time: " << timer.elapsed() << endl;
// Custom split
timer.restart();
vector<string> vsp;
test_strpbrk(text, delims, vsp);
cout << "Time: " << timer.elapsed() << endl;
// Custom split
timer.restart();
vector<string_view> vsvp;
test_strpbrk(text, delims, vsvp);
cout << "Time: " << timer.elapsed() << endl;
return 0;
}
这编译使用GCC 4.5.1与 -O4
标志升压1.46.1使我得到:
Compiling this with Boost 1.46.1 using GCC 4.5.1 with the -O4
flag enabled I get:
- 时间:5.951(Boost.Split +矢量)
- 时间:3.728(Boost.Split +矢量
- 时间:1.662(自定义拆分+矢量)
- 时间:0.144(自定义拆分+矢量)
- 时间:2.13(Strpbrk +矢量)
- 时间:0.527(Strpbrk +矢量)
请注意:有输出稍有差别为空的令牌是由我的自定义功能下降。但是你可以,如果你决定使用这个code适应您的需求。
NOTE: There's a slight difference in the output as empty tokens are dropped by my custom function. But you can adapt this code to your needs if you decide to use it.
这篇关于快速字符串分割与多个分隔符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!