_efficiently_从std :: string获取第n个单词 [英] _efficiently_ obtain n-th word from a std::string

查看:120
本文介绍了_efficiently_从std :: string获取第n个单词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述



嗨!


我需要一个例程:


std :: string nth_word(const std :: string& s,unsigned int n){

//从字符串返回第n个字,n是从0开始的

// if'' s''包含太少的单词,返回""

//''words''是非空白字符的任何序列

//领先,尾随和多个空格字符

//应该被忽略。

//例如。 这些是四个,而不是这些。

}


我当前正在使用类似的东西:


std :: string nth_word(const std :: string& source,unsigned int n){

//添加 "允许提取最后一个

//单词,之后如果没有空格,则ss将在下面eof()

stringstream ss(source +"" );

string s;

for(unsigned int k = 0; k< = n; k ++){

ss> s;

if(!ss.good())return"" ;; // eof

}

返回s;

}


这很好,除了它表现不佳。在我被指责过早优化之前,我要告诉你

,我描述了我的代码,超过50%的时间花在了这个

例程。这并不让我感到惊讶 - 我正在以GB的顺序从文本文件中提取单词

,这需要烦人的

长...

我正在考虑find_first_not_of和

find_first_of的组合,但在我编码之前,也许有人可以对此进行评论吗?我有一种直觉,有些讨厌的

strtok hack会更快,是吗?或者是否有一些其他的,以性能为导向的方式,如使用指针遍历

s.c_str()并记忆相关部分?


TIA,

- J.

解决方案

Jacek Dziedzic写道:


>

嗨!


我需要一个例程:


std :: string nth_word(const std :: string& s,unsigned int n){

//从字符串返回第n个字,n是从0开始的

//如果''s''包含太少的单词,则返回""

//''words''是任何非空白字符序列

//前导,尾随和多个空白字符

//应该被忽略。

//例如。 这些是四个,而不是这些。

}


我当前正在使用类似的东西:


std :: string nth_word(const std :: string& source,unsigned int n){

//添加 "允许提取最后一个

//单词,之后如果没有空格,则ss将在下面eof()

stringstream ss(source +"" );

string s;

for(unsigned int k = 0; k< = n; k ++){

ss> s;

if(!ss.good())return"" ;; // eof

}

返回s;

}


这很好,除了它表现不佳。在我被指责过早优化之前,我要告诉你

,我描述了我的代码,超过50%的时间花在了这个

例程。这并不让我感到惊讶 - 我正在以GB的顺序从文本文件中提取单词

,这需要烦人的

长...

我正在考虑find_first_not_of和

find_first_of的组合,但在我编码之前,也许有人可以对此进行评论吗?我有一种直觉,有些讨厌的

strtok hack会更快,是吗?或者是否有一些其他的,以性能为导向的方式,如使用指针遍历

s.c_str()并记忆相关部分?



Nah。如果效率至关重要,我发现istream接口是一个不好的解决方案。


没有什么比状态机更好了。


//警告 - 脑转储 - 未编译 - 未经测试 - 使用你的

自己的危险


#include< cctype>


std :: string nth_word(const std :: string& source,unsigned int n)

{


enum state_type {inspace,inword};


std :: string :: const_iterator l_iter = source.begin();

const std :: string :: const_iterator l_end = source.end();


if(l_end == l_iter)

{

return"" ;;

}

std :: string :: const_iterator l_beg_word = l_iter;


state_type state =(std :: isspace(* l_iter))? inspace:inword;


int word_count = state == inspace? 0:1;


++ l_iter;


while(l_end!= l_iter)

{


开关(状态)

{

案件空间:


if(std: :isspace(* l_iter))break;

state = inword;

l_beg_word = l_iter;

++ word_count;

休息;


案例内容:


if(!std :: isspace(* l_iter))break;

state = inspace;


if(n == word_count)

{

return std :: string( l_beg_word,l_iter);

}

休息;

}


++ l_iter; < br $>
}


开关(州)

{

案件空间:

返回" ;;

案例inword:

if(n == word_count)

{

返回std :: string(l_beg_word,l_iter);

}

返回"";

}


返回" ;;

}

Gianni Mariani写道:


Jacek Dziedzic写道:



....


>这很好,除了它表现不佳。在我发火之前



....


哦 - 我忘了提及,如果你稍微扩展状态机,你可以将状态机应用到整个文件。然后你可以将整个文件映射到

,并且永远不要将文件复制到缓冲区中。





Jacek Dziedzic写道:


我需要一个例程:


std :: string nth_word(const std :: string& s,unsigned int n){

//从字符串返回第n个字,n是从0开始的

//如果是''包含太少的单词,返回'"

//''words''是非空白字符的任何序列

//领先,尾随和多个空格字符

//应该被忽略。

//例如。 这些是四个,而不是这些。

}


我当前正在使用类似的东西:


std :: string nth_word(const std :: string& source,unsigned int n){

//添加 "允许提取最后一个

//单词,之后如果没有空格,则ss将在下面eof()

stringstream ss(source +"" );

string s;

for(unsigned int k = 0; k< = n; k ++){

ss> s;

if(!ss.good())return"" ;; // eof

}

返回s;

}


这很好,除了它表现不佳。在我被指责过早优化之前,我要告诉你

,我描述了我的代码,超过50%的时间花在了这个

例程。这并不让我感到惊讶 - 我正在以GB的顺序从文本文件中提取单词

,这需要烦人的

long ...


我担心你的努力不会有多大帮助。

如果你的第n个单词从你的文本文件中的第i个位置开始,你将会有/>
来检查所有前面的位置以便知道(好吧,不完全是

是真的,在某些情况下你可以跳过一个角色,但那不是
真的很重要)。如果你的文字包含数十亿个字符,那可能需要花费很长时间。


可以避免的单个开销是复制你的字符串

根本不需要。它当然足以迭代

字符串并计算字数,直到找到第n个字。

如果你依赖于体系结构,mmx或者其他流媒体指令

也许可以帮助你(我真的不知道...... mmx有8个打包字节的比较

指令,但我想知道是否这将提供任何

加速,因为你必须检查每个字节的结果)


我在想一个find_first_not_of和

find_first_of的组合,但在我编码之前,也许有人可以对此进行评论吗?我有一种直觉,有些讨厌的

strtok hack会更快,是吗?或者是否有一些其他的,以性能为导向的方式,如使用指针遍历

s.c_str()并记忆相关部分?



我不推荐任何这些...


使用string :: iterator。检查个别角色。有一个

当前状态(单词,非单词),根据你读取的
更新状态。计数切换。从第n个单词构造一个字符串。


Markus



Hi!

I need a routine like:

std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if ''s'' contains too few words, return ""
// ''words'' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t\t".
}

I am currenlty using something like:

std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >s;
if(!ss.good()) return ""; // eof
}
return s;
}

which is fine, except it performs poorly. Before I''m flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files in the order of GB and it takes annoyingly
long...

I''m thinking of a combination of find_first_not_of and
find_first_of, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?

TIA,
- J.

解决方案

Jacek Dziedzic wrote:

>
Hi!

I need a routine like:

std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if ''s'' contains too few words, return ""
// ''words'' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t\t".
}

I am currenlty using something like:

std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >s;
if(!ss.good()) return ""; // eof
}
return s;
}

which is fine, except it performs poorly. Before I''m flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files in the order of GB and it takes annoyingly
long...

I''m thinking of a combination of find_first_not_of and
find_first_of, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?

Nah. If efficiency is critical, I have found the istream interface a
bad solution.

Nothing like a state machine.

// warning - brain dump - not compiled ever - not tested - use at your
own peril

#include <cctype>

std::string nth_word(const std::string& source, unsigned int n)
{

enum state_type { inspace, inword };

std::string::const_iterator l_iter = source.begin();
const std::string::const_iterator l_end = source.end();

if ( l_end == l_iter )
{
return "";
}
std::string::const_iterator l_beg_word = l_iter;

state_type state = ( std::isspace( * l_iter ) ) ? inspace : inword;

int word_count = state == inspace ? 0 : 1;

++ l_iter;

while ( l_end != l_iter )
{

switch ( state )
{
case inspace :

if ( std::isspace( * l_iter ) ) break;
state = inword;
l_beg_word = l_iter;
++ word_count;
break;

case inword :

if ( ! std::isspace( * l_iter ) ) break;
state = inspace;

if ( n == word_count )
{
return std::string( l_beg_word, l_iter );
}
break;
}

++ l_iter;
}

switch ( state )
{
case inspace :
return "";
case inword :
if ( n == word_count )
{
return std::string( l_beg_word, l_iter );
}
return "";
}

return "";
}


Gianni Mariani wrote:

Jacek Dziedzic wrote:

....

> which is fine, except it performs poorly. Before I''m flamed

....

oh - I forgot to mention, if you extend the state machine a little, you
can apply the state machine to an entire file. Then you can mmap the
entire file and never copy the file into and out of buffers.


Hi

Jacek Dziedzic wrote:

I need a routine like:

std::string nth_word(const std::string &s, unsigned int n) {
// return n-th word from the string, n is 0-based
// if ''s'' contains too few words, return ""
// ''words'' are any sequences of non-whitespace characters
// leading, trailing and multiple whitespace characters
// should be ignored.
// eg. "These are four\t\twords\t\t".
}

I am currenlty using something like:

std::string nth_word(const std::string& source, unsigned int n) {
// the addition of " " allows for the extraction of the last
// word, after which ss would go eof() below if not for the space
stringstream ss(source+" ");
string s;
for(unsigned int k=0;k<=n;k++) {
ss >s;
if(!ss.good()) return ""; // eof
}
return s;
}

which is fine, except it performs poorly. Before I''m flamed
with accusations of premature optimization, let me tell you
that I profiled my code and over 50% of time is spent in this
routine. This does not surprise me -- I am extracting words
from text files in the order of GB and it takes annoyingly
long...

I''m afraid that your efforts will not help much.
If your n-th word starts at position i in your text file, you will have
to inspect all preceding positions to know that (well, not entirely
true, in some circumstances you can skip a character, but that doesn''t
really matter). If your text contains billions of characters, that might
take very long.

The single overhead that could be avoided is copying strings that you
don''t need at all. It would of course be enough to iterate over the
string and count the number of words until you have found the n-th word.
If you go architecture dependent, mmx or other streaming instructions
might be able to help you (I really don''t know... mmx has comparison
instructions for 8 packed bytes, but I wonder if this will provide any
speed-up, as you would have to check results for each byte)

I''m thinking of a combination of find_first_not_of and
find_first_of, but before I code it, perhaps somebody can
comment on this? I have a gut feeling that some nasty
strtok hack would be even faster, would it? Or is there
perhaps some other, performance-oriented way like traversing
s.c_str() with a pointer and memcpying out the relevant part?

I would not recommend any of those...

Use a string::iterator. Inspect the individual characters. Have a
current state (word, non-word), update the state according to what you
read. Count toggles. Construct a string from the n-th word.

Markus


这篇关于_efficiently_从std :: string获取第n个单词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆