string tokenizing


Problem Description


I looked on google for an answer, but I didn't find anything short of
using boost which sufficiently answers my question: what is a good way
of doing string tokenization (note: I cannot use boost). For example, I
have tried this:

#include <algorithm>
#include <cctype>
#include <climits>
#include <deque>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int
main()
{
string delim;
int c;

/* fill delim */
for(c=0; c < CHAR_MAX; c++){ // I tried #include <limits>, but failed...
if((isspace(c) || ispunct(c))
&& !(c == '_' || c == '#'))
delim += c;
}

string buf;
string::size_type op, np;
deque<string> tok;

while(std::getline(cin, buf) && !cin.fail()){
op = 0;
while((np=buf.find_first_of(delim, op)) != buf.npos){
tok.push_back(string(&buf[op], np-op));
if((op=buf.find_first_not_of(delim, np)) == buf.npos)
break;
}
tok.push_back(string(&buf[op]));

cout << buf << endl;
copy(tok.begin(), tok.end(), ostream_iterator<string>(cout,
"\n"));
cout << endl;
tok.clear();
}
return 0;
}

The inner loop basically finds tokens delimited by any character in
delim where multiple delimiters may appear between tokens (algorithm
follows some advice found on clc++). However, the method seems a little
clumsy, especially with respect to temporary objects. (Also, it does not
seem to work correctly. For example, the last token gets corrupted in
the second outer loop iteration.)
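
A likely cause, judging only from the code above (this is a guess, not something stated in the thread): when the inner loop breaks because find_first_not_of() returned npos, op is still npos, so the final tok.push_back(string(&buf[op])) indexes far past the end of buf. A guard along these lines would avoid it:

if(op != buf.npos)                  // skip the tail push when the line ended in delimiters
    tok.push_back(buf.substr(op));  // substr() also avoids forming the raw &buf[op] pointer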

Also, it would be very nice to have a function like

int tokenize(const string& s, container<string>& c);

which returns the number of tokens, inserted into the container.
However, how do you write this so c is any container model? I'm not sure
you can since they don't share a base class. Is there any better way?

Certainly, this is easy to do with a mix of C and C++:

for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
tok.push_back(t);

where buf and delim are essentially char*'s. However, this seems
unsatisfactory as well.

/david

--
Andre, a simple peasant, had only one thing on his mind as he crept
along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
-- unknown

Solution

"David Rubin" <bo***********@nomail.com> wrote in message
news:3F***************@nomail.com...

I looked on google for an answer, but I didn't find anything short of
using boost which sufficiently answers my question: what is a good way
of doing string tokenization (note: I cannot use boost). For example, I
have tried this:
Remarks below.

#include <algorithm>
#include <cctype>
#include <climits>
#include <deque>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int
main()
{
string delim;
int c;

/* fill delim */
for(c=0; c < CHAR_MAX; c++)

Is there a particular reason you're excluding the
value 'CHAR_MAX' from the loop? (using < instead of <=)
{ // I tried #include <limits>, but failed...
What happened?

#include <limits>

std::numeric_limits<char>::max();

should work.

More below.
if((isspace(c) || ispunct(c))
&& !(c == '_' || c == '#'))
delim += c;
}

string buf;
string::size_type op, np;
deque<string> tok;

while(std::getline(cin, buf) && !cin.fail()){
op = 0;
while((np=buf.find_first_of(delim, op)) != buf.npos){
tok.push_back(string(&buf[op], np-op));
if((op=buf.find_first_not_of(delim, np)) == buf.npos)
break;
}
tok.push_back(string(&buf[op]));

cout << buf << endl;
copy(tok.begin(), tok.end(), ostream_iterator<string>(cout,
"\n"));
cout << endl;
tok.clear();
}
return 0;
}

The inner loop basically finds tokens delimited by any character in
delim where multiple delimiters may appear between tokens (algorithm
follows some advice found on clc++). However, the method seems a little
clumsy, especially with respect to temporary objects. (Also, it does not
seem to work correctly. For example, the last token gets corrupted in
the second outer loop iteration.)

Also, it would be very nice to have a function like

int tokenize(const string& s, container<string>& c);

which returns the number of tokens, inserted into the container.
However, how do you write this so c is any container model? I'm not sure
you can since they don't share a base class. Is there any better way?
I find your code interesting, so I'll probably play around
with it for a bit, and let you know if I have any ideas.

But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>.

template <typename T>
T::size_type tokenize(const std::string& s,
T::iterator beg,
T::iterator end)
{
}

For inserting new elements, you can use an iterator
adapter, e.g. std::insert_iterator. You could even
use an output stream as a 'container' using
ostream_iterator.
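
To make that concrete, here is a minimal sketch of the output-iterator approach (my own illustration, not code posted in the thread; the names tokenize_sketch and out are assumptions):

#include <deque>
#include <iostream>
#include <iterator>
#include <string>

// Sketch: copy each token of 'buf' (separated by any character in 'delim')
// to the output iterator 'out'; the container type never appears here.
template <typename OutIt>
void tokenize_sketch(const std::string& buf, const std::string& delim, OutIt out)
{
    std::string::size_type sp = buf.find_first_not_of(delim, 0);
    while(sp != std::string::npos){
        std::string::size_type ep = buf.find_first_of(delim, sp);
        if(ep == std::string::npos)
            ep = buf.length();
        *out++ = buf.substr(sp, ep - sp);
        sp = buf.find_first_not_of(delim, ep);
    }
}

// Possible calls:
//   std::deque<std::string> tok;
//   tokenize_sketch(buf, delim, std::back_inserter(tok));
//   tokenize_sketch(buf, delim, std::ostream_iterator<std::string>(std::cout, "\n"));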

Certainly, this is easy to do with a mix of C and C++:

for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
tok.push_back(t);
This contradicts your parameter type of const reference to string,
since 'strtok()' modifies its argument.
where buf and delim are essentially char*'s. However, this seems
unsatisfactory as well.



Yes, 'strtok()' can be problematic, if only for the reason that
it modifies its argument, necessitating creation of a copy if
you want to keep the argument const.
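
As an illustration of that point (again my own sketch, not from the thread), the const argument can be copied into a writable buffer before handing it to strtok():

#include <cstring>
#include <deque>
#include <string>
#include <vector>

// Sketch: strtok() writes NUL bytes into its argument, so work on a private copy.
inline std::deque<std::string> strtok_tokenize(const std::string& s, const std::string& delim)
{
    std::vector<char> buf(s.begin(), s.end());
    buf.push_back('\0');                       // strtok() needs a NUL-terminated C string

    std::deque<std::string> tok;
    for(char* t = std::strtok(&buf[0], delim.c_str()); t != 0; t = std::strtok(0, delim.c_str()))
        tok.push_back(t);
    return tok;
}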

HTH,
-Mike


"Mike Wahler" <mk******@mkwahler.net> wrote in message
news:Fk*****************@newsread4.news.pas.earthlink.net...

template <typename T>
T::size_type tokenize(const std::string& s,
T::iterator beg,
T::iterator end)
{
}



Oops, I meant to make those iterator parameters const refs
as well

const T::iterator& beg, const T::iterator& end

-Mike


Mike Wahler wrote:

[snip]

{ // I tried #include <limits>, but failed...
What happened?



foo.cc:9: limits: No such file or directory

For some reason, my compiler can't find the file. Otherwise, I agree
with you...
#include <limits>

std::numeric_limits<char>::max();

should work.
[snip] But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>.

template <typename T>
T::size_type tokenize(const std::string& s,
T::iterator beg,
T::iterator end)
{
}



This is a nice idea! I knew you should be able to do this, but I
couldn't see how. Here is the refactored code:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
{
string word;
string::size_type sp, ep; // start/end position

sp = 0;
do{
sp = buf.find_first_not_of(delim, sp);
ep = buf.find_first_of(delim, sp);
if(sp != ep){
if(ep == buf.npos)
ep = buf.length();
word = buf.substr(sp, ep-sp);
*ii++ = lc(word); // lc() is a helper from the poster's own code, not shown in the post
sp = buf.find_first_not_of(delim, ep+1);
}
}while(sp != buf.npos);

if(sp != buf.npos){
word = buf.substr(sp, buf.length()-sp);
*ii++ = lc(word);
}
}

called as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens,
tokens.begin()));

The original spec returned the number of tokens parsed. Now I have to
settle for checking

if(tokens.size() > 0){ ... }
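
One way to get the count back without giving up the iterator interface (a sketch under the same assumptions as the refactored code above, not something posted in the thread):

#include <string>

// Sketch: same shape as the refactored tokenize(), but count insertions and return the count.
template <typename InsertIter>
std::string::size_type
tokenize_counted(const std::string& buf, const std::string& delim, InsertIter ii)
{
    std::string::size_type count = 0;
    std::string::size_type sp = buf.find_first_not_of(delim, 0);
    while(sp != std::string::npos){
        std::string::size_type ep = buf.find_first_of(delim, sp);
        if(ep == std::string::npos)
            ep = buf.length();
        *ii++ = buf.substr(sp, ep - sp);
        ++count;
        sp = buf.find_first_not_of(delim, ep);
    }
    return count;
}

// e.g.  if(tokenize_counted(buf, delim, std::inserter(tokens, tokens.begin())) > 0){ ... }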

/david

--
Andre, a simple peasant, had only one thing on his mind as he crept
along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
-- unknown

