Tokenizer Function (plus rant on strtok documentation)


Question


A couple of days ago I decided to force myself to really learn
exactly what "strtok" does, and how to use it. I figured I'd
just look it up in some book and that would be that.

I figured wrong!

Firstly, Bjarne Stroustrup's "The C++ Programming Language" said:

(nothing)

Ok, how about a C book? Steven Prata's "C Primer Plus" said:

(nothing)

Aaarrrggg. Ok, how about good old Randy Schildt and his book
"C++: Complete Reference"? It said:

   #include <cstring>
   char *strtok(char *str1, const char *str2);

   The strtok() function returns a pointer to the next token in
   the string pointed to by str1. The characters making up the
   string pointed to by str2 are the delimiters that determine
   the token. A null pointer is returned when there is no token
   to return. To tokenize a string, the first call to strtok()
   must have str1 point to the string being tokenized. Subsequent
   calls must use a null pointer for str1. In this way, the entire
   string can be reduced to its tokens. It is possible to use a
   different set of delimiters for each call to strtok().

Ok. But when I tried using the function, it didn't do what I
expected at all. For one thing, it severely alters the contents
of its first argument. Randy Schildt's book doesn't mention that
little factoid at all. :-( Bad Randy!
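
To make that alteration visible, here is a minimal sketch. The helper name BufferAfterFirstToken and the sample input in the note below are made up for illustration; the only library behaviour it relies on is that std::strtok overwrites the delimiter that ends a token with a null character.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: runs strtok once over a private copy of Text and
// returns the raw bytes of that copy afterwards, embedded NULs included.
std::string BufferAfterFirstToken(std::string const & Text,
                                  std::string const & Delimiters)
{
   // strtok needs a writable, NUL-terminated array, so copy the text.
   std::vector<char> Buffer(Text.begin(), Text.end());
   Buffer.push_back('\0');

   // One call is enough to see the first delimiter get overwritten.
   std::strtok(&Buffer[0], Delimiters.c_str());

   // Return every byte except the final terminator.
   return std::string(Buffer.begin(), Buffer.end() - 1);
}
```

Calling it with "one,two" and "," gives back a 7-byte string in which the comma at index 3 has become '\0', which is why the original text has to be copied before it is handed to strtok.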

I had to google this function and find info on it on the web in
order to find out how it really works. Turns out, there are lots
of things missing in Schildt's description. (But hey, at least
he tried. Most other C/C++ authors chicken out and won't even
touch strtok in their books.) This is how this function REALLY
works:

http://www.opengroup.org/onlinepubs/.../strtok_r.html

I wish more authors would cover this useful function in their
books. After all, it IS a part of both the C and C++ standard
libraries. Ok, I'm done ranting now.

For your amusement, here is a function I wrote to break a string
into tokens, given a string of "separator" characters, and put
the tokens in a std::vector<std::string>. I'm sure there are
various ways this could be improved. Comments? Slings? Arrows?
#include <cstring>   // std::strtok, std::strncpy, std::memset
#include <string>
#include <vector>

void
Tokenize
   (
      std::string const & RawText,
      std::string const & Delimiters,
      std::vector<std::string> & Tokens
   )
{
   // Load raw text into an appropriately-sized dynamic char array.
   // strtok writes into its first argument, so it needs a mutable copy:
   std::size_t StrSize = RawText.size();
   std::size_t ArraySize = StrSize + 5;
   char* Ptr = new char[ArraySize];
   std::memset(Ptr, 0, ArraySize);   // zero fill guarantees NUL termination
   std::strncpy(Ptr, RawText.c_str(), StrSize);

   // Clear the Tokens vector:
   Tokens.clear();

   // Get the tokens from the array and put them in the vector.
   // Only the first call passes the array; later calls pass NULL:
   char* TokenPtr = NULL;
   char* TempPtr = Ptr;
   while (NULL != (TokenPtr = std::strtok(TempPtr, Delimiters.c_str())))
   {
      Tokens.push_back(std::string(TokenPtr));
      TempPtr = NULL;
   }

   // Free memory and scram:
   delete[] Ptr;
   return;
}
--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)
http://home.pacbell.net/earnur/

Recommended answer

Robbie Hatley wrote:

> A couple of days ago I decided to force myself to really learn
> exactly what "strtok" does, and how to use it.
> This is how this function REALLY
> works:
>
> http://www.opengroup.org/onlinepubs/.../strtok_r.html
>
> I wish more authors would cover this useful function in their
> books. After all, it IS a part of both the C and C++ standard
> libraries. Ok, I'm done ranting now.




strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.
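
As an illustration of the interleaving point, here is a sketch of a nested tokenization that plain strtok cannot do, because the inner loop would clobber the outer loop's hidden state. It assumes a POSIX system where strtok_r is declared in <string.h>; the helper name SplitRecords is invented for the example.

```cpp
#include <string.h>   // strtok_r (POSIX, not part of standard C++)
#include <string>
#include <vector>

// Hypothetical helper: splits Text into records, and each record into
// fields, by running two tokenizations at the same time. Each strtok_r
// loop keeps its position in its own state pointer.
std::vector<std::vector<std::string> >
SplitRecords(std::string const & Text,
             char const * RecordDelims,
             char const * FieldDelims)
{
   // strtok_r also writes into its argument, so work on a copy.
   std::vector<char> Buffer(Text.begin(), Text.end());
   Buffer.push_back('\0');

   std::vector<std::vector<std::string> > Records;
   char* OuterState = NULL;
   for (char* Record = strtok_r(&Buffer[0], RecordDelims, &OuterState);
        Record != NULL;
        Record = strtok_r(NULL, RecordDelims, &OuterState))
   {
      std::vector<std::string> Fields;
      char* InnerState = NULL;
      for (char* Field = strtok_r(Record, FieldDelims, &InnerState);
           Field != NULL;
           Field = strtok_r(NULL, FieldDelims, &InnerState))
      {
         Fields.push_back(Field);
      }
      Records.push_back(Fields);
   }
   return Records;
}
```

For example, SplitRecords("a=1;b=2", ";", "=") yields two records, {"a", "1"} and {"b", "2"}.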






I guess tying the tokenizer to vector<string> is not a good idea. If it
took an output iterator it could be used with any container or even
with things like ostream_iterators. Here is my attempt, which also gets
rid of strtok:

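#include <string>
using namespace std;

// Writes each token of str to the output iterator oi.
template <class OIter>
void tokenize(const string &str,
              const string &delim,
              OIter oi)
{
    typedef string::size_type Sz;

    // Skip leading delimiters, as strtok does.
    Sz begin = str.find_first_not_of(delim);
    while (begin < str.size()) {   // also false once begin == npos
        Sz end = str.find_first_of(delim, begin);
        *oi++ = str.substr(begin, end - begin);
        // Skip the whole run of delimiters that follows the token.
        begin = str.find_first_not_of(delim, end);
    }
}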

I use find_first_not_of in order to be compatible with strtok's
behaviour of treating multiple adjacent delimiters as a single
delimiter. I have not measured the performance of this version against
the strtok version.
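
To illustrate the "any container or even ostream_iterators" point, here is a small sketch that feeds the same function into two different sinks. The copy of tokenize below is a lightly adapted restatement (std:: qualified, leading delimiters skipped up front), and the helper names tokens_as_lines and unique_tokens are invented for the example.

```cpp
#include <iterator>
#include <set>
#include <sstream>
#include <string>

// Restated output-iterator tokenizer (adapted, see the note above).
template <class OIter>
void tokenize(const std::string &str, const std::string &delim, OIter oi)
{
    typedef std::string::size_type Sz;
    Sz begin = str.find_first_not_of(delim);
    while (begin < str.size()) {
        Sz end = str.find_first_of(delim, begin);
        *oi++ = str.substr(begin, end - begin);
        begin = str.find_first_not_of(delim, end);
    }
}

// Hypothetical helper: streams the tokens, one per line, into a string.
std::string tokens_as_lines(const std::string &str, const std::string &delim)
{
    std::ostringstream out;
    tokenize(str, delim, std::ostream_iterator<std::string>(out, "\n"));
    return out.str();
}

// Hypothetical helper: collects the distinct tokens in a std::set.
std::set<std::string> unique_tokens(const std::string &str,
                                    const std::string &delim)
{
    std::set<std::string> result;
    tokenize(str, delim, std::inserter(result, result.end()));
    return result;
}
```

Neither helper needs a change to tokenize itself; only the iterator passed as the third argument differs.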


jmoy wrote:

> I guess tying the tokenizer to vector<string> is not a good idea. If it
> took an output iterator it could be used with any container or even
> with things like ostream_iterators. Here is my attempt, which also gets
> rid of strtok:
>
> [...]

Check out http://www.boost.org/libs/tokenizer/index.html
One cool thing about the Boost tokenizer is that you can get empty
(NULL) tokens if you have adjacent separators, which I believe can't be
handled by strtok.

Thanks and regards
SJ
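
For readers without Boost at hand, the empty-token behaviour can be sketched with the standard library alone. This is not the Boost API; the helper name SplitKeepEmpty is invented here, and it simply treats every single delimiter as the end of a field.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost: every delimiter character ends
// a field, so two adjacent delimiters produce an empty string between them.
std::vector<std::string> SplitKeepEmpty(std::string const & Text,
                                        std::string const & Delimiters)
{
   std::vector<std::string> Fields;
   std::string::size_type Begin = 0;
   for (;;)
   {
      std::string::size_type End = Text.find_first_of(Delimiters, Begin);
      if (End == std::string::npos)
      {
         // Last field: everything after the final delimiter (may be empty).
         Fields.push_back(Text.substr(Begin));
         break;
      }
      Fields.push_back(Text.substr(Begin, End - Begin));
      Begin = End + 1;   // step over exactly one delimiter
   }
   return Fields;
}
```

With the input "a,,b" and the delimiter "," this returns three fields, the middle one empty, whereas strtok would report only "a" and "b".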





jmoy wrote:

> I guess tying the tokenizer to vector<string> is not a good idea. If it
> took an output iterator it could be used with any container or even
> with things like ostream_iterators. Here is my attempt, which also gets
> rid of strtok:
>
> [...]

I like this implementation, but don't you assume the space for the data
(tokens) is already pre-allocated?
If I use your function with something like this, I get a segmentation fault:

std::vector<std::string> v;
std::vector<std::string>::iterator it = v.begin();
tokenize<std::vector<std::string>::iterator>("a b c", " ", it);
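
A likely explanation, offered as a sketch rather than a verdict: v is empty, so v.begin() refers to no storage, and writing through it is undefined behaviour. An output iterator that grows the container, such as std::back_inserter, avoids the need to pre-allocate. The copy of tokenize below is a lightly adapted restatement, and tokenize_to_vector is a name invented for the example.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Restated output-iterator tokenizer (adapted from the post quoted above).
template <class OIter>
void tokenize(const std::string &str, const std::string &delim, OIter oi)
{
    typedef std::string::size_type Sz;
    Sz begin = str.find_first_not_of(delim);
    while (begin < str.size()) {
        Sz end = str.find_first_of(delim, begin);
        *oi++ = str.substr(begin, end - begin);
        begin = str.find_first_not_of(delim, end);
    }
}

// Hypothetical helper: collects the tokens in a vector that starts empty.
std::vector<std::string> tokenize_to_vector(const std::string &str,
                                            const std::string &delim)
{
    std::vector<std::string> v;
    // back_inserter turns each assignment into v.push_back(token),
    // so the vector grows as tokens arrive.
    tokenize(str, delim, std::back_inserter(v));
    return v;
}
```

With this, tokenize_to_vector("a b c", " ") returns the three tokens "a", "b" and "c" without any space being reserved beforehand.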

