Efficient string tokenizer


Problem description


I'm looking for a fast way to split a string into individual tokens.
The typical format of the string is token1|token2|token3|...|tokenN|
and the number of tokens varies (which is why I use a vector to hold
the tokens). My current implementation uses the following method to
split up the string:

    vector<char*> tokenize(char* str, char delim)
    {
        vector<char> theString;
        vector<char*> tokenList;
        for(int i=0; i<(int)strlen(str); i++)
        {
            if(str[i] == delim)
            {
                char *pChar = new char[theString.size() + 1];
                memset(pChar, 0, theString.size() + 1);

                for (int f = 0; f < (int)theString.size(); f++)
                {
                    pChar[f] = theString[f];
                }
                tokenList.push_back(pChar);
                theString.clear();
            }
            else
            {
                theString.push_back(str[i]);
            }
        }
        theString.clear();
        return tokenList;
    }

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?

Solution

Alex wrote:

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?



The current method constructs two vectors, presumably of size 0,
scans the memory block, and on each iteration appends an element to the
end of one of them (which could make either vector grow multiple times
over the course of the function). Also, every iteration of the loop
calls strlen(), and every time the delimiter is encountered, memory is
dynamically allocated for 'pChar' (which I assume is deleted later).
There may also be a copy immediately after the function returns, since
I suspect that tokenList has to be copied into its destination variable.

First, I would remove the call to strlen() from the loop condition and
use a variable that stores the value it returns. Technically, you don't
even need the call to strlen(), since you are scanning the string
anyway. You could also reduce the number of times the vector has to
grow by reserve()ing some memory, and you would gain some speed by
eliminating the dynamic allocation of memory. Maybe you could pass a
reference or pointer to the destination vector<char*> to the function
to avoid invoking a copy routine.

I hope I was of help; a lot of my C++ knowledge seems to hide itself in
the depths of my memory when I am not currently working with it.

Good luck,
Ernest


"Alex" <ab****@ncsu.edu> wrote in message
news:a7**************************@posting.google.com...

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?



I'd remove the memset() call, since it sets all of the memory to zero
when you really only need the last element of the array set to zero. Also,
I'd move the call to strlen() so it isn't called at every iteration of the
loop. If you wanted to go over the top, you might write code like this:

    #include <cstring>   // memcpy, strlen
    #include <vector>
    using std::vector;

    // Copies the characters in v into a new null-terminated array.
    // The caller is responsible for delete[]ing the result.
    inline char * toarray( const vector<char>& v ) {
        char * ret = new char[ v.size()+1 ];
        if ( !v.empty() ) {
            memcpy( ret, &v[0], v.size() );
        }
        ret[ v.size() ] = '\0';
        return ret;
    }

    vector<char *> tokenize ( char * in, char delim ) {
        vector<char *> tokens;
        vector<char> currToken;

        size_t len = strlen( in );
        // some reasonable, arbitrary guesses for sizes
        tokens.reserve( len );
        currToken.reserve( 8 );

        for( size_t i = 0; i < len; ++i ) {
            if ( in[i] == delim ) {
                tokens.push_back( toarray(currToken) );
                currToken.clear();
            }
            else {
                currToken.push_back( in[i] );
            }
        }

        // whatever follows the last delimiter (possibly an empty token)
        tokens.push_back( toarray(currToken) );
        return tokens;
    }

But I probably wouldn't bother with all of that. Also, I think it would be
simpler if the code used std::string or std::vector<char> instead of C-style
strings, because C-style strings can be a headache, and code that uses
std::vector<char*> could accidentally leak memory when an exception occurs.

--
David Hilsee
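
David's closing remark about std::string could look something like the sketch below. The function name tokenize_strings and the choice to keep non-empty text after the last delimiter are assumptions made for illustration, not something taken from the thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical std::string variant (not from the thread). Each token
// owns its storage, so there is nothing to delete[] and nothing to leak
// if push_back throws.
std::vector<std::string> tokenize_strings(const std::string& in, char delim)
{
    std::vector<std::string> tokens;
    std::size_t start = 0;             // index where the current token begins
    std::size_t pos;
    while ((pos = in.find(delim, start)) != std::string::npos) {
        tokens.push_back(in.substr(start, pos - start));
        start = pos + 1;
    }
    // Assumption: non-empty text after the last delimiter is kept as a
    // final token; for "token1|...|tokenN|" there is none.
    if (start < in.size()) {
        tokens.push_back(in.substr(start));
    }
    return tokens;
}
```

With the input format from the question, tokenize_strings("token1|token2|token3|", '|') produces three tokens, and the vector cleans up after itself when it goes out of scope.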


On 31 Jul 2004 21:36:51 -0700, Alex <ab****@ncsu.edu> wrote:

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?



Quite a few.

1) Don't call strlen every time around the loop (in fact, don't call strlen
at all).

2) Don't use a temporary vector (theString).

3) Don't use memset when you are simply overwriting those same characters
later on.

4) Don't return vectors by value; pass a reference instead.

5) Consider using vector<string> instead of vector<char*>. It's easier
to handle and likely to be more efficient as well if you have an
implementation that uses the short string optimisation.

Overall, this should be a lot more efficient:

    #include <string>
    #include <vector>
    using std::string;
    using std::vector;

    void tokenize(const char* str, char delim, vector<string>& tokenList)
    {
        tokenList.clear();
        int start = -1;   // index of the previous delimiter, -1 before the first
        for (int i = 0; str[i]; ++i)
        {
            if (str[i] == delim)
            {
                // the token is the text between the previous delimiter and this one
                tokenList.push_back(string(str + start + 1, str + i));
                start = i;
            }
        }
        // text after the last delimiter is ignored, as in the original function
    }

This is untested code.

john
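
One practical effect of point 4 is that the caller can reuse a single vector across many calls, since clear() leaves the vector's capacity in place. The sketch below repeats John's function so that it compiles on its own; count_tokens is a hypothetical caller added only to illustrate this.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// John's function from the post above, repeated so this block is
// self-contained.
void tokenize(const char* str, char delim, std::vector<std::string>& tokenList)
{
    tokenList.clear();                 // drops the elements, keeps the capacity
    int start = -1;
    for (int i = 0; str[i]; ++i) {
        if (str[i] == delim) {
            tokenList.push_back(std::string(str + start + 1, str + i));
            start = i;
        }
    }
}

// Hypothetical caller (not from the thread): counts the tokens in many
// lines while reusing one destination vector.
std::size_t count_tokens(const std::vector<const char*>& lines, char delim)
{
    std::vector<std::string> tokens;   // constructed once, reused for every line
    std::size_t total = 0;
    for (std::size_t n = 0; n < lines.size(); ++n) {
        tokenize(lines[n], delim, tokens);
        total += tokens.size();
    }
    return total;
}
```

Whether this is measurably faster than returning by value depends on the compiler and the data, so it is worth profiling before relying on it.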

