模式匹配 - 字符串 [英] Pattern Matching - Strings

查看:62
本文介绍了模式匹配 - 字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在c ++书中阅读了这段代码(Sams - C ++ Footprint and Performance Optimization)。



有一篇文章暗示可以进行字符串比较将字符串视为整数更快。



我对它背后的原因有点困惑?通过这样做,我们将减少将每个字符首先转换为整数然后转换为位或其他东西的机器指令数量吗?



任何人都可以详细说明吗?



I read this piece of code in a c++ book(Sams - C++ Footprint and Performance Optimization).

There is an article suggesting that string comparison can be made faster by treating strings as integer.

I am a little confused about the reason behind it? Is it that by doing so we will lessen the number of machine instructions that will take place for converting each character first to an integer and then to bits or something else?

Can anybody please elaborate?

inline int int_strcmp(char *s1, char *s2, int len)
{
    if ((len & 0x03) != 0)
        return(macro_strcmp(s1, s2));

    for (int i =0; *(int*)&s1[i] == *(int*)&s2[i];)
    {
        i+=4;
        if (i >= len)
            return(true);  // match
    }
    return(false);
}







文章(仅供参考)



加速字符串比较的另一种方法是将字符数组视为整数数组。在大多数系统中,整数是字符的四倍,因此可以同时比较四个字符。当字符串的长度是整数长度的倍数时,进行整数比较可以大大加快字符串比较。清单10.1显示了一个整数字符串比较函数的样子。



清单10.1整数字符串比较

inline int int_strcmp(char * s1, char * s2,int len)

{

if((len& 0x03)!= 0)

return(macro_strcmp(s1, s2));



for(int i = 0; *(int *)& s1 [i] == *(int *)& s2 [i ];)

{

i + = 4;

if(i> = len)

return(真正); //匹配

}

返回(假);

}



int_strcmp函数快速检查给定的字符串长度是否为4的倍数。如果不是这种情况,则调用前面讨论的macro_strcmp函数。对于具有正确长度的字符串,通过将字符指针转换为整数指针从而从字符串中读取整数,将字符以四个为一组进行比较。比较字符串越长 - 它们看起来越相似 - 与本章前面给出的实现相比,int_strcmp函数获得的好处越多。对于我们在dbase.txt文件中查找Small Table文章的示例,此整数实现再次更快(最多50%)。这反映在10Source01.cpp程序的结果中。



字符串长度必须是整数大小的倍数才能使此函数工作的原因是任何其他大小都会导致此函数超出字符串的长度。例如,将在两次循环迭代中比较长度为六的字符串。第一次迭代比较前四个字节,这没有问题。第二次迭代比较后四个字节,其中只有两个是实际字符串的一部分。第三个字节将为空,表示字符串的结尾。但是,第四个字节不是字符串的一部分。由于这些第四个字节的值,可以发现两个相同的六个字符的字符串是不同的。当然可以改变int_compare函数来检查不同的字符串大小,但在大多数字符串比较情况下,它很可能仍然会更快。




Article (for reference)

Another way to speed up string comparison is by treating the character arrays as integer arrays . On most systems an integer is four times larger than a character, and so four characters can be compared at the same time. When the strings have a length that is a multiple of the length of an integer, doing integer comparisons can greatly speed up your string comparison. Listing 10.1 shows what an integer string compare function can look like.

Listing 10.1 Integer String Comparison
inline int int_strcmp(char *s1, char *s2, int len)
{
if ((len & 0x03) != 0)
return(macro_strcmp(s1, s2));

for (int i =0; *(int*)&s1[i] == *(int*)&s2[i];)
{
i+=4;
if (i >= len)
return(true); // match
}
return(false);
}

The int_strcmp function quickly checks whether the given string length is a multiple of four. If this is not the case, the previously discussed macro_strcmp function is called. For strings with a correct length, characters are compared in groups of four by casting the character pointers to integer pointers and thus reading integers from the string. The longer the compared strings are—and the more they look alike—the more benefits this int_strcmp function reaps compared to the previous implementations given in this chapter. For our example of finding Small Table articles in the dbase.txt file, this integer implementation is again faster (up to 50%). This is reflected in the results of the 10Source01.cpp program.

The reason why the string lengths have to be a multiple of the integer size for this function to work is that any other size will cause this function to compare beyond the length of a string. A string with a length of six, for instance, will be compared in two loop iterations. The first iteration compares the first four bytes, which is no problem. The second iteration compares the second four bytes, of which only two are part of the actual string. The third byte will be a null indicating the end of the string. The fourth byte, however, is not part of the string. Two strings of six characters that are identical could be found to be different just because of the value of these fourth bytes. The int_compare function can of course be altered to check for different string sizes but it is not very likely that it will still be faster in most string compare cases.

推荐答案

:笑:不,这比那简单。而且更复杂...



字符是8位数量(是的,是的,我知道,他们不必是 - 让它骑... 。)并且很少有现代机器以8位数量本地工作 - 它们具有使用它们的指令,但它们还具有32或64位宽的总线和寄存器组,可以对更多数据执行类似的操作。 />


当你从内存中取一个字节时,系统实际上取32或64位的值,因为这是它的总线大小所以第一个改进是说如果我们正在拉32位数,为什么再次拉下一个字节呢?只需读取一次数据,然后查看它包含的所有4或8个字节,而无需回到处理器外部以获取更多数据。



但是,这不是什么这是关于。如果您正在读取32位(并且读取时间不超过8),为什么不比较32位而不是8位?它可以在单个指令中完成(与8位比较相同),并且需要相同的时间。因此每次比较可以保存三条指令(最小值)。当你考虑字符串大小时,它们中的大多数将长于4个字节 - 所以我们将通过这样做来节省大量时间。
:laugh: No, it''s simpler than that. And also more complicated...

Characters are 8 bit quantities (yes, yes, I know, they don''t have to be - let it ride...) and very few "modern" machines work natively in 8 bit quantities - they have instructions to work with them, but they also have 32 or 64 bit wide buses and register sets which can do similar operations on a lot more data.

When you fetch a byte from memory, the system actually fetches either a 32 or 64 bit value, because that is its bus size so the first improvement is to say "If we are pulling a 32 bit number, why pull it again for the next byte?" Just read the data once, and look at all 4 or 8 bytes it contains without going back outside the processor for more data.

But, that''s not what this is about. If you are reading 32 bits (and it take no longer than reading 8) why not compare 32 bits instead of 8? It can be done in a single instruction (same as an 8 bit compare) and it takes the same time. So it save three instructions (minimum) per compare. And when you think about string sizes, most of them will be longer than four bytes anyway - so we will save considerable time by doing this.


引用:

我背后的原因有点困惑吗?通过这样做,我们将减少将每个字符首先转换为整数然后转换为位或其他东西的机器指令数量吗?

am a little confused about the reason behind it? Is it that by doing so we will lessen the number of machine instructions that will take place for converting each character first to an integer and then to bits or something else?



你不要t将每个字符转换为整数,而不是将一组4个字符在时间上解释为单个整数并比较这样的整数。这样,在一台机器比较中你执行 4 字符比较。





作为旁注,它看起来像 int_strcmp 与标准 strcmp 的行为不匹配。后者提供了更多信息(我猜他们必须这样做,因为符合 strcmp 将依赖于字节序。)

此外为什么(新鲜的)地狱)签名指定 int 作为返回类型,而函数返回 true false (看起来不是很整洁)?


You don''t convert each character to an integer, instead you interpret a group of 4 characters at time as a single integer and compare such integer. This way, in a single machine comparison you perform 4 characters comparison.


As a side note, it looks that such int_strcmp doesn''t match the behaviour of the standard strcmp. The latter provides more info (I guess they have to do that because conforming to strcmp would be endian-dependent).
Moreover why (the fresh hell) the signature specifies int as return type while the function returns either true or false (doesn''t look neat to me)?


这篇关于模式匹配 - 字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆