最快的方式来获得字符串IPv4地址 [英] Fastest way to get IPv4 address from string

查看:141
本文介绍了最快的方式来获得字符串IPv4地址的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下的code比inet_addr约快7倍。我在想,如果有改善本作某种程度上,它甚至更快,或者更快的替代方案存在。

这code要求一个有效的空终止IPv4地址没有空格,这在我的情况总是方式提供的,所以我对这种情况进行了优化。通常你将有更多的错误检查,但如果有一种方法,使更快以下或更快的替代方案存在我真的AP preciate它。

  UINT32 GetIP(为const char * P)
{
    UINT32 dwIP = 0,dwIP_Part = 0;
    而(真)
    {
        如果(P [0] == 0)
        {
            dwIP =(dwIP&所述;&下; 8)| dwIP_Part;
            打破;
        }
        如果(P [0] =='。')
        {
            dwIP =(dwIP&所述;&下; 8)| dwIP_Part;
            dwIP_Part = 0;
           p ++;
        }
        dwIP_Part =(dwIP_Part * 10)+(P [0] - '0');
        p ++;
    }
    返回dwIP;
}


解决方案

由于我们正在谈论,我建议用一个量化的解决方案,最大限度地提高吞吐量的IP地址的解析。

下面是针对x86的快速解决方案(需要SSE4.1,或者至少SSSE3贫困):

__ m128i shuffleTable [65536] //可以减小256X倍,见@IwillnotexistIdonotexistUINT32 MyGetIP(为const char *海峡){
    __m128i输入= _mm_lddqu_si128((常量__m128i *)STR); //\"192.167.1.3
    输入= _mm_sub_epi8(输入,_mm_set1_epi8('0')); // 1 9 2 254 1 6 7 1 254 254 3 208 245 0 8 40
    __m128i CMP =输入; // ... X ... ... X.X.XX(标志)
    UINT32面膜= _mm_movemask_epi8(CMP); // 6792 - 神奇指数
    __m128i SHUF = shuffleTable [面具] // 10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1
    __m128i ARR = _mm_shuffle_epi8(输入,SHUF); // 3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0
    __m128i coeffs = _mm_set_epi8(0,100,10,1,0,100,10,1,0,100,10,1,0,100,10,1);
    __m128i PROD = _mm_maddubs_epi16(coeffs,ARR); // 3 0 | 1 0 | 67 100 | 92 100
    PROD = _mm_hadd_epi16(PROD,PROD); // 3 | 1 | 167 | 192 | ? | ? | ? | ?
    __m128i IMM = _mm_set_epi8(-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,6,4,2,0);
    PROD = _mm_shuffle_epi8(PROD,IMM); // 3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
    返回_mm_extract_epi32(PROD,0);
//返回(UINT32(_mm_extract_epi16(PROD,1))≤;&下; 16)+ UINT32(_mm_extract_epi16(PROD,0)); //没有SSE 4.1
}

这是 shuffleTable 所需的precalculation:

无效MyInit(){
    memset的(shuffleTable,-1,的sizeof(shuffleTable));
    INT LEN [4];
    为(LEN [0] = 1; LEN [0]&下; = 3; LEN [0] ++)
        为(LEN [1] = 1; LEN [1] = 3; LEN [1] ++)
            为(LEN [2] = 1; LEN [2]所述; = 3; LEN [2] ++)
                为(LEN [3] = 1; LEN [3]&下; = 3; LEN [3] ++){
                    INT SLEN = len个[0] + LEN [1] + LEN [2] + LEN [3] + 4;
                    INT REM = 16 - SLEN;
                    对(INT rmask = 0; rmask&。1所述;&下;对物; rmask ++){
// {INT rmask =(1 <<;&下; REM)-1; //注意:只有最大rmask是可能的,如果字符串是零填充
                        INT面膜= 0;
                        炭SHUF [16] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, - 1,-1};
                        INT POS = 0;
                        对(INT I = 0; I&下; 4;我++){
                            对于(INT J = 0; J&LT; LEN [I]; J ++){
                                SHUF [(3-i)的* 4 +(LEN [Ⅰ] -1-j)条] = POS;
                                POS ++;
                            }
                            掩模^ =(1 <<;&下;正);
                            POS ++;
                        }
                        面膜^ =(rmask&LT;&LT; SLEN);
                        _mm_store_si128(安培; shuffleTable [面具] _mm_loadu_si128((__ m128i *)SHUF));
                    }
                }
}

全部code。与测试是avaliable 这里。在Ivy Bridge处理器其打印:

  C0A70103
时间= 0.406(1556701184)
时间= 3.133(1556701184)

这意味着该建议的解决方案更快的 7.8倍吞吐量比OP的code条款。它处理的 336每秒百万(单3.4 GHz的核心)地址。

现在我会尽力解释它是如何工作的。请注意,在上市的每一行你可以看到刚才计算的值的内容。所有的阵列都印在little-endian顺序(虽然设置内在使用大端)。

首先,我们通过 LDDQU 指令从加载地址不对齐的16个字节。需要注意的是,在64位模式存储器是由16字节数据块分配,因此这将自动工作得很好。在32位可能在理论上会导致超出范围的访问问题。虽然我不相信它真的能。随后code将在后的,最终字节正常工作,无论价值。无论如何,你最好确保每个IP地址需要存储至少16个字节。

然后,我们从所有的字符减去0。之后 '。'变成-2和零变成-48,所有的数字保持非负。现在我们采取的所有字节的标志位掩码用 _mm_movemask_epi8

根据这个面膜的价值,我们取从查找表中的非平凡的16字节洗牌面膜 shuffleTable 。该表是相当大的:1MB总。它需要相当长的一段时间,precompute。然而,它没有考虑在CPU高速缓存precious空间,因为从该表中仅81元件真正使用。这是因为IP地址的各部分既可以是一个,两个,三个数字长=>因此81中总的变体。
需要注意的是随机没用字节后字符串的结尾可能会在原则上事业的查找表中增加了内存占用。

修改的:你可以找到@IwillnotexistIdonotexist的意见修改版本,仅使用的4K大小的查找表(这是一个慢一点,虽然)

巧妙 _mm_shuffle_epi8 内在使我们能够重新安排我们的洗牌面具的字节数。因此XMM寄存器包含四个4字节的块,每个块包含little-endian顺序的数字。我们每个块转换成16位数字按 _mm_maddubs_epi16 然后按 _mm_hadd_epi16 。然后我们重新排序寄存器字节,使整个IP地址占用较低4个字节。

最后,我们提取从XMM寄存器对GP寄存器较低的4个字节。它与SSE4.1内在做( _mm_extract_epi32 )。如果没有它,使用 _mm_extract_epi16 ,但它会运行一个慢一点与其他线路替换它。

最后,这里是生成的程序集(MSVC2013),这样就可以检查你的编译器不产生任何可疑:

  LDDQU将xmm1,XMMWORD PTR [RCX]
psubb将xmm1,xmm6
PMOVMSKB ECX,将xmm1
MOV ECX,ECX //没用,看@PeterCordes和@IwillnotexistIdonotexist
添加RCX,RCX //可以删除,请参阅@EvgenyKluev
PSHUFB将xmm1,XMMWORD PTR [R13 + RCX * 8]
MOVDQA XMM0,XMM8
pmaddubsw XMM0,xmm1中
phaddw XMM0,XMM0
PSHUFB XMM0,XMM7
pextrd EAX,XMM0,0

P.S。如果你还在读书呢,一定要检查出的意见=)

I have the following code which is about 7 times faster than inet_addr . I was wondering if there is a way to improve this to make it even faster or if a faster alternative exists.

This code requires that a valid null terminated IPv4 address is supplied with no whitespace, which in my case is always the way, so I optimized for that case. Usually you would have more error checking, but if there is a way to make the following even faster or a faster alternative exists I would really appreciate it.

UINT32 GetIP(const char *p)
{
    UINT32 dwIP=0,dwIP_Part=0;
    while(true)
    {
        if(p[0] == 0)
        {
            dwIP = (dwIP << 8) | dwIP_Part;
            break;
        }
        if(p[0]=='.') 
        {       
            dwIP = (dwIP << 8) | dwIP_Part;                     
            dwIP_Part = 0;
           p++;
        }
        dwIP_Part = (dwIP_Part*10)+(p[0]-'0');
        p++;
    }
    return dwIP;
}

解决方案

Since we are speaking about maximizing throughput of IP address parsing, I suggest using a vectorized solution.

Here is x86-specific fast solution (needs SSE4.1, or at least SSSE3 for poor):

__m128i shuffleTable[65536];    //can be reduced 256x times, see @IwillnotexistIdonotexist

UINT32 MyGetIP(const char *str) {
    __m128i input = _mm_lddqu_si128((const __m128i*)str);   //"192.167.1.3"
    input = _mm_sub_epi8(input, _mm_set1_epi8('0'));        //1 9 2 254 1 6 7 254 1 254 3 208 245 0 8 40 
    __m128i cmp = input;                                    //...X...X.X.XX...  (signs)
    UINT32 mask = _mm_movemask_epi8(cmp);                   //6792 - magic index
    __m128i shuf = shuffleTable[mask];                      //10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1 
    __m128i arr = _mm_shuffle_epi8(input, shuf);            //3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0 
    __m128i coeffs = _mm_set_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
    __m128i prod = _mm_maddubs_epi16(coeffs, arr);          //3 0 | 1 0 | 67 100 | 92 100 
    prod = _mm_hadd_epi16(prod, prod);                      //3 | 1 | 167 | 192 | ? | ? | ? | ?
    __m128i imm = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 4, 2, 0);
    prod = _mm_shuffle_epi8(prod, imm);                     //3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
    return _mm_extract_epi32(prod, 0);
//  return (UINT32(_mm_extract_epi16(prod, 1)) << 16) + UINT32(_mm_extract_epi16(prod, 0)); //no SSE 4.1
}

And here is the required precalculation for shuffleTable:

void MyInit() {
    memset(shuffleTable, -1, sizeof(shuffleTable));
    int len[4];
    for (len[0] = 1; len[0] <= 3; len[0]++)
        for (len[1] = 1; len[1] <= 3; len[1]++)
            for (len[2] = 1; len[2] <= 3; len[2]++)
                for (len[3] = 1; len[3] <= 3; len[3]++) {
                    int slen = len[0] + len[1] + len[2] + len[3] + 4;
                    int rem = 16 - slen;
                    for (int rmask = 0; rmask < 1<<rem; rmask++) {
//                    { int rmask = (1<<rem)-1;    //note: only maximal rmask is possible if strings are zero-padded
                        int mask = 0;
                        char shuf[16] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
                        int pos = 0;
                        for (int i = 0; i < 4; i++) {
                            for (int j = 0; j < len[i]; j++) {
                                shuf[(3-i) * 4 + (len[i]-1-j)] = pos;
                                pos++;
                            }
                            mask ^= (1<<pos);
                            pos++;
                        }
                        mask ^= (rmask<<slen);
                        _mm_store_si128(&shuffleTable[mask], _mm_loadu_si128((__m128i*)shuf));
                    }
                }
}

Full code with testing is avaliable here. On Ivy Bridge processor it prints:

C0A70103
Time = 0.406   (1556701184)
Time = 3.133   (1556701184)

It means that the suggested solution is 7.8 times faster in terms of throughput than the code by OP. It processes 336 millions of addresses per second (single core of 3.4 Ghz).

Now I'll try to explain how it works. Note that on each line of the listing you can see contents of the value just computed. All the arrays are printed in little-endian order (though set intrinsics use big-endian).

First of all, we load 16 bytes from unaligned address by lddqu instruction. Note that in 64-bit mode memory is allocated by 16-byte chunks, so this works well automatically. On 32-bit it may theoretically cause issues with out of range access. Though I do not believe that it really can. The subsequent code would work properly regardless of the values in the after-the-end bytes. Anyway, you'd better ensure that each IP address takes at least 16 bytes of storage.

Then we subtract '0' from all the chars. After that '.' turns into -2, and zero turns into -48, all the digits remain nonnegative. Now we take bitmask of signs of all the bytes with _mm_movemask_epi8.

Depending on the value of this mask, we fetch a nontrivial 16-byte shuffling mask from lookup table shuffleTable. The table is quite large: 1Mb total. And it takes quite some time to precompute. However, it does not take precious space in CPU cache, because only 81 elements from this table are really used. That is because each part of IP address can be either one, two, three digits long => hence 81 variants in total. Note that random trashy bytes after the end of the string may in principle cause increased memory footprint in the lookup table.

EDIT: you can find a version modified by @IwillnotexistIdonotexist in comments, which uses lookup table of only 4Kb size (it is a bit slower, though).

The ingenious _mm_shuffle_epi8 intrinsic allows us to reorder the bytes with our shuffle mask. As a result XMM register contains four 4-byte blocks, each block contains digits in little-endian order. We convert each block into a 16-bit number by _mm_maddubs_epi16 followed by _mm_hadd_epi16. Then we reorder bytes of the register, so that the whole IP address occupies the lower 4 bytes.

Finally, we extract the lower 4 bytes from the XMM register to GP register. It is done with SSE4.1 intrinsic (_mm_extract_epi32). If you don't have it, replace it with other line using _mm_extract_epi16, but it will run a bit slower.

Finally, here is the generated assembly (MSVC2013), so that you can check that your compiler does not generate anything suspicious:

lddqu   xmm1, XMMWORD PTR [rcx]
psubb   xmm1, xmm6
pmovmskb ecx, xmm1
mov ecx, ecx               //useless, see @PeterCordes and @IwillnotexistIdonotexist
add rcx, rcx               //can be removed, see @EvgenyKluev
pshufb  xmm1, XMMWORD PTR [r13+rcx*8]
movdqa  xmm0, xmm8
pmaddubsw xmm0, xmm1
phaddw  xmm0, xmm0
pshufb  xmm0, xmm7
pextrd  eax, xmm0, 0

P.S. If you are still reading it, be sure to check out comments =)

这篇关于最快的方式来获得字符串IPv4地址的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆