WebSocket data unmasking / multi-byte XOR




The WebSocket spec defines unmasking data as:

j                   = i MOD 4
transformed-octet-i = original-octet-i XOR masking-key-octet-j


where the mask is 4 bytes long and unmasking is applied per byte.
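For reference, the definition above can be sketched as a plain byte loop. This is only a baseline to compare faster versions against; the function name `ws_unmask_scalar` and its parameters are made up for illustration and do not come from the spec.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte-at-a-time unmasking, data[i] ^= key[i MOD 4]. */
static void ws_unmask_scalar(uint8_t *data, size_t len, const uint8_t key[4])
{
    for (size_t i = 0; i < len; i++)
        data[i] ^= key[i & 3];          /* i & 3 is the same as i MOD 4 */
}
```

Because XOR is its own inverse, the same routine can be used for masking and for unmasking.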


Is there a way to do this more efficiently than just looping over the bytes?


The server running the code can be assumed to have a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can write asm as well if necessary.


I tried to look up the solution myself, but was unable to figure out whether there is an appropriate instruction in any of the dozens of SSE1-5/AVX/(whatever extension; I've lost track of the many over the years) instruction sets.



After rereading the spec a couple of times, it seems that it is actually only XORing the data bytes with the mask bytes, which I can do 8 bytes at a time until the last few bytes. The question is still open, as I think there is probably still a way to optimize this using SSE or the like (maybe processing even 16 bytes at a time? letting the processor do the for loop? ...)
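The 8-bytes-at-a-time idea can be sketched roughly as follows. The name `ws_unmask_u64` is hypothetical, and `memcpy` is used for the 64-bit loads and stores so that the code makes no assumption about buffer alignment; compilers typically turn these fixed-size copies into single moves.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: XOR 8 bytes per iteration, then finish the tail bytewise. */
static void ws_unmask_u64(uint8_t *data, size_t len, const uint8_t key[4])
{
    uint8_t k8[8];
    uint64_t mask, word;
    size_t i = 0;

    memcpy(k8, key, 4);                 /* key repeated twice = 8 mask bytes */
    memcpy(k8 + 4, key, 4);
    memcpy(&mask, k8, 8);               /* same byte order in memory as the key */

    for (; i + 8 <= len; i += 8) {
        memcpy(&word, data + i, 8);     /* alignment-safe 64-bit load */
        word ^= mask;
        memcpy(data + i, &word, 8);     /* alignment-safe 64-bit store */
    }
    for (; i < len; i++)                /* i is a multiple of 8 here, so i & 3 */
        data[i] ^= key[i & 3];          /* still selects the right key byte */
}
```

This works because 8 is a multiple of the 4-byte key length, so every 8-byte chunk starts at key byte 0.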



Yes, you can XOR 16 bytes in one instruction using SSE2, or 32 bytes at a time with AVX2 (Haswell and later).


#include <stdint.h>
#include <emmintrin.h>                         // SSE2 intrinsics

__m128i v, v_mask;                             // v_mask = 4 byte masking key repeated 4 times
uint8_t *buff;                                 // buffer - must be 16 byte aligned

for (int i = 0; i < N; i += 16)                // note that N must be multiple of 16
{
    v = _mm_load_si128((__m128i *)&buff[i]);   // load 16 bytes
    v = _mm_xor_si128(v, v_mask);              // XOR with mask
    _mm_store_si128((__m128i *)&buff[i], v);   // store 16 masked bytes
}


#include <stdint.h>
#include <immintrin.h>                            // AVX2 intrinsics

__m256i w, w_mask;                                // w_mask = 4 byte masking key repeated 8 times
uint8_t *buff;                                    // buffer - must be 32 byte aligned (otherwise use
                                                  // _mm256_loadu_si256 / _mm256_storeu_si256)

for (int i = 0; i < N; i += 32)                   // note that N must be multiple of 32
{
    w = _mm256_load_si256((__m256i *)&buff[i]);   // load 32 bytes
    w = _mm256_xor_si256(w, w_mask);              // XOR with mask
    _mm256_store_si256((__m256i *)&buff[i], w);   // store 32 masked bytes
}
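Real payloads are rarely aligned or a multiple of 16 bytes long, so a more complete variant might look like the sketch below. It assumes an x86 target with SSE2; the function name `ws_unmask_sse2` is invented for this example. It uses unaligned loads and stores and falls back to a byte loop for the tail.

```c
#include <emmintrin.h>                  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unmask any length, no alignment requirement. */
static void ws_unmask_sse2(uint8_t *data, size_t len, const uint8_t key[4])
{
    int32_t k32;
    size_t i = 0;

    memcpy(&k32, key, 4);                          /* key bytes as one 32-bit lane */
    const __m128i mask = _mm_set1_epi32(k32);      /* key repeated 4 times */

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i)); /* unaligned load */
        v = _mm_xor_si128(v, mask);                               /* XOR with mask */
        _mm_storeu_si128((__m128i *)(data + i), v);               /* unaligned store */
    }
    for (; i < len; i++)                /* at most 15 remaining bytes */
        data[i] ^= key[i & 3];
}
```

The unaligned intrinsics are used here for simplicity. Whether aligning the main loop or moving to 32-byte AVX2 registers is worth the extra code is something to measure on the actual workload.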

