什么是“vperm v0,v0,v0,v17”与未使用的V0做? [英] What does "vperm v0,v0,v0,v17" with unused v0 do?

查看:296
本文介绍了什么是“vperm v0,v0,v0,v17”与未使用的V0做?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在制作

解决方案

乍一看,你已经完成了所有繁重的工作,截图看起来很像它将成为LE endian交换部分。我假设你在这里发现。我期望v17是掩码变量 - 它从早先的TOC中加载为vs49。



缺少的关键信息是v0是vs32(无尽地令人困惑的我知道)。我不确定在哪里展示这个最好的地方,但ABI会做。你可以在这里下载: https://members.openpowerfoundation.org/document/dl/576



图2-17。向量寄存器作为VSR的一部分在第44页应该有助于说明我的意思,这就是它在硬件中的样子。


I'm working on an SHA-256 implementation using Power8 built-ins. The performance is off a bit. I estimate it is off by about 2 cycles per byte (cpb).

The C/C++ code to perform SHA on a block looks like so:

// Schedule 64-byte message
SHA256_SCHEDULE(W, data);

uint32x4_p8 a = abcd, e = efgh;
uint32x4_p8 b = VectorShiftLeft<4>(a);
uint32x4_p8 f = VectorShiftLeft<4>(e);
uint32x4_p8 c = VectorShiftLeft<4>(b);
uint32x4_p8 g = VectorShiftLeft<4>(f);
uint32x4_p8 d = VectorShiftLeft<4>(c);
uint32x4_p8 h = VectorShiftLeft<4>(g);

for (unsigned int i=0; i<64; i+=4)
{
    const uint32x4_p8 k = VectorLoad32x4u(K, i*4);
    const uint32x4_p8 w = VectorLoad32x4u(W, i*4);
    SHA256_ROUND<0>(w,k, a,b,c,d,e,f,g,h);
    SHA256_ROUND<1>(w,k, a,b,c,d,e,f,g,h);
    SHA256_ROUND<2>(w,k, a,b,c,d,e,f,g,h);
    SHA256_ROUND<3>(w,k, a,b,c,d,e,f,g,h);
}

I compile the program with GCC using -O3 and -mcpu=power8 on a ppc64-le machine. When I look at the disassembly I see a several of these:

...
10000b0c:   a6 03 09 7d     mtctr   r8
10000b10:   57 02 00 f0     xxswapd vs32,vs32
10000b14:   6b 04 00 10     vperm   v0,v0,v0,v17
10000b18:   57 02 00 f0     xxswapd vs32,vs32
10000b1c:   99 57 00 7c     stxvd2x vs32,0,r10
10000b20:   99 26 0c 7c     lxvd2x  vs32,r12,r4
10000b24:   57 02 00 f0     xxswapd vs32,vs32
10000b28:   6b 04 00 10     vperm   v0,v0,v0,v17
10000b2c:   57 02 00 f0     xxswapd vs32,vs32
10000b30:   99 67 0a 7c     stxvd2x vs32,r10,r12
10000b34:   99 26 0b 7c     lxvd2x  vs32,r11,r4
10000b38:   57 02 00 f0     xxswapd vs32,vs32
10000b3c:   6b 04 00 10     vperm   v0,v0,v0,v17
10000b40:   57 02 00 f0     xxswapd vs32,vs32
10000b44:   99 5f 0a 7c     stxvd2x vs32,r10,r11
10000b48:   99 26 05 7c     lxvd2x  vs32,r5,r4
10000b4c:   57 02 00 f0     xxswapd vs32,vs32
10000b50:   6b 04 00 10     vperm   v0,v0,v0,v17
10000b54:   57 02 00 f0     xxswapd vs32,vs32
10000b58:   99 2f 0a 7c     stxvd2x vs32,r10,r5
...

The vperm v0,v0,v0,v17 seem like dead instructions because v0 is not used after the permutation.

What does vperm v0,v0,v0,v17 do?


The C++ source code is available at sha256-p8.cxx.

The source file was compiled with g++ -g3 -O3 -Wall -DTEST_MAIN -mcpu=power8 sha256-2-p8.cxx -o sha256-2-p8.exe.

The complete disassembly s available at PPC64 SHA-256 disassembly.


I think the fragment above is being produced by SHA256_SCHEDULE. I see the collection of VectorShiftLeft (vsldoi) after the block in question.

To zero in even more I'm fairly certain it is the endian-swapper for the first 16-words:

const uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
for (unsigned int i=0; i<16; i+=4)
    VectorStore32x4u(VectorPermute32x4(VectorLoad32x4u(data, i*4), mask), W, i*4);

SHA256_SCHEDULE looks like so:

// +2 because Schedule reads beyond the last element
void SHA256_SCHEDULE(uint32_t W[64+2], const uint8_t* data)
{
#if (__LITTLE_ENDIAN__)
    const uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
    for (unsigned int i=0; i<16; i+=4)
        VectorStore32x4u(VectorPermute32x4(VectorLoad32x4u(data, i*4), mask), W, i*4);
#else
    for (unsigned int i=0; i<16; i+=4)
        VectorStore32x4u(VectorLoad32x4u(data, i*4), W, i*4);
#endif

    // At i=62, W[i-2] reads the 65th and 66th elements. W[] has 2 extra "don't care" elements.
    for (unsigned int i = 16; i < 64; i+=2)
    {
        const uint32x4_p8 s0 = Vector_sigma0(VectorLoad32x4u(W, (i-15)*4));
        const uint32x4_p8 w0 = VectorLoad32x4u(W, (i-16)*4);
        const uint32x4_p8 s1 = Vector_sigma1(VectorLoad32x4u(W, (i-2)*4));
        const uint32x4_p8 w1 = VectorLoad32x4u(W, (i-7)*4);

        const uint32x4_p8 r = vec_add(s1, vec_add(w1, vec_add(s0, w0)));
        VectorStore32x4u(r, W, i*4);
    }
}


Here is an image of the section in question with v0 highlighted.

解决方案

At first glance you've done all the heavy lifting, that screenshot looks a lot like its going to be the LE endian swapper section. I'm assuming you're spot on here. I expect that v17 is the mask variable - it's loaded as vs49 from the TOC earlier on.

The key piece of information you're missing is that v0 is vs32 (endlessly confusing I know). I'm not sure where the best place to demonstrate this is but the ABI will do. You can download it here: https://members.openpowerfoundation.org/document/dl/576.

Figure 2-17. Vector Registers as Part of VSRs on page 44 should help to illustrate what I mean, this is how it is in the hardware.

这篇关于什么是“vperm v0,v0,v0,v17”与未使用的V0做?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆