Near constant time rotate that does not violate the standards


Problem description


I'm having a heck of a time trying to come up with a constant time rotate that does not violate the C/C++ standards.

The problem is the edge/corner cases, where operations are called out in algorithms and those algorithms cannot be changed. For example, the following is from Crypto++ and executes the test harness under GCC ubsan (i.e., g++ -fsanitize=undefined):

$ ./cryptest.exe v | grep runtime
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:643:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:625:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:643:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'

And the code at misc.h:637:

template <class T> inline T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    return T((x<<y) | (x>>(sizeof(T)*8-y)));
}

Intel's ICC was particularly ruthless, and it removed the entire function call without the y %= sizeof(T)*8. We fixed that a few years back, but left the other errata in-place due to lack of a constant time solution.

There's one pain point remaining. When y = 0, I get a condition where 32 - y = 32, and it sets up the undefined behavior. If I add a check for if(y == 0) ..., then the code fails to meet the constant time requirement.
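
For reference, here is a minimal sketch of the branching workaround being ruled out here (illustrative only; rotlModBranch is not part of Crypto++):

template <class T> inline T rotlModBranch(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    if (y == 0)        // avoids the undefined shift by sizeof(T)*8 when y reduces to 0 ...
        return x;
    return T((x<<y) | (x>>(sizeof(T)*8-y)));   // ... but the branch breaks the constant-time requirement
}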

I've looked at a number of other implementations, from the Linux kernel to other cryptographic libraries. They all contain the same undefined behavior, so it appears to be a dead end.

How can I perform the rotate in nearly constant time with a minimum number of instructions?

EDIT: by near constant time, I mean avoid the branch so the same instructions are always executed. I'm not worried about CPU microcode timings. While branch prediction may be great on x86/x64, it may not perform as well on other platforms, like embedded.


None of these tricks would be required if GCC or Clang provided an intrinsic to perform the rotate in near constant time. I'd even settle for "perform the rotate" since they don't even have that.

Solution

I've linked to this answer for the full details from several other "rotate" questions, including this community wiki question, which should be kept up to date with best-practices.

I found a blog post about this issue, and it looks like it's finally a solved problem (with new-enough compiler versions).

John Regehr at the University of Utah recommends version "c" of his attempts at making a rotate function. I replaced his assert with a bitwise AND, and found that it still compiles to a single rotate insn.

typedef uint32_t rotwidth_t;  // parameterize for comparing compiler output with various sizes

rotwidth_t rotl (rotwidth_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1);  // e.g. 31

  assert ( (n<=mask)  &&"rotate by type width or more");
  n &= mask;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x<<n) | (x>>( (-n)&mask ));
}

rotwidth_t rot_const(rotwidth_t x)
{
  return rotl(x, 7);
}

This could be templated on x's type, but it probably makes more sense for real use, to have the width in the function name (like rotl32). Usually when you're rotating, you know what width you want, and that matters more than what size variable you're currently storing the value in.
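
If you did want a generic version anyway, a minimal templated sketch of the same idiom might look like the following (assuming C++11; rotl_any is an illustrative name, not code from the answer):

#include <climits>
#include <type_traits>

template <class T>
T rotl_any(T x, unsigned int n)
{
    static_assert(std::is_unsigned<T>::value, "rotate is only well-defined for unsigned types");
    const unsigned int mask = CHAR_BIT * sizeof(T) - 1;  // e.g. 31 for uint32_t
    n &= mask;                                           // keep the count in range, branch-free
    return (x << n) | (x >> ((-n) & mask));
}

The static_assert here also enforces the unsigned-only requirement discussed next.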

Also make sure to only use this with unsigned types. Right-shift of signed types does an arithmetic shift, shifting in sign-bits. (It's technically implementation-defined behaviour, but everything uses 2's complement now.)

Pabigot independently came up with the same idea before I did, and posted it at github. His version has C++ static_assert checking to make it a compile-time error to use a rotate count outside the range for the type.

I tested mine with gcc.godbolt.org, with NDEBUG defined, for variable and compile-time-const rotate counts:

  • gcc: optimal code with gcc >= 4.9.0, non-branching neg+shifts+or with earlier.
    (compile-time const count: gcc 4.4.7 is fine)
  • clang: optimal code with clang >= 3.5.0, non-branching neg+shifts+or with earlier.
    (compile-time const rotate count: clang 3.0 is fine)
  • icc 13: optimal code.
    (compile-time const count with -march=native: generates slower shld $7, %edi, %edi. Fine without -march=native)

Even newer compiler versions can handle the commonly-given code from wikipedia (included in the godbolt sample) without generating a branch or cmov. John Regehr's version has the advantage of avoiding undefined behaviour when the rotate count is 0.
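
For comparison, the commonly-given form is roughly the following (reproduced from memory as a sketch, not copied from the godbolt sample); it is the one with undefined behaviour at a count of 0:

uint32_t rotl32_naive(uint32_t x, unsigned int n)
{
    return (x << n) | (x >> (32 - n));   // undefined behaviour when n == 0: shift by 32
}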

There are some caveats with 8 and 16 bit rotates, but compilers seem fine with 32 or 64, when n is uint32_t. See the comments in the code on the godbolt link for some notes from my testing various widths of uint*_t. Hopefully this idiom will be better-recognized by all compilers for more combinations of type widths in the future. Sometimes gcc will uselessly emit an AND insn on the rotate count, even though the x86 ISA defines the rotate insns with that exact AND as the first step.
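
As an illustration of the narrow-type caveat, the same masking idiom applied to uint8_t would be something like the sketch below; whether a compiler recognizes it as a single rotate insn varies:

#include <stdint.h>
#include <limits.h>

uint8_t rotl8(uint8_t x, unsigned int n)
{
    const unsigned int mask = CHAR_BIT * sizeof(x) - 1;   // 7
    n &= mask;
    return (uint8_t)((x << n) | (x >> ((-n) & mask)));    // x is promoted to int for the shifts
}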

"optimal" means as efficient as:

# gcc 4.9.2 rotl(unsigned int, unsigned int):
    movl    %edi, %eax
    movl    %esi, %ecx
    roll    %cl, %eax
    ret
# rot_const(unsigned int):
    movl    %edi, %eax
    roll    $7, %eax
    ret

When inlined, the compiler should be able to arrange for values to be in the right registers in the first place, resulting in just a single rotate.

With older compilers, you'll still get ideal code when the rotate count is a compile-time constant. Godbolt lets you test with ARM as a target, and it used a rotate there, too. With variable counts on older compilers, you get a bit of code bloat, but no branches or major performance problems, so this idiom should be safe to use in general.

BTW, I modified John Regehr's original to use CHAR_BIT*sizeof(x), and gcc / clang / icc emit optimal code for uint64_t as well. However, I did notice that changing x to uint64_t while the function return type is still uint32_t makes gcc compile it to shifts/or. So be careful to cast the result to 32bits in a separate sequence point, if you want the low 32b of a 64b rotate. i.e. Assign the result to a 64bit variable, then cast/return it. icc still generates a rotate insn, but gcc and clang don't, for

// generates slow code: cast separately.
uint32_t r = (uint32_t)( (x<<n) | (x>>( -n&(CHAR_BIT*sizeof(x)-1) )) );
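
A sketch of the workaround described above (assign the full-width result to a 64-bit variable first, then truncate), assuming x is uint64_t and n is unsigned int:

// assign to a 64-bit variable first so the compiler still sees the rotate idiom
uint64_t rotated = (x << n) | (x >> ( (-n) & (CHAR_BIT*sizeof(x)-1) ));
uint32_t r = (uint32_t)rotated;   // low 32 bits of the 64-bit rotate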

If anyone can test this with MSVC, it would be useful to know what happens there.
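
One way such a test might be set up (assuming MSVC's _rotl intrinsic from <stdlib.h> as a known-good baseline; this is a suggestion, not something the answer verified) is to compare the codegen of rotl against:

#include <stdlib.h>
#include <stdint.h>

uint32_t rotl32_msvc_baseline(uint32_t x, unsigned int n)
{
    return _rotl(x, (int)n);   // MSVC intrinsic; expected to compile to a single rol
}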
