Efficient unsigned-to-signed cast avoiding implementation-defined behavior


Problem description


I want to define a function that takes an unsigned int as argument and returns an int congruent modulo UINT_MAX+1 to the argument.

A first attempt might look like this:

int unsigned_to_signed(unsigned n)
{
    return static_cast<int>(n);
}

But as any language lawyer knows, casting from unsigned to signed for values larger than INT_MAX is implementation-defined.

I want to implement this such that (a) it only relies on behavior mandated by the spec; and (b) it compiles into a no-op on any modern machine and optimizing compiler.

As for bizarre machines... If there is no signed int congruent modulo UINT_MAX+1 to the unsigned int, let's say I want to throw an exception. If there is more than one (I am not sure this is possible), let's say I want the largest one.

OK, second attempt:

int unsigned_to_signed(unsigned n)
{
    int int_n = static_cast<int>(n);

    if (n == static_cast<unsigned>(int_n))
        return int_n;

    // else do something long and complicated
}

I do not much care about the efficiency when I am not on a typical twos-complement system, since in my humble opinion that is unlikely. And if my code becomes a bottleneck on the omnipresent sign-magnitude systems of 2050, well, I bet someone can figure that out and optimize it then.

Now, this second attempt is pretty close to what I want. Although the cast to int is implementation-defined for some inputs, the cast back to unsigned is guaranteed by the standard to preserve the value modulo UINT_MAX+1. So the conditional does check exactly what I want, and it will compile into nothing on any system I am likely to encounter.
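The guarantee the conditional relies on can be sketched in isolation. This is a minimal illustration, not part of the original question; the helper names `to_unsigned`, `is_congruent`, and `check_congruence` are hypothetical. It only uses the signed-to-unsigned direction, which the standard fully specifies.

```cpp
#include <cassert>
#include <climits>

// Signed -> unsigned conversion is fully specified: the result is the unique
// unsigned value congruent to v modulo UINT_MAX + 1.
unsigned to_unsigned(int v)
{
    return static_cast<unsigned>(v);
}

// Hypothetical helper: true when candidate is congruent to n
// modulo UINT_MAX + 1. No unsigned -> signed cast is involved.
bool is_congruent(unsigned n, int candidate)
{
    return n == to_unsigned(candidate);
}

// A few checks that hold on every conforming implementation.
void check_congruence()
{
    assert(to_unsigned(-1) == UINT_MAX);  // -1 + (UINT_MAX + 1)
    assert(is_congruent(UINT_MAX, -1));
    assert(is_congruent(5u, 5));
    assert(!is_congruent(5u, 6));
}
```

The open problem in the question is therefore not checking a candidate int, but producing one without performing the implementation-defined cast first.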

However... I am still casting to int without first checking whether it will invoke implementation-defined behavior. On some hypothetical system in 2050 it could do who-knows-what. So let's say I want to avoid that.

Question: What should my "third attempt" look like?

To recap, I want to:

  • Cast from unsigned int to signed int
  • Preserve the value mod UINT_MAX+1
  • Invoke only standard-mandated behavior
  • Compile into a no-op on a typical twos-complement machine with optimizing compiler

[Update]

Let me give an example to show why this is not a trivial question.

Consider a hypothetical C++ implementation with the following properties:

  • sizeof(int) equals 4
  • sizeof(unsigned) equals 4
  • INT_MAX equals 32767
  • INT_MIN equals -2^32 + 32768
  • UINT_MAX equals 2^32 - 1
  • Arithmetic on int is modulo 2^32 (into the range INT_MIN through INT_MAX)
  • std::numeric_limits<int>::is_modulo is true
  • Casting unsigned n to int preserves the value for 0 <= n <= 32767 and yields zero otherwise

On this hypothetical implementation, there is exactly one int value congruent (mod UINT_MAX+1) to each unsigned value. So my question would be well-defined.

I claim that this hypothetical C++ implementation fully conforms to the C++98, C++03, and C++11 specifications. I admit I have not memorized every word of all of them... But I believe I have read the relevant sections carefully. So if you want me to accept your answer, you either must (a) cite a spec that rules out this hypothetical implementation or (b) handle it correctly.

Indeed, a correct answer must handle every hypothetical implementation permitted by the standard. That is what "invoke only standard-mandated behavior" means, by definition.

Incidentally, note that std::numeric_limits<int>::is_modulo is utterly useless here for multiple reasons. For one thing, it can be true even if unsigned-to-signed casts do not work for large unsigned values. For another, it can be true even on one's-complement or sign-magnitude systems, if arithmetic is simply modulo the entire integer range. And so on. If your answer depends on is_modulo, it's wrong.

[Update 2]

hvd's answer taught me something: My hypothetical C++ implementation for integers is not permitted by modern C. The C99 and C11 standards are very specific about the representation of signed integers; indeed, they only permit twos-complement, ones-complement, and sign-magnitude (section 6.2.6.2, paragraph 2).

But C++ is not C. As it turns out, this fact lies at the very heart of my question.

The original C++98 standard was based on the much older C89, which says (section 3.1.2.5):

For each of the signed integer types, there is a corresponding (but different) unsigned integer type (designated with the keyword unsigned) that uses the same amount of storage (including sign information) and has the same alignment requirements. The range of nonnegative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the representation of the same value in each type is the same.

C89 says nothing about only having one sign bit or only allowing twos-complement/ones-complement/sign-magnitude.

The C++98 standard adopted this language nearly verbatim (section 3.9.1 paragraph (3)):

For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int", each of which occupies the same amount of storage and has the same alignment requirements (3.9) as the corresponding signed integer type ; that is, each signed integer type has the same object representation as its corresponding unsigned integer type. The range of nonnegative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the value representation of each corresponding signed/unsigned type shall be the same.

The C++03 standard uses essentially identical language, as does C++11.

No standard C++ spec constrains its signed integer representations to any C spec, as far as I can tell. And there is nothing mandating a single sign bit or anything of the kind. All it says is that non-negative signed integers must be a subrange of the corresponding unsigned.

So, again I claim that INT_MAX=32767 with INT_MIN=-2^32+32768 is permitted. If your answer assumes otherwise, it is incorrect unless you cite a C++ standard proving me wrong.

Solution

Expanding on user71404's answer:

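#include <climits>  // INT_MAX, INT_MIN

int f(unsigned x)
{
    // Values up to INT_MAX are representable in int, so this cast preserves the value.
    if (x <= INT_MAX)
        return static_cast<int>(x);

    // INT_MIN is converted to unsigned here; x - INT_MIN then fits in int.
    if (x >= INT_MIN)
        return static_cast<int>(x - INT_MIN) + INT_MIN;

    throw x; // Or whatever else you like
}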

If x >= INT_MIN (keep the promotion rules in mind, INT_MIN gets converted to unsigned), then x - INT_MIN <= INT_MAX, so this won't have any overflow.

If that is not obvious, take a look at the claim "If x >= -4u, then x + 4 <= 3.", and keep in mind that INT_MAX will be equal to at least the mathematical value of -INT_MIN - 1.
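The small-number analogy can be checked directly. The following is a sketch only; the helper names `offset_from_minus_four`, `offset_from_int_min`, and `check_offsets` are hypothetical and do not appear in the answer. It writes the conversion of the signed constant to unsigned explicitly, which is what the usual arithmetic conversions do implicitly in `x - INT_MIN`.

```cpp
#include <cassert>
#include <climits>

// The analogue from the text: with -4 in place of INT_MIN,
// x - (unsigned)(-4) equals x + 4 modulo UINT_MAX + 1.
unsigned offset_from_minus_four(unsigned x)
{
    return x - static_cast<unsigned>(-4);
}

// Distance of x above (unsigned)INT_MIN, in wrapping unsigned arithmetic.
unsigned offset_from_int_min(unsigned x)
{
    return x - static_cast<unsigned>(INT_MIN);
}

void check_offsets()
{
    // (unsigned)(-4) is UINT_MAX - 3, so the top four values map to 0..3.
    assert(offset_from_minus_four(UINT_MAX - 3u) == 0u);
    assert(offset_from_minus_four(UINT_MAX) == 3u);

    // The same pattern with INT_MIN: the largest possible offset fits in int.
    assert(offset_from_int_min(static_cast<unsigned>(INT_MIN)) == 0u);
    assert(offset_from_int_min(UINT_MAX) <= static_cast<unsigned>(INT_MAX));
}
```

Because the offset never exceeds INT_MAX, the cast of that offset to int is value-preserving, and adding INT_MIN back cannot overflow.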

On the most common systems, where !(x <= INT_MAX) implies x >= INT_MIN, the optimizer should be able (and on my system, is able) to remove the second check, determine that the two return statements can be compiled to the same code, and remove the first check too. Generated assembly listing:

__Z1fj:
LFB6:
    .cfi_startproc
    movl    4(%esp), %eax
    ret
    .cfi_endproc

The hypothetical implementation in your question:

  • INT_MAX equals 32767
  • INT_MIN equals -2^32 + 32768

is not possible, so it does not need special consideration. INT_MIN will be equal to either -INT_MAX, or to -INT_MAX - 1. This follows from C's representation of integer types (6.2.6.2), which requires n bits to be value bits, one bit to be a sign bit, and only allows one single trap representation (not including representations that are invalid because of padding bits), namely the one that would otherwise represent negative zero / -INT_MAX - 1. C++ doesn't allow any integer representations beyond what C allows.

Update: Microsoft's compiler apparently does not notice that x > 10 and x >= 11 test the same thing. It only generates the desired code if x >= INT_MIN is replaced with x > INT_MIN - 1u, which it can detect as the negation of x <= INT_MAX (on this platform).
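A sketch of the variant described in that update, with the comparison rewritten as `x > INT_MIN - 1u`. The name `f_alt` is hypothetical, and the conversions of INT_MAX and INT_MIN to unsigned are written out explicitly here for readability; whether a particular compiler reduces this to a no-op is not something this sketch verifies.

```cpp
#include <cassert>
#include <climits>

int f_alt(unsigned x)
{
    if (x <= static_cast<unsigned>(INT_MAX))
        return static_cast<int>(x);

    // For unsigned operands, x > m - 1 is the same test as x >= m whenever
    // m != 0, and (unsigned)INT_MIN is never zero.
    if (x > static_cast<unsigned>(INT_MIN) - 1u)
        return static_cast<int>(x - static_cast<unsigned>(INT_MIN)) + INT_MIN;

    throw x; // No int is congruent to x on this implementation.
}

void check_f_alt()
{
    assert(f_alt(0u) == 0);
    assert(f_alt(static_cast<unsigned>(INT_MAX)) == INT_MAX);
    assert(f_alt(UINT_MAX) == -1);
    assert(f_alt(static_cast<unsigned>(INT_MIN)) == INT_MIN);
}
```

On a typical twos-complement platform the second condition is exactly the negation of the first, which is the property the update says the optimizer needs to recognize.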

[Update from questioner (Nemo), elaborating on our discussion below]

I now believe this answer works in all cases, but for complicated reasons. I am likely to award the bounty to this solution, but I want to capture all the gory details in case anybody cares.

Let's start with C++11, section 18.3.3:

Table 31 describes the header <climits>.

...

The contents are the same as the Standard C library header <limits.h>.

Here, "Standard C" means C99, whose specification severely constrains the representation of signed integers. They are just like unsigned integers, but with one bit dedicated to "sign" and zero or more bits dedicated to "padding". The padding bits do not contribute to the value of the integer, and the sign bit contributes only as twos-complement, ones-complement, or sign-magnitude.

Since C++11 inherits the <climits> macros from C99, INT_MIN is either -INT_MAX or -INT_MAX-1, and hvd's code is guaranteed to work. (Note that, due to the padding, INT_MAX could be much less than UINT_MAX/2... But thanks to the way signed->unsigned casts work, this answer handles that fine.)

C++03/C++98 is trickier. It uses the same wording to inherit <climits> from "Standard C", but now "Standard C" means C89/C90.

All of these -- C++98, C++03, C89/C90 -- have the wording I give in my question, but also include this (C++03 section 3.9.1 paragraph 7):

The representations of integral types shall define values by use of a pure binary numeration system.(44) [Example: this International Standard permits 2’s complement, 1’s complement and signed magnitude representations for integral types.]

Footnote (44) defines "pure binary numeration system":

A positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral power of 2, except perhaps for the bit with the highest position.

What is interesting about this wording is that it contradicts itself, because the definition of "pure binary numeration system" does not permit a sign/magnitude representation! It does allow the high bit to have, say, the value -2^(n-1) (twos complement) or -(2^(n-1) - 1) (ones complement). But there is no value for the high bit that results in sign/magnitude.

Anyway, my "hypothetical implementation" does not qualify as "pure binary" under this definition, so it is ruled out.

However, the fact that the high bit is special means we can imagine it contributing any value at all: A small positive value, huge positive value, small negative value, or huge negative value. (If the sign bit can contribute -(2^(n-1) - 1), why not -(2^(n-1) - 2)? etc.)

So, let's imagine a signed integer representation that assigns a wacky value to the "sign" bit.

A small positive value for the sign bit would result in a positive range for int (possibly as large as unsigned), and hvd's code handles that just fine.

A huge positive value for the sign bit would result in int having a maximum larger than unsigned, which is forbidden.

A huge negative value for the sign bit would result in int representing a non-contiguous range of values, and other wording in the spec rules that out.

Finally, how about a sign bit that contributes a small negative quantity? Could we have a 1 in the "sign bit" contribute, say, -37 to the value of the int? So then INT_MAX would be (say) 2^31 - 1 and INT_MIN would be -37?

This would result in some numbers having two representations... But ones-complement gives two representations to zero, and that is allowed according to the "Example". Nowhere does the spec say that zero is the only integer that might have two representations. So I think this new hypothetical is allowed by the spec.

Indeed, any negative value from -1 down to -INT_MAX-1 appears to be permissible as a value for the "sign bit", but nothing smaller (lest the range be non-contiguous). In other words, INT_MIN might be anything from -INT_MAX-1 to -1.

Now, guess what? For the second cast in hvd's code to avoid implementation-defined behavior, we just need x - (unsigned)INT_MIN less than or equal to INT_MAX. We just showed INT_MIN is at least -INT_MAX-1. Obviously, x is at most UINT_MAX. Casting a negative number to unsigned is the same as adding UINT_MAX+1. Put it all together:

x - (unsigned)INT_MIN <= INT_MAX

holds for every x if and only if it holds for the largest x, namely UINT_MAX:

UINT_MAX - (INT_MIN + UINT_MAX + 1) <= INT_MAX
-INT_MIN-1 <= INT_MAX
-INT_MIN <= INT_MAX+1
INT_MIN >= -INT_MAX-1

That last is what we just showed, so even in this perverse case, the code actually works.
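The chain of inequalities can also be stated as compile-time checks. This is only a sketch: it assumes a C++11 compiler (for `static_assert` and `constexpr`, which the C++98/C++03 part of the discussion does not have), and the names `worst_case_offset` and `check_worst_case` are hypothetical.

```cpp
#include <cassert>
#include <climits>

// The premise derived above: INT_MIN is no smaller than -INT_MAX - 1.
static_assert(INT_MIN >= -INT_MAX - 1, "INT_MIN is below -INT_MAX - 1");

// Largest value the second branch can ever pass to static_cast<int>:
// x is at most UINT_MAX, and (unsigned)INT_MIN is INT_MIN + UINT_MAX + 1.
constexpr unsigned worst_case_offset()
{
    return UINT_MAX - static_cast<unsigned>(INT_MIN);
}

// The conclusion: even the worst case is representable in int.
static_assert(worst_case_offset() <= static_cast<unsigned>(INT_MAX),
              "second cast could exceed INT_MAX");

void check_worst_case()
{
    // Mathematically the worst case is -INT_MIN - 1, which does not
    // overflow int given the premise above.
    assert(worst_case_offset() == static_cast<unsigned>(-1 - INT_MIN));
}
```

If either assertion failed on some implementation, the second cast in the answer's code would no longer be guaranteed to preserve the value there.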

That exhausts all of the possibilities, thus ending this extremely academic exercise.

Bottom line: There is some seriously under-specified behavior for signed integers in C89/C90 that got inherited by C++98/C++03. It is fixed in C99, and C++11 indirectly inherits the fix by incorporating <limits.h> from C99. But even C++11 retains the self-contradictory "pure binary representation" wording...
