Is using an unsigned rather than signed int more likely to cause bugs? Why?

Question

In the Google C++ Style Guide, on the topic of "Unsigned Integers", it is suggested that

Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler.

What is wrong with modular arithmetic? Isn't that the expected behaviour of an unsigned int?

What kind of bugs (a significant class) does the guide refer to? Overflowing bugs?

Do not use an unsigned type merely to assert that a variable is non-negative.

One reason I can think of for using signed int over unsigned int is that, if it does overflow (to negative), it is easier to detect.

Solution

Some of the answers here mention the surprising promotion rules between signed and unsigned values, but this seems more like a problem relating to mixing signed and unsigned values, and doesn't necessarily explain why signed is preferred over unsigned, outside of mixing scenarios.

In my experience, outside of mixed comparisons and promotion rules, there are two primary reasons why unsigned values are big bug magnets.

Unsigned values have a discontinuity at zero, the most common value in programming

Both unsigned and signed integers have discontinuities at their minimum and maximum values, where they wrap around (unsigned) or cause undefined behavior (signed). For unsigned these points are at zero and UINT_MAX. For int they are at INT_MIN and INT_MAX. Typical values of INT_MIN and INT_MAX on a system with 4-byte int values are -2^31 and 2^31-1, and on such a system UINT_MAX is typically 2^32-1.

The primary bug-inducing problem with unsigned that doesn't apply to int is that it has a discontinuity at zero. Zero, of course, is a very common value in programs, along with other small values like 1, 2, and 3. It is common to add and subtract small values, especially 1, in various constructs, and if you subtract anything from an unsigned value and it happens to be zero, you just got a massive positive value and an almost certain bug.
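
To make the discontinuity concrete, here is a minimal sketch (a hypothetical snippet, not from the original answer) showing that subtracting 1 from an unsigned zero wraps, while the same operation on a signed zero is ordinary arithmetic:

#include <cstdio>

int main() {
    unsigned int u = 0;
    u -= 1;             // well-defined modular wrap: u is now UINT_MAX
    printf("%u\n", u);  // typically prints 4294967295

    int s = 0;
    s -= 1;             // ordinary arithmetic: s is now -1
    printf("%d\n", s);
    return 0;
}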

Consider code that iterates over all values in a vector by index, except the last[0.5]:

for (size_t i = 0; i < v.size() - 1; i++) {
  // do something
}

This works fine until one day you pass in an empty vector. Instead of doing zero iterations, you get v.size() - 1 == a giant number[1], you'll do 4 billion iterations, and you'll almost certainly have a buffer overflow vulnerability.

You need to write it like this:

for (size_t i = 0; i + 1 < v.size(); i++) {
  // do something
}

So it can be "fixed" in this case, but only by carefully thinking about the unsigned nature of size_t. Sometimes you can't apply the fix above because instead of a constant one you have some variable offset you want to apply, which may be positive or negative: so which "side" of the comparison you need to put it on depends on the signedness - now the code gets really messy.
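
To make that messiness concrete, here is a hedged sketch (the function, its name, and the clamping policy are illustrative assumptions, not from the original answer) of applying a variable, possibly negative offset to an unsigned bound:

#include <cstddef>
#include <vector>

// Iterate over indexes below v.size() + delta, where delta may be negative.
// With unsigned arithmetic you must branch on the sign of delta before
// subtracting; get any branch wrong and the bound wraps to a huge value.
void process(const std::vector<int>& v, std::ptrdiff_t delta) {
    size_t limit;
    if (delta >= 0)
        limit = v.size();  // a positive offset can't extend a valid index range
    else if (static_cast<size_t>(-delta) <= v.size())  // assumes delta != PTRDIFF_MIN
        limit = v.size() - static_cast<size_t>(-delta);
    else
        limit = 0;  // the whole range is offset away: clamp instead of wrapping
    for (size_t i = 0; i < limit; i++) { /* do something with v[i] */ }
}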

There is a similar issue with code that tries to iterate down to and including zero. Something like while (index-- > 0) works fine, but the apparently equivalent while (--index >= 0) will never terminate for an unsigned value. Your compiler might warn you when the right-hand side is a literal zero, but certainly not if it is a value determined at runtime.
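
A small illustrative sketch of the two loop shapes (hypothetical code):

#include <cstddef>

void visit_backwards(size_t n) {
    size_t index = n;
    while (index-- > 0) {
        // Visits n-1, n-2, ..., 0 and then stops: index wraps around only
        // *after* the final test, so the loop terminates correctly.
    }

    // The apparently equivalent form below never terminates for unsigned:
    // an unsigned value is always >= 0, so the condition is always true.
    //
    //   size_t index2 = n;
    //   while (--index2 >= 0) { /* ... */ }
}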

Counterpoint

Some might argue that signed values also have two discontinuities, so why pick on unsigned? The difference is that both discontinuities are very (maximally) far away from zero. I really consider this a separate problem of "overflow": both signed and unsigned values may overflow at very large values. In many cases overflow is impossible due to constraints on the possible range of the values, and overflow of many 64-bit values may be physically impossible. Even if possible, the chance of an overflow-related bug is often minuscule compared to an "at zero" bug, and overflow occurs for unsigned values too. So unsigned combines the worst of both worlds: potential overflow with very large magnitude values, and a discontinuity at zero. Signed has only the former.

Many will argue "you lose a bit" with unsigned. This is often true - but not always (if you need to represent differences between unsigned values you'll lose that bit anyway: so many 32-bit things are limited to 2 GiB anyway, or you'll have a weird grey area where, say, a file can be 4 GiB but you can't use certain APIs on the second 2 GiB half).

Even in the cases where unsigned does buy you a bit, it doesn't buy you much: if you had to support more than 2 billion "things", you'll probably soon have to support more than 4 billion.

Logically, unsigned values are a subset of signed values

Mathematically, unsigned values (non-negative integers) are a subset of signed integers (just called integers)[2]. Yet signed values naturally pop out of operations solely on unsigned values, such as subtraction. We might say that unsigned values aren't closed under subtraction. The same isn't true of signed values.

Want to find the "delta" between two unsigned indexes into a file? Well, you'd better do the subtraction in the right order, or else you'll get the wrong answer. Of course, you often need a runtime check to determine the right order! When dealing with unsigned values as numbers, you'll often find that (logically) signed values keep appearing anyway, so you might as well start off with signed.
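
A hedged sketch of that order-dependent subtraction (the function name and the int64_t/uint64_t types are assumptions for illustration):

#include <cstdint>

// The logically signed "delta" between two unsigned file offsets.
// Assumes the true difference fits in int64_t.
int64_t offset_delta(uint64_t a, uint64_t b) {
    // Unsigned subtraction wraps to a huge value when b > a, so a
    // runtime check is needed to pick the subtraction order.
    return (a >= b) ?  static_cast<int64_t>(a - b)
                    : -static_cast<int64_t>(b - a);
}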

Counterpoint

As mentioned in footnote [2] above, unsigned values in C++ aren't actually a subset of signed values of the same size, so unsigned values can represent the same number of results that signed values can.

True, but the range is less useful. Consider subtraction, and unsigned numbers with a range of 0 to 2N, and signed numbers with a range of -N to N. Arbitrary subtractions produce results in the range -2N to 2N in both cases, and either type of integer can only represent half of it. It turns out that the region centered around zero, -N to N, is usually way more useful (contains more actual results in real-world code) than the range 0 to 2N. Consider any typical distribution other than uniform (log, zipfian, normal, whatever) and consider subtracting randomly selected values from that distribution: way more values end up in [-N, N] than in [0, 2N] (indeed, the resulting distribution is always centered at zero).
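
As an illustrative experiment (a hypothetical simulation; the distribution and its parameters are arbitrary choices, not from the original answer), one can draw pairs of values from a non-uniform distribution on roughly [0, 2N] and count where their differences land:

#include <cstdio>
#include <random>

int main() {
    const double N = 1000.0;
    std::mt19937 rng(42);
    // Values clustered around N, so most draws lie within [0, 2N].
    std::normal_distribution<double> dist(N, N / 4);
    const int total = 100000;
    int in_signed_range = 0, in_unsigned_range = 0;
    for (int i = 0; i < total; i++) {
        double d = dist(rng) - dist(rng);                  // difference of two draws
        if (d >= -N && d <= N) in_signed_range++;          // representable in [-N, N]
        if (d >= 0.0 && d <= 2 * N) in_unsigned_range++;   // representable in [0, 2N]
    }
    printf("in [-N, N]:  %d / %d\n", in_signed_range, total);
    printf("in [0, 2N]:  %d / %d\n", in_unsigned_range, total);
    return 0;
}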

64-bit closes the door on many of the reasons to use unsigned values as numbers

I think the arguments above were already compelling for 32-bit values, but the overflow cases, which affect both signed and unsigned at different thresholds, do occur for 32-bit values, since "2 billion" is a number that can be exceeded by many abstract and physical quantities (billions of dollars, billions of nanoseconds, arrays with billions of elements). So if someone is convinced enough by the doubling of the positive range for unsigned values, they can make the case that overflow does matter and that it slightly favors unsigned.

Outside of specialized domains, 64-bit values largely remove this concern. Signed 64-bit values have an upper range of 9,223,372,036,854,775,807 - more than nine quintillion. That's a lot of nanoseconds (about 292 years' worth), and a lot of money. It's also a larger array than any computer is likely to have RAM for in a coherent address space for a long time. So maybe nine quintillion is enough for everybody (for now)?

When to use unsigned values

Note that the style guide doesn't forbid or even necessarily discourage use of unsigned numbers. It concludes with:

Do not use an unsigned type merely to assert that a variable is non-negative.

Indeed, there are good uses for unsigned variables:

  • When you want to treat an N-bit quantity not as an integer, but simply as a "bag of bits". For example, as a bitmask or bitmap, or N boolean values, or whatever. This use often goes hand-in-hand with the fixed-width types like uint32_t and uint64_t, since you often want to know the exact size of the variable. A hint that a particular variable deserves this treatment is that you only operate on it with the bitwise operators such as ~, |, &, ^, >> and so on, and not with the arithmetic operations such as +, -, *, / etc. (See the sketch after this list.)

    Unsigned is ideal here because the behavior of the bitwise operators is well-defined and standardized. Signed values have several problems, such as undefined and unspecified behavior when shifting, and an unspecified representation.

  • When you actually want modular arithmetic. Sometimes you actually need 2^N modular arithmetic. In these cases "overflow" is a feature, not a bug. Unsigned values give you what you want here, since they are defined to use modular arithmetic. Signed values cannot be (easily, efficiently) used at all, since they have an unspecified representation and overflow is undefined. (Also sketched below.)
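
A brief sketch of both uses (the flag names and the wrapping counter are invented for illustration):

#include <cstdint>

// Bag-of-bits: fixed-width unsigned flags manipulated only with bitwise ops.
constexpr uint32_t FLAG_READ  = 1u << 0;
constexpr uint32_t FLAG_WRITE = 1u << 1;

bool can_write(uint32_t perms) {
    return (perms & FLAG_WRITE) != 0;
}

// Modular arithmetic: a wrapping sequence counter, where the 2^32
// wrap-around on overflow is the desired, well-defined behavior.
uint32_t next_seq(uint32_t seq) {
    return seq + 1;  // wraps to 0 after UINT32_MAX, by definition
}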


[0.5] After I wrote this I realized this is nearly identical to Jarod's example, which I hadn't seen - and for good reason, it's a good example!

[1] We're talking about size_t here, so usually 2^32-1 on a 32-bit system or 2^64-1 on a 64-bit one.

[2] In C++ this isn't exactly the case, because unsigned values contain more values at the upper end than the corresponding signed type, but the basic problem exists that manipulating unsigned values can result in (logically) signed values, while there is no corresponding issue with signed values (since signed values already include unsigned values).
