Why are types always a certain size no matter its value?


Question

The actual sizes of types might differ between implementations, but on most, types like unsigned int and float are always 4 bytes. But why does a type always occupy a certain amount of memory no matter its value? For example, if I created the following integer with the value of 255:

int myInt = 255;

Then myInt would occupy 4 bytes with my compiler. However, the actual value, 255, can be represented with only 1 byte, so why would myInt not just occupy 1 byte of memory? Or, to ask more generally: why does a type have only one size associated with it when the space required to represent the value might be smaller than that size?

Recommended answer

The compiler is supposed to produce assembler (and ultimately machine code) for some machine, and generally C++ tries to be sympathetic to that machine.

Being sympathetic to the underlying machine means roughly: making it easy to write C++ code which will map efficiently onto the operations the machine can execute quickly. So, we want to provide access to the data types and operations that are fast and "natural" on our hardware platform.

Concretely, consider a specific machine architecture. Let's take the current Intel x86 family.

The Intel® 64 and IA-32 Architectures Software Developer’s Manual vol 1 (link), section 3.4.1 says:

The 32-bit general-purpose registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:

• Operands for logical and arithmetic operations

• Operands for address calculations

• Memory pointers

So, we want the compiler to use these EAX, EBX etc. registers when it compiles simple C++ integer arithmetic. This means that when I declare an int, it should be something compatible with these registers, so that I can use them efficiently.

The registers are always the same size (here, 32 bits), so my int variables will always be 32 bits as well. I'll use the same layout (little-endian) so that I don't have to do a conversion every time I load a variable value into a register, or store a register back into a variable.
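To make the fixed size visible, here is a minimal sketch. The helper name object_bytes is made up for illustration, and it assumes nothing beyond standard C++: it copies out the raw storage of an int so you can compare a small value with a large one.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: returns the raw bytes of an int, in memory order.
// The vector always has sizeof(int) elements, whatever the value is.
std::vector<unsigned char> object_bytes(int value) {
    std::vector<unsigned char> bytes(sizeof value);
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}
```

On a little-endian platform with a 4-byte int, object_bytes(255) should come back as ff 00 00 00: the three "unused" bytes are still there, because that is the shape the register load expects.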

Using godbolt we can see exactly what the compiler does for some trivial code:

int square(int num) {
    return num * num;
}

compiles (with GCC 8.1 and -fomit-frame-pointer -O3 for simplicity) to:

square(int):
  imul edi, edi
  mov eax, edi
  ret

which means:

  1. the int num parameter was passed in register EDI, meaning it's exactly the size and layout Intel expect for a native register. The function doesn't have to convert anything
  2. the multiplication is a single instruction (imul), which is very fast
  3. returning the result is simply a matter of copying it to another register (the caller expects the result to be put in EAX)


We can add a relevant comparison to show the difference that using a non-native layout makes. The simplest case is storing values in something other than native width.

Using godbolt again, we can compare a simple native multiplication

unsigned mult (unsigned x, unsigned y)
{
    return x*y;
}

mult(unsigned int, unsigned int):
  mov eax, edi
  imul eax, esi
  ret

with the equivalent code for a non-standard width

struct pair {
    unsigned x : 31;
    unsigned y : 31;
};

unsigned mult (pair p)
{
    return p.x*p.y;
}

mult(pair):
  mov eax, edi
  shr rdi, 32
  and eax, 2147483647
  and edi, 2147483647
  imul eax, edi
  ret

All the extra instructions are concerned with converting the input format (two 31-bit unsigned integers) into the format the processor can handle natively. If we wanted to store the result back into a 31-bit value, there would be another one or two instructions to do this.
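The same unpacking can be written out by hand. This is only a sketch: the layout (x in bits 0-30, y in bits 32-62 of one 64-bit word) is an assumption read off the shift and masks in the listing above, since bit-field layout is implementation-defined, and the names pack, mult_packed and kMask31 are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMask31 = 0x7FFFFFFF;  // 2147483647, the mask in the listing

// Assumed layout: x in bits 0-30, y in bits 32-62 of a single 64-bit word.
std::uint64_t pack(std::uint32_t x, std::uint32_t y) {
    assert(x <= kMask31 && y <= kMask31);  // both inputs must fit in 31 bits
    return static_cast<std::uint64_t>(x) | (static_cast<std::uint64_t>(y) << 32);
}

std::uint32_t mult_packed(std::uint64_t p) {
    // Each extraction below corresponds to the shift/mask work in the assembly.
    std::uint32_t x = static_cast<std::uint32_t>(p & kMask31);
    std::uint32_t y = static_cast<std::uint32_t>((p >> 32) & kMask31);
    return x * y;  // only this line is the actual multiplication
}
```

Everything except the last line of mult_packed is conversion work that the plain unsigned version never has to do.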

This extra complexity means you'd only bother with this when the space saving is very important. In this case we're only saving two bits compared to using the native unsigned or uint32_t type, which would have generated much simpler code.

The example above is still fixed-width values rather than variable-width, but the width (and alignment) no longer match the native registers.

The x86 platform has several native sizes, including 8-bit and 16-bit in addition to the main 32-bit (I'm glossing over 64-bit mode and various other things for simplicity).

These types (char, int8_t, uint8_t, int16_t etc.) are also directly supported by the architecture - partly for backwards compatibility with the older 8086/286/386 etc. instruction sets.

It's certainly the case that choosing the smallest natural fixed-size type that will suffice can be good practice - they're still quick, single-instruction loads and stores, you still get full-speed native arithmetic, and you can even improve performance by reducing cache misses.
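As a rough illustration of the space side of this, assuming a platform that provides the exact-width types from <cstdint>: the element count kCount and the function sum are hypothetical names chosen for the sketch.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCount = 1024;  // arbitrary sample size

// Same number of elements, a quarter of the memory (and cache footprint).
static_assert(sizeof(std::uint8_t[kCount]) == kCount, "one byte per element");
static_assert(sizeof(std::uint32_t[kCount]) == 4 * kCount, "four bytes per element");

// Each element is still read with an ordinary fixed-width load; only the
// accumulator is wider, so the total does not wrap at 255.
std::uint32_t sum(const std::uint8_t* data, std::size_t n) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The narrow type is used for storage and the wider one for arithmetic, which keeps both the layout and the loop simple.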

This is very different to variable-length encoding - I've worked with some of these, and they're horrible. Every load becomes a loop instead of a single instruction. Every store is also a loop. Every structure is variable-length, so you can't use arrays naturally.
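To show what "every load becomes a loop" looks like, here is a sketch of one common variable-length scheme, a base-128 varint (7 data bits per byte, top bit set when more bytes follow). It is not necessarily the encoding I worked with, and the function names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends value to out, 7 data bits per byte; the top bit means "more bytes follow".
void encode_varint(std::uint32_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Reads one value starting at pos and advances pos past it.
// Assumes well-formed input: no bounds or overflow checks in this sketch.
std::uint32_t decode_varint(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint32_t value = 0;
    int shift = 0;
    while (true) {
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            break;  // last byte of this value
        }
        shift += 7;
    }
    return value;
}
```

Note that there is no way to jump straight to the n-th value: you have to decode every value before it to find where it starts, which is why arrays stop working naturally.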

In subsequent comments, you've been using the word "efficient", as far as I can tell with respect to storage size. We do sometimes choose to minimize storage size - it can be important when we're saving very large numbers of values to files, or sending them over a network. The trade-off is that we need to load those values into registers to do anything with them, and performing the conversion isn't free.

When we discuss efficiency, we need to know what we're optimizing, and what the trade-offs are. Using non-native storage types is one way to trade processing speed for space, and sometimes makes sense. Using variable-length storage (for arithmetic types at least) trades more processing speed (and code complexity and developer time) for an often-minimal further saving of space.

The speed penalty you pay for this means it's only worthwhile when you need to absolutely minimize bandwidth or long-term storage, and for those cases it's usually easier to use a simple and natural format - and then just compress it with a general-purpose system (like zip, gzip, bzip2, xz or whatever).

Each platform has one architecture, but you can come up with an essentially unlimited number of different ways to represent data. It's not reasonable for any language to provide an unlimited number of built-in data types. So, C++ provides implicit access to the platform's native, natural set of data types, and allows you to code any other (non-native) representation yourself.
