Memory alignment on modern processors?

Question

I often see code such as the following when, e.g., representing a large bitmap in memory:

#include <stdint.h>
#include <stdlib.h>

size_t width = 1280;
size_t height = 800;
size_t bytesPerPixel = 3;
/* Round each row up to the next multiple of 4 bytes. */
size_t bytewidth = ((width * bytesPerPixel) + 3) & ~(size_t)3;
uint8_t *pixelData = malloc(bytewidth * height);

(that is, a bitmap allocated as a contiguous block of memory having a bytewidth aligned to a certain number of bytes, most commonly 4.)
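
The rounding expression above generalizes to any power-of-two alignment. A minimal sketch, where `align_up` and `row_stride` are hypothetical helper names and not part of the original code:

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of alignment (must be a power of two). */
static size_t align_up(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Bytes per row for a packed image whose rows are padded to `alignment`. */
static size_t row_stride(size_t width, size_t bytes_per_pixel, size_t alignment)
{
    return align_up(width * bytes_per_pixel, alignment);
}
```

With the values from the question, 1280 * 3 = 3840 is already a multiple of 4, so no padding is added; a width of 1281 would give 3843 bytes of pixel data rounded up to 3844.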

The address of the pixel at (x, y) is then given by:

pixelData + (bytewidth * y) + (bytesPerPixel * x)
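
One way to keep that address arithmetic in a single place is a small accessor. This is only a sketch: the `Image` struct and `pixel_at` are illustrative names, with `stride` playing the role of `bytewidth`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *data;
    size_t width, height;
    size_t bytes_per_pixel;
    size_t stride;  /* bytes per row, including any padding */
} Image;

/* Pointer to the first byte of the pixel at column x, row y. */
static uint8_t *pixel_at(const Image *img, size_t x, size_t y)
{
    assert(x < img->width && y < img->height);
    return img->data + img->stride * y + img->bytes_per_pixel * x;
}
```

Because rows are stepped by `stride` rather than `width * bytes_per_pixel`, the padding bytes at the end of each row are never addressed as pixels.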

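This leads me to two questions:

  1. Does aligning a buffer like this have a performance impact on modern processors? Should I be worrying about alignment at all, or will the compiler handle it?
  2. If it does have an impact, could someone point me to a resource for finding the ideal byte alignment for various processors?

Thanks.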

Recommended answer

It depends on a lot of factors. If you're only accessing the pixel data one byte at a time, the alignment will not make any difference the vast majority of the time. For reading/writing one byte of data, most processors won't care at all whether that byte is on a 4-byte boundary or not.

However, if you're accessing data in units larger than a byte (say, in 2-byte or 4-byte units), then alignment starts to matter. On some processors (e.g. many RISC processors), certain unaligned accesses are outright illegal: on some PowerPC implementations, for example, attempting to read a 4-byte word from an address that's not 4-byte aligned can raise an alignment exception.
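
In C, a common way to read or write a multi-byte value at an address that may not be aligned is to go through `memcpy`, which is defined for any alignment; compilers usually lower it to a single load or store where the target allows that. A minimal sketch (`load_u32` and `store_u32` are hypothetical helper names):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from p, which may have any alignment. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 32-bit value to p, which may have any alignment. */
static void store_u32(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```

Casting a misaligned `uint8_t *` to `uint32_t *` and dereferencing it is undefined behavior in C, even on hardware that tolerates the access, so the copy is the portable form.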

On other processors (e.g. x86), accessing unaligned addresses is permitted, but it can come with a hidden performance penalty. The hardware detects the unaligned access. When the requested bytes lie within a single unit that the memory system fetches (a word on older designs, a cache line on current ones), the access costs about the same as an aligned one. When they straddle a boundary, the processor has to fetch two units and assemble the desired 4-byte quantity from the appropriate bytes of each. Fetching two units is slower than fetching one.
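
Whether a given access is aligned, or would span two fetch units, can be checked with plain integer arithmetic. The sketch below assumes that converting a pointer to `uintptr_t` yields its numeric address, which holds on mainstream platforms but is implementation-defined; `is_aligned` and `straddles` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p is a multiple of alignment (which must be a power of two). */
static bool is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* True if the bytes [addr, addr + size) span more than one block of
 * block_size bytes, e.g. two 4-byte words or two 64-byte cache lines. */
static bool straddles(uintptr_t addr, size_t size, size_t block_size)
{
    assert(size != 0 && block_size != 0);
    return addr / block_size != (addr + size - 1) / block_size;
}
```

For instance, a 4-byte read at address 62 touches bytes 62 through 65 and therefore crosses a 64-byte boundary, while the same read at address 60 does not.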

That's just for simple loads and stores, though. Some instructions, such as the aligned load/store forms in the SSE instruction set (e.g. MOVAPS, MOVDQA), require their memory operands to be 16-byte aligned. If you use them on unaligned memory, the processor raises a fault (a general-protection exception on x86), which typically crashes the program. Unaligned counterparts such as MOVUPS and MOVDQU exist for data whose alignment is not guaranteed.
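
When a buffer will be handed to instructions that need aligned operands, one option is to request the alignment at allocation time. A minimal sketch, assuming a C11 standard library that provides `aligned_alloc` (some platforms, such as MSVC, do not); `alloc_aligned` is a hypothetical wrapper name:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate at least `size` bytes aligned to `alignment` (a power of two).
 * Returns NULL on failure; release the memory with free(). */
static void *alloc_aligned(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* aligned_alloc (C11) expects size to be a multiple of alignment. */
    size_t padded = (size + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, padded);
}
```

Note that this aligns only the start of the buffer. For every row to start on a 16-byte boundary as well, the row stride would also need to be a multiple of 16.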

To summarize, I wouldn't really worry too much about alignment unless you're writing super performance-critical code (e.g. in assembly). The compiler helps you out a lot, e.g. by padding structures so that 4-byte quantities are aligned on 4-byte boundaries, and on x86, the CPU also helps you out when dealing with unaligned accesses. Since the pixel data you're dealing with is in quantities of 3 bytes, you'll almost always be doing single-byte accesses anyway.
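
The structure padding mentioned above can be observed with `offsetof` and `_Alignof`. A small sketch using a made-up `struct Sample`; the exact numbers depend on the ABI, so the checks only state the properties the compiler guarantees:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct Sample {
    uint8_t  tag;    /* 1 byte, typically followed by padding */
    uint32_t value;  /* placed at an offset suitable for uint32_t */
};

/* Each member's offset is a multiple of its alignment requirement. */
static_assert(offsetof(struct Sample, value) % _Alignof(uint32_t) == 0,
              "value is aligned within the struct");
/* The struct size is a multiple of its alignment, so array elements stay aligned. */
static_assert(sizeof(struct Sample) % _Alignof(struct Sample) == 0,
              "array elements stay aligned");
```

On common ABIs where `uint32_t` is 4-byte aligned, `value` ends up at offset 4 and the struct occupies 8 bytes, three of which are padding.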

If you decide you instead want to access each pixel with a single 4-byte access (as opposed to three 1-byte accesses), it would be better to use 32-bit pixels and have each individual pixel aligned on a 4-byte boundary. Aligning each row to a 4-byte boundary but not each pixel will have little, if any, effect.
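
A sketch of what the 32-bit layout could look like, storing pixels in a `uint32_t` array so that every pixel is naturally aligned. The `0x00RRGGBB` channel order and the names `pack_rgb` and `set_pixel32` are assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit channels into one 32-bit pixel laid out as 0x00RRGGBB. */
static uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Write one pixel with a single aligned 4-byte store. */
static void set_pixel32(uint32_t *pixels, size_t width, size_t height,
                        size_t x, size_t y, uint32_t color)
{
    assert(x < width && y < height);
    pixels[y * width + x] = color;
}
```

The cost is one unused byte per pixel, a third more memory than packed 24-bit data; in exchange no row padding is needed, since every row length is already a multiple of 4 bytes.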

Based on your code, I'm guessing it's related to reading the Windows bitmap file format -- bitmap files require the length of each scanline to be a multiple of 4 bytes, so a pixel buffer laid out the same way lets you read the entire bitmap into it in one fell swoop (of course, you still have to deal with the fact that the scanlines are normally stored bottom-to-top instead of top-to-bottom and that the pixel data is BGR instead of RGB). This isn't really much of an advantage, though -- it's not that much harder to read in the bitmap one scanline at a time.
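
The row-by-row handling could be sketched as follows, assuming uncompressed 24-bit pixel data that is already in memory, stored bottom-up with rows padded to 4 bytes. Header parsing and file I/O are left out, and `bmp_rows_to_rgb` is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* src: bottom-up rows of BGR triples, each row padded to a multiple of 4 bytes.
 * dst: top-down, tightly packed RGB (width * 3 bytes per row). */
static void bmp_rows_to_rgb(const uint8_t *src, uint8_t *dst,
                            size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);
    size_t src_stride = (width * 3 + 3) & ~(size_t)3;
    for (size_t y = 0; y < height; y++) {
        /* Output row y comes from the y-th row counted from the end of src. */
        const uint8_t *in = src + (height - 1 - y) * src_stride;
        uint8_t *out = dst + y * width * 3;
        for (size_t x = 0; x < width; x++) {
            out[3 * x + 0] = in[3 * x + 2];  /* R */
            out[3 * x + 1] = in[3 * x + 1];  /* G */
            out[3 * x + 2] = in[3 * x + 0];  /* B */
        }
    }
}
```

Since the rows are being copied and reordered anyway, the destination buffer does not need the 4-byte row padding at all.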
