Unaligned memory accesses

Question

I'm working on an embedded device that does not support unaligned memory accesses.

For a video decoder I have to process pixels (one byte per pixel) in 8x8 pixel blocks. The device has some SIMD processing capabilities that allow me to work on 4 bytes in parallel.

The problem is that the 8x8 pixel blocks aren't guaranteed to start on an aligned address, and the functions need to read/write up to three of these 8x8 blocks.

How would you approach this if you want very good performance? After a bit of thinking I came up with the following three ideas:

  1. Do all memory accesses as bytes. This is the easiest way to do it, but it is slow and does not work well with the SIMD capabilities (it's what I currently do in my reference C code).

  2. Write four copy functions (one for each alignment case) that load the pixel data via two 32-bit reads, shift the bits into the correct position and write the data to some aligned chunk of scratch memory. The video processing functions can then use 32-bit accesses and SIMD. Drawback: the CPU will have no chance to hide the memory latency behind the processing.

  3. Same idea as above, but instead of writing the pixels to scratch memory, do the video processing in place. This may be the fastest way, but the number of functions that I would have to write for this approach is high (around 60, I guess).
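To make idea 2 concrete, here is a minimal C sketch of the load-and-shift step for a single 8-pixel row. It is only an illustration, not code for any particular device: the name `load_row8` is made up, the byte order is assumed to be little-endian, and the frame buffer is assumed to be word-aligned storage that is padded so the enclosing aligned words can always be read. Each output word is assembled from two aligned 32-bit reads, so a whole row touches three aligned words. In assembler the `if` would turn into the four specialised copy functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: copy one 8-pixel row from a possibly unaligned
 * source into two aligned 32-bit words, using aligned loads only.
 * Assumes little-endian byte order and readable padding around the row. */
static void load_row8(const uint8_t *src, uint32_t dst[2])
{
    uintptr_t addr = (uintptr_t)src;
    unsigned offset = (unsigned)(addr & 3u);        /* 0..3 bytes past alignment */
    const uint32_t *p = (const uint32_t *)(addr - offset);

    assert(((uintptr_t)p & 3u) == 0);               /* all loads below are aligned */

    if (offset == 0) {                              /* already aligned: plain copy */
        dst[0] = p[0];
        dst[1] = p[1];
        return;
    }

    unsigned rshift = 8u * offset;                  /* 8, 16 or 24 */
    unsigned lshift = 32u - rshift;                 /* 24, 16 or 8 */
    uint32_t w0 = p[0], w1 = p[1], w2 = p[2];

    /* Each output word combines the tail of one aligned word
     * with the head of the next one. */
    dst[0] = (w0 >> rshift) | (w1 << lshift);
    dst[1] = (w1 >> rshift) | (w2 << lshift);
}
```

Calling `load_row8(block + y * stride, scratch + 2 * y)` for `y` from 0 to 7 would fill a 64-byte aligned scratch block; `block`, `stride` and `scratch` are placeholder names here.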

Btw: I will have to write all functions in assembler because the compiler generates horrible code when it comes to the SIMD extension.

Which road would you take, or do you have another idea how to approach this?

Recommended answer

You should first break your code into fetch and processing sections.

The fetch code should copy into a working buffer and have separate cases for memory that is aligned (where you should be able to copy using the SIMD registers) and for unaligned memory, where you need to copy byte by byte (if your platform can't do unaligned access, and your source/dest have different alignments, then this is the best you can do).
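A rough C sketch of such a fetch routine might look like the following. The names `WorkBlock` and `fetch_block` and the `stride` parameter (bytes per frame row) are assumptions made for the example, and plain 32-bit word copies stand in for whatever SIMD register copy the device offers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_DIM = 8 };

/* Aligned working buffer for one 8x8 block, viewable as bytes or words. */
typedef union {
    uint32_t words[BLOCK_DIM * BLOCK_DIM / 4];
    uint8_t  bytes[BLOCK_DIM * BLOCK_DIM];
} WorkBlock;

/* Hypothetical fetch step: copy an 8x8 block out of the frame into `work`.
 * Word copies are used when every row is 4-byte aligned; otherwise the
 * block is copied byte by byte. */
static void fetch_block(const uint8_t *src, size_t stride, WorkBlock *work)
{
    /* Every row is aligned only if both the start address and the stride are. */
    int aligned = ((((uintptr_t)src) | (uintptr_t)stride) & 3u) == 0;

    assert(work != NULL);

    for (int y = 0; y < BLOCK_DIM; y++) {
        const uint8_t *row = src + (size_t)y * stride;
        if (aligned) {
            const uint32_t *w = (const uint32_t *)row;   /* aligned fast path */
            work->words[2 * y]     = w[0];
            work->words[2 * y + 1] = w[1];
        } else {
            for (int x = 0; x < BLOCK_DIM; x++)          /* byte-wise fallback */
                work->bytes[BLOCK_DIM * y + x] = row[x];
        }
    }
}
```

The processing code can then read `work->words` with aligned 32-bit accesses regardless of where the block sat in the frame.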

Your processing code can then use SIMD with the guarantee of working on aligned data. For any real degree of processing, doing a copy + process will definitely be faster than non-SIMD operations on unaligned data.

Assuming your source and dest are the same, a further optimization would be to only use the working buffer if the source is unaligned, and do the processing in place if the memory is aligned. The benefits of this will depend upon the characteristics of your data.
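As a sketch of that dispatch, assuming source and destination are the same block: `process_row` below is just a stand-in kernel (it inverts 8 pixels using two word operations) and `process_block` is a made-up name, so treat this as an outline of the control flow rather than real decoder code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the real SIMD kernel: inverts 8 pixels held in two
 * aligned 32-bit words. */
static void process_row(uint32_t *row)
{
    assert(((uintptr_t)row & 3u) == 0);
    row[0] = ~row[0];
    row[1] = ~row[1];
}

/* Hypothetical dispatcher: process an 8x8 block in place when all rows
 * are aligned, otherwise go through a small aligned working buffer. */
static void process_block(uint8_t *pixels, size_t stride)
{
    if (((((uintptr_t)pixels) | (uintptr_t)stride) & 3u) == 0) {
        for (int y = 0; y < 8; y++)                      /* aligned: in place */
            process_row((uint32_t *)(pixels + (size_t)y * stride));
        return;
    }

    union { uint32_t words[2]; uint8_t bytes[8]; } tmp;  /* aligned scratch row */
    for (int y = 0; y < 8; y++) {
        uint8_t *row = pixels + (size_t)y * stride;
        for (int x = 0; x < 8; x++) tmp.bytes[x] = row[x];   /* fetch */
        process_row(tmp.words);                              /* process */
        for (int x = 0; x < 8; x++) row[x] = tmp.bytes[x];   /* write back */
    }
}
```

How much the in-place path saves depends on how often blocks actually turn out to be aligned in your data.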

Depending on your architecture, you may get further benefits by prefetching data before processing. This is where you issue instructions to fetch areas of memory into the cache before they're needed, so you would issue a fetch for the next block before processing the current one.
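The loop structure could look roughly like this. The sketch assumes a GCC- or Clang-style compiler that provides `__builtin_prefetch`; whether your device and toolchain have a prefetch instruction at all is something to check, and in hand-written assembler you would issue that instruction directly. `sum_block` is a placeholder for the real per-block work, and the blocks are assumed to be stored back to back as 64 bytes each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Prefetch hint where the compiler offers one, otherwise a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

enum { BLOCK_BYTES = 64 };   /* one 8x8 block, one byte per pixel */

/* Placeholder for the real processing: sums the pixels of one block. */
static uint32_t sum_block(const uint8_t *block)
{
    uint32_t sum = 0;
    for (int i = 0; i < BLOCK_BYTES; i++) sum += block[i];
    return sum;
}

/* Request the next block before working on the current one, so the
 * memory fetch can overlap with the processing. */
static uint32_t sum_all_blocks(const uint8_t *blocks, size_t count)
{
    uint32_t total = 0;
    assert(blocks != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (i + 1 < count)
            PREFETCH(blocks + (i + 1) * BLOCK_BYTES);
        total += sum_block(blocks + i * BLOCK_BYTES);
    }
    return total;
}
```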
