SSE, intrinsics, and alignment


Question

I've written a 3D vector class using a lot of SSE compiler intrinsics. Everything worked fine until I started to instantiate classes that have the 3D vector as a member with new. I experienced odd crashes in release mode but not in debug mode, and the other way around.

So I read some articles and figured I need to align the classes owning an instance of the 3D vector class to 16 bytes too. So I just added _MM_ALIGN16 (__declspec(align(16))) in front of the classes like so:

_MM_ALIGN16 struct Sphere
{
    // ....

    Vector3 point;
    float radius;
};

That seemed to solve the issue at first. But after changing some code, my program started to crash in odd ways again. I searched the web some more and found a blog article. I tried what the author, Ernst Hot, did to solve the problem, and it works for me too. I added new and delete operators to my classes like this:

_MM_ALIGN16 struct Sphere
{
    // ....

    void *operator new (size_t size)  // size_t rather than unsigned int, so it also compiles on 64-bit targets
     { return _mm_malloc(size, 16); }

    void operator delete (void *p)
     { _mm_free(p); }

    Vector3 point;
    float radius;
};

Ernst mentions that this approach could be problematic as well, but he just links to a forum that no longer exists, without explaining why it could be problematic.

So my questions are:

  1. What's the problem with defining the operators?

  2. Why isn't adding _MM_ALIGN16 to the class definition enough?

  3. What's the best way to handle the alignment issues that come with SSE intrinsics?

Recommended answer

First of all, you have to take care of two types of memory allocation:

  • Static allocation. For automatic variables to be properly aligned, your type needs a proper alignment specification (e.g. __declspec(align(16)), __attribute__((aligned(16))), or your _MM_ALIGN16). Fortunately, you only need this if the alignment requirements given by the type's members (if any) are not sufficient. So you don't need it for your Sphere, given that your Vector3 is already aligned properly. And if your Vector3 contains an __m128 member (which is pretty likely; otherwise I would suggest adding one), then you don't even need it for Vector3. So you usually don't have to mess with compiler-specific alignment attributes.

  • Dynamic allocation. So much for the easy part. The problem is that C++ uses, on the lowest level, a rather type-agnostic memory allocation function for allocating any dynamic memory. This only guarantees proper alignment for all standard types, which might happen to be 16 bytes but isn't guaranteed to be.
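The two cases above can be made concrete with a small sketch. The Vector3 and Sphere definitions here are hypothetical stand-ins for the question's types, assuming Vector3 simply wraps an __m128. The sketch checks that the alignment requirement propagates from the member without any compiler-specific attribute, and it exposes the alignment that plain dynamic allocation is required to honour on the current platform.

```cpp
#include <cassert>
#include <cstddef>     // std::size_t, std::max_align_t
#include <cstdint>     // std::uintptr_t
#include <xmmintrin.h> // __m128

// Hypothetical stand-ins for the question's types.
struct Vector3 {
    __m128 data;   // the __m128 member carries the 16-byte requirement
};

struct Sphere {
    Vector3 point;
    float radius;
};

// No alignment attribute is needed: the requirement propagates from the member.
static_assert(alignof(Vector3) == 16, "Vector3 takes its alignment from __m128");
static_assert(alignof(Sphere) == 16, "Sphere takes its alignment from Vector3");

// The alignment that malloc and plain operator new must honour for standard types.
// Often 16 on 64-bit targets and 8 on many 32-bit ones, so 16 is not guaranteed.
constexpr std::size_t kStandardAlignment = alignof(std::max_align_t);

inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// An automatic (stack) Sphere is placed correctly by the compiler.
inline void check_stack_sphere() {
    Sphere s;
    assert(is_aligned(&s, 16));
    (void)s;
}
```

If kStandardAlignment is smaller than 16 on your target, a Sphere created with plain new may land on a misaligned address, which would match the intermittent crashes described in the question.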

To compensate for this, you have to overload the built-in operator new/delete to implement your own memory allocation and use an aligned allocation function under the hood instead of good old malloc. Overloading operator new/delete is a topic of its own, but it isn't as difficult as it might seem at first (though your example is not enough), and you can read about it in this excellent FAQ question.
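As a rough sketch of what a more complete set of class-level overloads might look like, the following assumes a Vector3 that wraps an __m128 and uses _mm_malloc/_mm_free as the aligned allocation functions. Compared with the snippet in the question, it reports allocation failure by throwing std::bad_alloc and also covers the array forms. It does not cover the nothrow or placement forms, so treat it as a starting point rather than a full implementation.

```cpp
#include <cassert>
#include <cstddef>     // std::size_t
#include <cstdint>     // std::uintptr_t
#include <new>         // std::bad_alloc
#include <xmmintrin.h> // __m128, _mm_malloc, _mm_free

// Hypothetical Vector3; only the allocation functions matter here.
struct Vector3 {
    __m128 data;

    // Scalar form: 16-byte aligned storage, failure reported the standard way.
    static void* operator new(std::size_t size) {
        void* p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept { _mm_free(p); }

    // Array forms, so that new Vector3[n] goes through the same allocator.
    static void* operator new[](std::size_t size) { return operator new(size); }
    static void operator delete[](void* p) noexcept { operator delete(p); }
};

inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Both the single object and the array should come back 16-byte aligned.
inline void check_heap_alignment() {
    Vector3* one = new Vector3;
    Vector3* many = new Vector3[3];
    assert(is_aligned16(one));
    assert(is_aligned16(many));
    delete one;
    delete[] many;
}
```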

Unfortunately, you have to do this for each type that has any member needing non-standard alignment, in your case both Sphere and Vector3. What you can do to make it a bit easier is make an empty base class with proper overloads for those operators and then derive all necessary classes from this base class.
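The base-class idea could be sketched as follows. The name Aligned16 and the use of a template parameter are assumptions made for this illustration, not something the answer prescribes. The template makes each derived type get its own distinct empty base; with a single non-template base, a Sphere and the Vector3 inside it would both contain the same empty base type, and since two subobjects of the same type cannot share an address, the compiler may have to insert padding.

```cpp
#include <cassert>
#include <cstddef>     // std::size_t
#include <cstdint>     // std::uintptr_t
#include <new>         // std::bad_alloc
#include <xmmintrin.h> // __m128, _mm_malloc, _mm_free

// Hypothetical empty base class that supplies 16-byte aligned allocation.
// Derived is unused; it only makes each base a distinct type.
template <class Derived>
struct Aligned16 {
    static void* operator new(std::size_t size) {
        void* p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept { _mm_free(p); }
    static void* operator new[](std::size_t size) { return operator new(size); }
    static void operator delete[](void* p) noexcept { operator delete(p); }
};

struct Vector3 : Aligned16<Vector3> {
    __m128 data;
};

struct Sphere : Aligned16<Sphere> {
    Vector3 point;
    float radius;
};

// A heap-allocated Sphere now comes from the aligned allocator.
inline void check_sphere_on_heap() {
    Sphere* s = new Sphere;
    assert(reinterpret_cast<std::uintptr_t>(s) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&s->point) % 16 == 0);
    delete s;
}
```

Every further type with an SSE member then only needs the extra base in its declaration.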

What most people tend to forget is that the standard allocator std::allocator uses the global operator new for all memory allocation, so your types won't work with standard containers (and a std::vector<Vector3> isn't that rare a use case). What you need to do is write your own standard-conformant allocator and use it. For convenience and safety, though, it is actually better to specialize std::allocator for your type (maybe just deriving it from your custom allocator), so that it is always used and you don't need to remember the proper allocator each time you use a std::vector. Unfortunately, in this case you again have to specialize it for each aligned type, but a small evil macro helps with that.
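A minimal C++11-style allocator for this purpose might look like the sketch below. The name AlignedAllocator and its template parameters are assumptions for illustration, and _mm_malloc/_mm_free are again used as the aligned allocation functions. It covers only the first option from the paragraph above, passing the allocator explicitly to the container; it does not attempt the std::allocator specialization.

```cpp
#include <cassert>
#include <cstddef>     // std::size_t
#include <cstdint>     // std::uintptr_t
#include <new>         // std::bad_alloc
#include <vector>
#include <xmmintrin.h> // __m128, _mm_malloc, _mm_free

// Hypothetical minimal allocator returning Alignment-byte aligned storage.
template <class T, std::size_t Alignment = 16>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // rebind must be spelled out because of the non-type template parameter.
    template <class U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        void* p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { _mm_free(p); }
};

// Stateless allocators of the same alignment are interchangeable.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Hypothetical Vector3 wrapping an __m128.
struct Vector3 {
    __m128 data;
};

// The vector's element storage should be 16-byte aligned.
inline void check_vector_storage() {
    std::vector<Vector3, AlignedAllocator<Vector3>> v(5);
    assert(reinterpret_cast<std::uintptr_t>(v.data()) % 16 == 0);
    (void)v;
}
```

The drawback the answer points out still applies: every container declaration has to name the allocator, and forgetting it once brings back the original problem.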

Additionally, you have to look out for other things that use the global operator new/delete instead of your custom one, like std::get_temporary_buffer and std::return_temporary_buffer, and take care of those if necessary.

Unfortunately there isn't yet a much better approach to these problems, I think, unless you are on a platform that natively aligns to 16 bytes and you know about this. Or you might just overload the global operator new/delete to always align each memory block to 16 bytes, which frees you from caring about the alignment of every single class containing an SSE member, but I don't know about the implications of this approach. In the worst case it should just result in wasted memory, but then again you usually don't allocate small objects dynamically in C++ (though std::list and std::map might think differently about this).
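For the global alternative, a deliberately simplified sketch could look like this, assuming _mm_malloc/_mm_free are available. It replaces only the basic scalar forms and relies on the library's default array, nothrow, and sized forms forwarding to them. A production version would also have to decide how to treat the new-handler, which this sketch ignores, and the answer's caveat about unknown implications applies in full.

```cpp
#include <cassert>
#include <cstddef>     // std::size_t
#include <cstdint>     // std::uintptr_t
#include <new>         // std::bad_alloc
#include <xmmintrin.h> // _mm_malloc, _mm_free

// Replacement for the global allocation function: every block in the
// program that is obtained through plain new is now 16-byte aligned.
void* operator new(std::size_t size) {
    if (size == 0) size = 1;   // a zero-size request must still yield a valid pointer
    void* p = _mm_malloc(size, 16);
    if (!p) throw std::bad_alloc();
    return p;
}

// Matching replacement for the global deallocation function.
void operator delete(void* p) noexcept {
    _mm_free(p);
}

// Even a single char allocated with new is now 16-byte aligned.
inline void check_global_new() {
    char* small = new char;
    assert(reinterpret_cast<std::uintptr_t>(small) % 16 == 0);
    delete small;
}
```

The memory cost mentioned above shows up here: a one-byte request still occupies at least one 16-byte aligned block.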

To summarize:

  • Take care of proper alignment of static memory using things like __declspec(align(16)), but only if it is not already taken care of by a member, which is usually the case.

  • Overload operator new/delete for every type that has a member with non-standard alignment requirements.

  • Make a custom standard-conformant allocator to use in standard containers of aligned types, or better yet, specialize std::allocator for every aligned type.

Finally, some general advice. Often you only profit from SSE in computation-heavy blocks that perform many vector operations. To simplify all these alignment problems, especially the problem of caring for the alignment of every type containing a Vector3, it might be a good approach to make a special SSE vector type and use it only inside lengthy computations, with a normal non-SSE vector for storage and member variables.
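The split between a storage type and a compute type might be sketched like this. The names Vec3, SseVec3, and midpoint are hypothetical and chosen only for this illustration. The storage type holds three plain floats and needs no special alignment, while the SSE type exists only as a local value inside the computation.

```cpp
#include <cassert>
#include <xmmintrin.h> // __m128 and the _mm_* intrinsics

// Plain storage type: three floats, no alignment requirement beyond float,
// so it is safe as a member, with new, and in standard containers.
struct Vec3 {
    float x, y, z;
};

// Hypothetical compute-only type, meant to live in locals during a computation.
struct SseVec3 {
    __m128 v;

    // _mm_set_ps takes the highest lane first, so x ends up in lane 0.
    explicit SseVec3(const Vec3& s) : v(_mm_set_ps(0.0f, s.z, s.y, s.x)) {}
    explicit SseVec3(__m128 raw) : v(raw) {}

    // Unaligned store into a temporary, then copy the three used lanes out.
    Vec3 store() const {
        float out[4];
        _mm_storeu_ps(out, v);
        return Vec3{out[0], out[1], out[2]};
    }
};

inline SseVec3 operator+(const SseVec3& a, const SseVec3& b) {
    return SseVec3(_mm_add_ps(a.v, b.v));
}
inline SseVec3 operator*(const SseVec3& a, float s) {
    return SseVec3(_mm_mul_ps(a.v, _mm_set1_ps(s)));
}

// Example computation: convert on entry, work in SSE, convert back on exit.
inline Vec3 midpoint(const Vec3& a, const Vec3& b) {
    SseVec3 sum = SseVec3(a) + SseVec3(b);
    return (sum * 0.5f).store();
}
```

The conversions at the boundaries have a cost, so this pays off mainly when a block performs many operations between the load and the store, which matches the situation the advice describes.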
