Does std::vector<Simd_wrapper> have contiguous data in memory?

Question

class Wrapper {
public:
    // some functions operating on the value_
    __m128i value_;
};

int main() {
    std::vector<Wrapper> a;
    a.resize(100);
}

Would the value_ attribute of the Wrapper objects in the vector a always occupy contiguous memory, without any gaps between the __m128i values?

I mean:

[128 bit for 1st Wrapper][no gap here][128bit for 2nd Wrapper] ...

So far, this seems to be true for g++ and the Intel cpu I am using, and gcc godbolt.

Since there is only a single __m128i attribute in the Wrapper object, does that mean the compiler never needs to add any kind of padding in memory? (Memory layout of vector of POD objects)

Test code 1:

#include <iostream>
#include <vector>
#include <cstdint>      // uint32_t
#include <x86intrin.h>

int main()
{
  static constexpr size_t N = 1000;
  std::vector<__m128i> a;
  a.resize(1000);
  //__m128i a[1000];
  uint32_t* ptr_a = reinterpret_cast<uint32_t*>(a.data());
  for (size_t i = 0; i < 4*N; ++i)
    ptr_a[i] = i;
  for (size_t i = 1; i < N; ++i){
    a[i-1] = _mm_and_si128 (a[i], a[i-1]);
  }
  for (size_t i = 0; i < 4*N; ++i)
    std::cout << ptr_a[i];
}

Warning:

warning: ignoring attributes on template argument 
'__m128i {aka __vector(2) long long int}'
[-Wignored-attributes]

Assembly (gcc god bolt):

.L9:
        add     rax, 16
        movdqa  xmm1, XMMWORD PTR [rax]
        pand    xmm0, xmm1
        movaps  XMMWORD PTR [rax-16], xmm0
        cmp     rax, rdx
        movdqa  xmm0, xmm1
        jne     .L9

I guess this means the data is contiguous, because the loop just adds 16 bytes to the memory address it reads on every iteration. It uses pand to do the bitwise AND.

Test code 2:

#include <iostream>
#include <vector>
#include <cstdint>      // uint32_t
#include <x86intrin.h>
class Wrapper {
public:
    __m128i value_;
    inline Wrapper& operator &= (const Wrapper& rhs)
    {
        value_ = _mm_and_si128(value_, rhs.value_);
        return *this;   // missing return: falling off the end of a non-void function is UB
    }
}; // Wrapper
int main()
{
  static constexpr size_t N = 1000;
  std::vector<Wrapper> a;
  a.resize(N);
  //__m128i a[1000];
  uint32_t* ptr_a = reinterpret_cast<uint32_t*>(a.data());
  for (size_t i = 0; i < 4*N; ++i) ptr_a[i] = i;
  for (size_t i = 1; i < N; ++i){
    a[i-1] &=a[i];
    //std::cout << ptr_a[i];
  }
  for (size_t i = 0; i < 4*N; ++i)
    std::cout << ptr_a[i];
}

Assembly (gcc god bolt)

.L9:
        add     rdx, 2
        add     rax, 32
        movdqa  xmm1, XMMWORD PTR [rax-16]
        pand    xmm0, xmm1
        movaps  XMMWORD PTR [rax-32], xmm0
        movdqa  xmm0, XMMWORD PTR [rax]
        pand    xmm1, xmm0
        movaps  XMMWORD PTR [rax-16], xmm1
        cmp     rdx, 999
        jne     .L9

Looks like there is no padding here either: rax increases by 32 in each step, and that is 2 x 16 (two elements per iteration). That extra add rdx, 2 is definitely not as good as the loop from test code 1.

Test auto-vectorization

#include <iostream>
#include <vector>
#include <cstdint>      // uint32_t
#include <x86intrin.h>

int main()
{
  static constexpr size_t N = 1000;
  std::vector<__m128i> a;
  a.resize(1000);
  //__m128i a[1000];
  uint32_t* ptr_a = reinterpret_cast<uint32_t*>(a.data());
  for (size_t i = 0; i < 4*N; ++i)
    ptr_a[i] = i;
  for (size_t i = 1; i < N; ++i){
    a[i-1] = _mm_and_si128 (a[i], a[i-1]);
  }
  for (size_t i = 0; i < 4*N; ++i)
    std::cout << ptr_a[i];
}

Assembly (god bolt):

.L21:
        movdqu  xmm0, XMMWORD PTR [r10+rax]
        add     rdi, 1
        pand    xmm0, XMMWORD PTR [r8+rax]
        movaps  XMMWORD PTR [r8+rax], xmm0
        add     rax, 16
        cmp     rsi, rdi
        ja      .L21

... I just don't know if this is always true for Intel CPUs and g++ / Intel C++ / (insert compiler name here) compilers ...

Solution

No-padding is safe to assume in practice, unless you're compiling for a non-standard ABI.

All compilers targeting the same ABI must make the same choice about struct/class sizes / layouts, and all the standard ABIs / calling conventions will have no padding in your struct. (i.e. x86-32 and x86-64 System V and Windows, see the tag wiki for links). Your experiments with one compiler confirm it for all compilers targeting the same platform/ABI.

Note that the scope of this question is limited to x86 compilers that support Intel's intrinsics and the __m128i type, which means we have much stronger guarantees than what you get from just the ISO C++ standard without any implementation-specific stuff.
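
For example, here is a minimal sketch of how you could make that no-padding assumption explicit and checked (this is only an illustration, not code from the question); if any of these checks failed, vector<Wrapper> could not be treated as a dense array of 128-bit values:

#include <x86intrin.h>
#include <vector>

class Wrapper {
public:
    __m128i value_;
};

// No padding inside or after the single __m128i member, and Wrapper keeps
// the 16-byte alignment of __m128i itself.
static_assert(sizeof(Wrapper) == sizeof(__m128i), "unexpected padding in Wrapper");
static_assert(alignof(Wrapper) == alignof(__m128i), "unexpected alignment of Wrapper");

int main() {
    std::vector<Wrapper> a(4);
    // std::vector elements are always contiguous, so consecutive value_
    // members should be exactly sizeof(Wrapper) == 16 bytes apart.
    auto* p0 = reinterpret_cast<const char*>(&a[0].value_);
    auto* p1 = reinterpret_cast<const char*>(&a[1].value_);
    return (p1 - p0) == 16 ? 0 : 1;
}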


As @zneak points out, you can static_assert(std::is_standard_layout<Wrapper>::value) in the class def to remind people not to add any virtual methods, which would add a vtable pointer to each instance.
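
For example, a sketch of what that could look like (the assert is placed right after the class definition here, since the trait needs a complete type):

#include <x86intrin.h>
#include <type_traits>

class Wrapper {
public:
    // functions operating on value_, but nothing virtual
    __m128i value_;
};

// Fails to compile if someone later adds a virtual method (and therefore a
// vtable pointer) or otherwise breaks the standard-layout property.
static_assert(std::is_standard_layout<Wrapper>::value,
              "Wrapper must stay standard-layout so vector<Wrapper> storage stays a plain array of __m128i");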
