Data alignment to enable vectorization / efficient cache access

Published: 2020/10/6 23:40:51. Tags: c++, vectorization, compiler-optimization, simd, memory-alignment


Question

This book says the following:

  For Knights Landing, memory movement is optimal when the data starting
  address lies on 64-byte boundaries.

Q1. Is there a way to query the processor dynamically from C++ code to find out what this optimal n-byte boundary is for the processor on which the application is currently running? That way, the code would be portable.

The book further states:

  As programmers, we end up with two jobs: (1) align our data and (2) make
  sure the compiler knows it is aligned.

(Suppose for the question below that we know that it is optimal for our processor to have data start at 64-byte boundaries.)

What exactly is this "data" though?

Suppose I have a class thus:

    #include <vector>

    class Class1_ {
    private:
        int a;     // 4 bytes
        double b;  // 8 bytes
        std::vector<int> potentially_longish_vector_int;
        std::vector<double> potentially_longish_vector_double;
        double* potentially_longish_heap_array_double;

    public:
        //--stuff---//
        double* return_heap_array_address() { return potentially_longish_heap_array_double; }
    };

Suppose I also have functions that are prototyped thus:

    void func1(Class1_& obj_class1);

    void func2(double* array);

That is, func1 takes in an object of Class1_ by reference, and func2 is called as func2(obj_class1.return_heap_array_address()).

To be consistent with the advice that data should be appropriately boundary aligned, should obj_class1 itself be 64-byte boundary aligned for efficient functioning of func1()? Should potentially_longish_heap_array_double be 64-byte boundary aligned for efficient functioning of func2()?

For alignment of other data members of the class which are STL containers, the thread here suggests how to go about accomplishing the required alignment.

Q2. So, does the object itself need to be appropriately aligned as well as all of the data members within it?

Answer

In general, aligning your arrays on a cache-line boundary maximizes cache utilization and also makes the arrays suitably aligned for any SIMD instructions. That is because the unit of transfer between RAM and CPU caches is a cache line, which is 64 bytes on modern Intel CPUs.
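
As a rough illustration of the two jobs quoted in the question (align the data, then tell the compiler), here is a minimal sketch. It assumes a 64-byte cache line, a C++17 standard library that provides `std::aligned_alloc` (not available on every platform), and the GCC/Clang builtin `__builtin_assume_aligned`. The names `kCacheLine`, `make_aligned_doubles`, `is_cache_line_aligned` and `sum_aligned` are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Assumed cache-line size; 64 bytes matches the Intel CPUs discussed above.
constexpr std::size_t kCacheLine = 64;

// Deleter so that std::unique_ptr releases memory obtained from std::aligned_alloc.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

using AlignedDoubles = std::unique_ptr<double[], FreeDeleter>;

// Job (1): allocate n doubles starting on a cache-line boundary.
// std::aligned_alloc expects the size to be a multiple of the alignment,
// so the byte count is rounded up.
inline AlignedDoubles make_aligned_doubles(std::size_t n) {
    std::size_t bytes = n * sizeof(double);
    bytes = (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
    if (bytes == 0) bytes = kCacheLine;
    return AlignedDoubles(static_cast<double*>(std::aligned_alloc(kCacheLine, bytes)));
}

// Checks whether an address lies on a cache-line boundary.
inline bool is_cache_line_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}

// Job (2): promise the compiler that `array` is cache-line aligned.
// The caller must really pass an aligned pointer; otherwise behavior is undefined.
inline double sum_aligned(const double* array, std::size_t n) {
#if defined(__GNUC__) || defined(__clang__)
    array = static_cast<const double*>(__builtin_assume_aligned(array, kCacheLine));
#endif
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) total += array[i];
    return total;
}
```

In terms of the question, `potentially_longish_heap_array_double` could be allocated this way, and `func2` could carry the alignment promise in the same way `sum_aligned` does.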

However, increased alignment may also waste memory and reduce cache utilization, because every over-aligned object is padded out to a multiple of its alignment. Normally, only data structures on the critical fast path of your application need an increased alignment.
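
The cost of over-alignment can be seen with `sizeof`. The sketch below assumes a 64-byte cache line; `Plain`, `CacheAligned` and `objects_per_cache_line` are hypothetical names, and the exact size of `Plain` depends on the ABI (typically 16 bytes on x86-64).

```cpp
#include <cstddef>

// Assumed cache-line size, as in the discussion above.
constexpr std::size_t kCacheLine = 64;

// The two scalar members from the question, with default alignment.
struct Plain {
    int a;     // 4 bytes
    double b;  // 8 bytes
};

// The same members, but every object is forced onto its own cache line.
struct alignas(kCacheLine) CacheAligned {
    int a;
    double b;
};

// How many objects of type T fit into one cache line (at least one).
template <typename T>
constexpr std::size_t objects_per_cache_line() {
    return sizeof(T) >= kCacheLine ? 1 : kCacheLine / sizeof(T);
}

// alignas pads the object to a full cache line, so most of its bytes are unused.
static_assert(alignof(CacheAligned) == kCacheLine, "object starts on a cache line");
static_assert(sizeof(CacheAligned) == kCacheLine, "object is padded to a full cache line");
```

An array of `CacheAligned` holds one object per cache line, while an array of `Plain` packs several into the same line, which is why blanket over-alignment can hurt rather than help.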

It makes sense to arrange the members of your classes in {hotness, size} order, so that the most frequently accessed members, or members that are accessed together, reside on the same cache line.
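
A small sketch of that ordering, with a hypothetical `Order` record whose fields and their "hotness" are invented purely for illustration. It assumes a 64-byte cache line and uses `offsetof` to check at compile time that the hot members share the first line.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record; which members are hot is an assumption of this example.
struct alignas(64) Order {
    // Hot members: assumed to be read together on the fast path.
    std::uint64_t id;
    double price;
    std::uint32_t quantity;
    std::uint32_t flags;
    // Cold members: assumed to be rarely touched.
    char description[200];
    std::uint64_t created_at;
};

// Index of the 64-byte cache line (relative to the object start) containing an offset.
constexpr std::size_t cache_line_of(std::size_t offset) { return offset / 64; }

// All hot members, including the last byte of `flags`, sit in cache line 0.
static_assert(cache_line_of(offsetof(Order, id)) == 0, "id is in the first line");
static_assert(cache_line_of(offsetof(Order, flags) + sizeof(std::uint32_t) - 1) == 0,
              "hot members end within the first line");
```

Because the object itself is aligned to 64 bytes, touching any hot member brings all of them into the cache with a single line fill.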

The optimization objective here is to reduce cache and TLB misses (or, put differently, to decrease cycles-per-instruction / increase instructions-per-cycle). TLB misses can be reduced by using huge pages.
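
One possible way to ask for huge pages on Linux is transparent huge pages via `madvise` with `MADV_HUGEPAGE`. The sketch below is Linux-specific and only advisory: the kernel may ignore the hint, and on other platforms the function simply returns an aligned buffer. The 2 MiB page size and the names `kHugePage` and `alloc_huge_page_hint` are assumptions of this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#if defined(__linux__)
#include <sys/mman.h>
#endif

// Assumed huge-page size; 2 MiB is common on x86-64 but not guaranteed.
constexpr std::size_t kHugePage = 2 * 1024 * 1024;

// Returns a buffer of at least `bytes` bytes aligned to kHugePage, or nullptr
// on allocation failure. Release the buffer with std::free.
inline void* alloc_huge_page_hint(std::size_t bytes) {
    // Round up so the size is a non-zero multiple of the alignment.
    std::size_t rounded = (bytes + kHugePage - 1) / kHugePage * kHugePage;
    if (rounded == 0) rounded = kHugePage;
    void* p = std::aligned_alloc(kHugePage, rounded);
#if defined(__linux__) && defined(MADV_HUGEPAGE)
    if (p != nullptr) {
        // Advisory hint only; a failure here is deliberately ignored.
        (void)madvise(p, rounded, MADV_HUGEPAGE);
    }
#endif
    return p;
}
```

Whether the buffer actually ends up backed by huge pages depends on the system's transparent huge page configuration, so this should be measured rather than assumed.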
