SoA / AoS内存布局的C ++零成本抽象 [英] C++ zero-cost abstraction for SoA/AoS memory layouts

查看:265
本文介绍了SoA / AoS内存布局的C ++零成本抽象的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说我使用结构数组(AoS)内存布局编写了大量代码。我想用C ++构建一个零成本的抽象,它使我能够以尽可能少的重构工作在AoS和SoA之间进行切换。
例如带访问功能的类

 结构项{
auto& myDouble(){return mDouble; }
自动& myChar(){返回mChar; }
自动& myString(){return mString; }
私人:
double mDouble;
char mChar;
std :: string mString;
};

在容器内循环使用

  std :: vector< Item> vec_(1000); 
for(auto& i:vec_)
i.myDouble()= 5。

我想更改第一个代码段,而第二个代码段保持相似。

  MyContainer< Item,SoA> vec_(1000)
用于(auto& i:vec_)
i.myDouble()= 5 .;

其中,我可以使用 SoA或 AoS模板参数选择内存布局。我的问题是:这样的东西存在于某处吗?如果没有,那么如何实现呢?

解决方案

我实现了一个通用解决方案,我将解释它在下面(将是一长篇文章)。当然,这不是唯一可能的答案,收集反馈会很棒。我在 https://github.com/crosetto/SoAvsAoS

我们创建两个帮助器类,根据标签模板参数,给定一个项目,该类将容器类型生成为元组向量或向量元组。我们将此类称为DataLayoutPolicy,并将使用它,例如

  DataLayoutPolicy< std :: vector,SoA,char,double,std :: string> 

生成char,int和double的向量元组。

  enum class DataLayout {SoA,//数组结构
AoS //结构数组
};
template< template< typename ...>类Container,DataLayout TDataLayout,类型名TItem>
struct DataLayoutPolicy;

此类仅包含与容器进行交互的静态成员函数(例如,提取元素,插入,调整大小等)。我们编写了两个模板专长。第一个(琐碎的)结构数组情况:

  template< template< typename ...>类Container,template< typename ...>类TItem,类型名... Types> 
struct DataLayoutPolicy<容器,DataLayout :: AoS,TItem< Types ...>> {
using type = Container< TItem< Types ...>> ;;
使用value_type = TItem< Types ...>& ;;

constexpr静态value_type get(type& c_,std :: size_t position_){return value_type(* static_cast< TItem< Types ...> *>(& c_ [position_])) ; }

constexpr static void resize(type& c_,std :: size_t size_){c_.resize(size_); }

模板< typename TValue>
constexpr static void push_back(type& c_,TValue& val_){c_.push_back(val_); }
静态constexpr std :: size_t size(type& c _){return c_.size(); }
};

...只是转发。我们对数组案例的结构也做同样的事情。



注意:下面的代码需要解释一些事情。



它包装了所有类型以ref_wrap类型,即装饰的std :: reference_wrapper。这是因为我们要访问元素作为左值引用,以便能够更改其值。使用常规参考,如果例如类型包含任何引用。值得注意的是,在AoS情况下,DataLayoutPolicy :: value_type是引用,而在SoA情况下,是ref_wrap类型的值。



我们返回值新创建的值ref_wrap元组。出乎意料的是,这是因为编译器正在优化所有副本,并且在C ++ 17中它甚至还可以(返回的元组为'prvalue'),这是因为在标准中添加了保证的副本省略:不复制,即使std :: tuple和std :: reference_wrapper没有复制/移动构造函数,此代码也可以使用。



我们使用std :: integer静态展开参数包的顺序:这很丑陋,但这是从C ++ 14开始的方式(在C ++ 11中,必须使用模板递归来实现此目的)。还没有像参数包那样的 for_each之类的东西。



我们使用C ++ 17折叠表达式来调用多次返回void的函数。

  template< typename T>可以用C ++ 17简洁地实现。 
struct ref_wrap:public std :: reference_wrapper< T> {
运算符T&()const noexcept {return this-> get(); }
ref_wrap(T& other_):std :: reference_wrapper< T>(other _){}
void运算符=(T&& other_){this-> get()= other_; }
};

template< template< typename ...>类Container,template< typename ...>类TItem,类型名... Types>
struct DataLayoutPolicy<容器,DataLayout :: SoA,TItem< Types ...>> {
using type = std :: tuple< Container< Types> ...> ;;
使用value_type = TItem< ref_wrap< Types> ...> ;;

constexpr静态值类型get(type& c_,std :: size_t position_)
{
return doGet(c_,position_,std :: make_integer_sequence< unsigned,sizeof ... (类型)>()); //展开参数包
}

constexpr静态void resize(type& c_,std :: size_t size_){
doResize(c_,size_,std :: make_integer_sequence< unsigned ,sizeof ...(Types)>()); //展开参数包
}

模板< typename TValue>
constexpr static void push_back(type& c_,TValue& val_){
doPushBack(c_,std :: forward< TValue>(val_),std :: make_integer_sequence< unsigned,sizeof ... (类型)>()); //展开参数包
}

静态constexpr std :: size_t size(type& c _){return std :: get< 0>(c_).size(); }

私人:

模板< unsigned ... Ids>
constexpr静态自动doGet(type& c_,std :: size_t position_,std :: integer_sequence< unsigned,Ids ...>)
{
return value_type {ref_wrap(std :: get< Ids>(c_)[position_])...}; //保证复制省略
}

模板< unsigned ... Ids>
constexpr static void doResize(type& c_,unsigned size_,std :: integer_sequence< unsigned,Ids ...>)
{
(std :: get< Ids>(c_) .resize(size_),...); //折叠表达式
}

模板< typename TValue,unsigned ... Ids>
constexpr static void doPushBack(type& c_,TValue& val_,std :: integer_sequence< unsigned,Ids ...>)
{
(std :: get< Ids> (c_).push_back(std :: get< Ids>(std :: forward< TValue>(val_))),...); //折叠表达式
}
};

所以现在这段代码非常清楚地显示了如何构建这种抽象。我们在下面显示了使用它的可能策略。我们使用DataLayoutPolicy和通用的TItem类型定义policy_t类型。

  template< template< typename T>类TContainer,DataLayout TDataLayout,类型名TItem>使用policy_t = DataLayoutPolicy< TContainer,TDataLayout,TItem> ;; 
;

容器类将大多数调用转发给policy_t类型定义的静态函数。可能看起来像以下内容

  template< template< typename ValueType>类TContainer,DataLayout TDataLayout,类型名TItem> 
struct BaseContainer
{
/ *成员函数,例如puhs_back,resize,... * /
value_type运算符[](std :: size_t position_)
{
return policy_t :: get(mValues,position_);
}

迭代器begin(){return iterator(this,0); }
iterator end(){return iterator(this,size()); }

私人:

typename policy_t :: type mValues;

};

现在这不是标准容器,因此我们必须定义一个迭代器才能在STL中使用它算法。我们构建的迭代器看起来像是元组容器的STL迭代器,除了它必须包含对容器的引用这一事实,因为当我们调用解引用运算符时,我们想调用存储的operator [],它静态地调度容器的引用。

  template< typename TContainer>操作。 
类迭代器
{

private:
使用container_t = TContainer;
public:

/ * ...常规迭代器成员函数和类型定义... * /

template< typename TTContainer>
Iterator(TTContainer * container_,std :: size_t position_ = 0):
mContainer(container_)
,mIterPosition(position_)
{
}

value_type运算符*(){
return(* mContainer)[mIterPosition];
}

私人:
container_t * mContainer = nullptr;
std :: size_t mIterPosition = std :: numeric_limits< std :: size_t> :: infinity();
};

最终,我们定义了 item数据结构:将其作为std :: tuple的装饰器

  template< typename ... T>,具有某些特定的成员函数(在这种情况下,仅是getter / setter)。 
struct项目:public std :: tuple< T ...> {
使用std :: tuple< T ...> :: tuple;
自动& myDouble(){返回std :: get< 0>(* this);}
自动& myChar(){返回std :: get< 1>(* this);}
自动& myString(){返回std :: get< 2>(* this);}
};

调用Item的成员函数时,我们必须依赖编译器优化才能使抽象成为零成本:我们不希望调用Item构造函数,因为我们正在创建一个临时元组,每次只访问其中一个成员,然后立即对其进行处理。



所以最终我们可以编写程序:

  template< typename T> 
使用MyVector = std :: vector< T,std :: allocator< T> ;;

int main(int argc,char ** argv){
使用container_t = BaseContainer< MyVector,DataLayout :: SoA,Item< double,char,std :: string,Pad> > ;;
container_t container_(1000);

for(auto& i:container _){
i.myDouble()= static_cast< double>(argc);
}

我们可以编写通用且有效的代码,而无需考虑其下面的内存布局。剩下要做的就是检查这是否是零成本抽象。对我而言,最简单的检查方法是使用调试器:使用调试符号编译示例,

 > clang ++ -std = c ++ 1z -O3 -g main.cpp -o test 

运行gdb,在for循环中设置brakpoint,然后逐步执行汇编指令(layout split命令同时显示源代码和反汇编指令)

 > gdb test 
(gdb)break main.cpp:10#在循环内设置断点
(gdb)运行#执行直到断点
(gdb)布局拆分#显示汇编和源代码2个独立的帧
(gdb)stepi#执行一条指令

循环是在AoS数据布局的情况下

  0x400b00< main(int,char **)+ 192> movsd%xmm0,(%rsi)
0x400b04< main(int,char **)+ 196>加$ 0x610,%rsi
0x400b0b< main(int,char **)+ 203>加$ 0xffffffffffffffff,%rcx
0x400b0f< main(int,char **)+ 207> jne 0x400b00< main(int,char **)+ 192>

请特别注意,在第二行中,要计算地址的偏移量为0x160。这将根据item对象中数据成员的大小而变化。另一方面,对于SoA数据结构,我们有

  0x400b60< main(int,char **)+ 224> movups%xmm1,(%rdi,%rsi,8)
0x400b64< main(int,char **)+ 228> movups%xmm1,0x10(%rdi,%rsi,8)
0x400b69< main(int,char **)+ 233>波动%xmm1,0x20(%rdi,%rsi,8)
0x400b6e< main(int,char **)+ 238>波动%xmm1,0x30(%rdi,%rsi,8)
0x400b73< main(int,char **)+ 243>波动%xmm1,0x40(%rdi,%rsi,8)
0x400b78< main(int,char **)+ 248>波动%xmm1,0x50(%rdi,%rsi,8)
0x400b7d< main(int,char **)+ 253>波动%xmm1,0x60(%rdi,%rsi,8)
0x400b82< main(int,char **)+ 258>变动%xmm1,0x70(%rdi,%rsi,8)
0x400b87< main(int,char **)+ 263> movups%xmm1,0x80(%rdi,%rsi,8)
0x400b8f< main(int,char **)+ 271>波动%xmm1,0x90(%rdi,%rsi,8)
0x400b97< main(int,char **)+ 279> movups%xmm1,0xa0(%rdi,%rsi,8)
0x400b9f< main(int,char **)+ 287>波动%xmm1,0xb0(%rdi,%rsi,8)
0x400ba7< main(int,char **)+ 295> movups%xmm1,0xc0(%rdi,%rsi,8)
0x400baf< main(int,char **)+ 303> movups%xmm1,0xd0(%rdi,%rsi,8)
0x400bb7< main(int,char **)+ 311> movups%xmm1,0xe0(%rdi,%rsi,8)
0x400bbf< main(int,char **)+ 319>上移%xmm1,0xf0(%rdi,%rsi,8)
0x400bc7< main(int,char **)+ 327>加$ 0x20,%rsi
0x400bcb< main(int,char **)+ 331>加$ 0x8,%rbx
0x400bcf< main(int,char **)+ 335> jne 0x400b60< main(int,char **)+ 224>

我们看到循环是由Clang展开和向量化的(版本6.0.0),增量为地址为0x20,与项目struct中存在的数据成员数量无关。


Say I have a large code using Array of Structures (AoS) memory layout. I would like to build a zero-cost abstraction in C++ which allows me to switch between AoS and SoA with as little refactoring effort as possible. For instance take a class with access member functions

 struct Item{
   auto& myDouble(){ return mDouble; }
   auto& myChar(){ return mChar; }
   auto& myString(){ return mString; }
 private:
   double mDouble;
   char mChar;
   std::string mString;
 };

which is used inside a container in a loop

std::vector<Item> vec_(1000);
for (auto& i : vec_)
  i.myDouble()=5.;

I would like to change the first snippet while the second one remains similar.. e.g. having something like

MyContainer<Item, SoA> vec_(1000)
for (auto& i : vec_)
  i.myDouble()=5.;

in which I can select the memory layout with a "SoA" or "AoS" template parameters. My questions are: does such thing exist somewhere? And if it doesn't, how would it be implemented at best?

解决方案

I implemented a generic solution, I'll explain it here below (it will be a long post). This is not the only possible answer of course, and it'd be great to gather feedback. I placed the full code of this solution here https://github.com/crosetto/SoAvsAoS

We create two helper classes which given an item generate the container type as a vector of tuples or a tuple of vectors, depending on a tag template argument. We call this class a DataLayoutPolicy and we are going to use it e.g. in this way:

DataLayoutPolicy<std::vector, SoA, char, double, std::string>

to generate a tuple of vectors of char, int and double.

enum class DataLayout { SoA, //structure of arrays
                        AoS //array of structures
};
template <template <typename...> class Container, DataLayout TDataLayout, typename TItem>
struct DataLayoutPolicy;

This class will only contain static member functions to interact with the container (e.g. extract an element, insert, resize, etc...). We write two template specializations. The first one (trivial) for the array of structures situation:

template <template <typename...> class Container, template<typename...> class TItem, typename... Types>
struct DataLayoutPolicy<Container, DataLayout::AoS, TItem<Types...>> {
    using type = Container<TItem<Types...>>;
    using value_type = TItem<Types...>&;

    constexpr static value_type get( type& c_, std::size_t position_ ){ return value_type(*static_cast<TItem<Types...>*>(&c_[ position_ ])); }

    constexpr static void resize( type& c_, std::size_t size_ ) { c_.resize( size_ ); }

    template <typename TValue>
    constexpr static void push_back( type& c_, TValue&& val_ ){ c_.push_back( val_ ); }
    static constexpr std::size_t size(type& c_){ return  c_.size(); }
};

... just forwarding. We do the same for the structure of arrays case.

Note: there are a few things to be explained about the code below.

It wraps all the types in a ref_wrap type, which is a "decorated" std::reference_wrapper. This is because we want to access the elements as lvalue references, to be able to change their values. using a regular reference we would be in trouble if e.g. Types contains any reference. One thing worth noticing is that in the AoS case the DataLayoutPolicy::value_type is a reference, while in the SoA case is the value of a ref_wrap type.

we return by value a newly created tuple of ref_wrap of the values. This is surpisingly OK, because the compiler is optimizing all copies away, and it is even more OK in C++17 (the returned tuple is a 'prvalue'), because of the guaranteed copy elision added to the standard: the tuple is not copied, this code would work even if std::tuple and std::reference_wrapper didn't have a copy/move constructor.

we use a std::integer sequence to statically unroll a parameter pack: this is ugly but it is "the way" to do it since C++14 (and in C++11 one had to use template recursion to achieve the same). There is not yet such thing like a "for_each" for parameter packs.

we use C++17 fold expressions to call a function returning void multiple times. Before C++17 this was achieved concisely with tricky hacks.

template <typename T>
struct ref_wrap : public std::reference_wrapper<T>{
    operator T&() const noexcept { return this->get(); }
    ref_wrap(T& other_) : std::reference_wrapper<T>(other_){}
    void operator =(T && other_) {this->get()=other_;}
};

template <template <typename...> class Container, template<typename...> class TItem, typename... Types>
struct DataLayoutPolicy<Container, DataLayout::SoA, TItem<Types...>> {
    using type = std::tuple<Container<Types>...>;
    using value_type = TItem<ref_wrap<Types>...>;

    constexpr static value_type get( type& c_, std::size_t position_ )
    {
        return doGet( c_, position_, std::make_integer_sequence<unsigned, sizeof...( Types )>() ); // unrolling parameter pack
    }

    constexpr static void resize( type& c_, std::size_t size_ ) {
        doResize( c_, size_, std::make_integer_sequence<unsigned, sizeof...( Types )>() ); // unrolling parameter pack
    }

    template <typename TValue>
    constexpr static void push_back( type& c_, TValue&& val_ ){
        doPushBack( c_, std::forward<TValue>(val_), std::make_integer_sequence<unsigned, sizeof...( Types )>() ); // unrolling parameter pack
    }

    static constexpr std::size_t size(type& c_){ return std::get<0>( c_ ).size(); }

    private:

    template <unsigned... Ids>
    constexpr static auto doGet( type& c_, std::size_t position_, std::integer_sequence<unsigned, Ids...> )
    {
        return value_type{ ref_wrap( std::get<Ids>( c_ )[ position_ ] )... }; // guaranteed copy elision
    }

    template <unsigned... Ids>
    constexpr static void doResize( type& c_, unsigned size_, std::integer_sequence<unsigned, Ids...> )
    {
        ( std::get<Ids>( c_ ).resize( size_ ), ... ); //fold expressions
    }

    template <typename TValue, unsigned... Ids>
    constexpr static void doPushBack( type& c_, TValue&& val_, std::integer_sequence<unsigned, Ids...> )
    {
        ( std::get<Ids>( c_ ).push_back( std::get<Ids>( std::forward<TValue>( val_ ) ) ), ... ); // fold expressions
    }
};

So now this code shows pretty clearly how this abstraction can be built. We show below a possible strategy to use it. We define the policy_t type using the DataLayoutPolicy and a generic TItem type

template <template <typename T> class TContainer, DataLayout TDataLayout, typename TItem>
using policy_t = DataLayoutPolicy<TContainer, TDataLayout, TItem>;

The container class forwards most of the calls to the static functions defined by the policy_t type. It might look like the following

template <template <typename ValueType> class TContainer, DataLayout TDataLayout, typename TItem>
struct BaseContainer
{
    /*member functions like puhs_back, resize,...*/
    value_type operator[]( std::size_t position_ )
    {
            return policy_t::get( mValues, position_ );
    }

    iterator       begin() { return iterator( this, 0 ); }
    iterator       end() { return iterator( this, size() ); }

    private:

    typename policy_t::type mValues;

};

Now this is no standard container, so we have to define an iterator in order to use it whithin STL algorithms. The iterator we build looks like a STL iterator for a container of tuple, except for the fact that it must hold a reference to the container, because when we call the dereference operator we want to call our storage's operator[], which statically dispatches the operation using the container's data layout policy.

template <typename  TContainer>
class Iterator
{

private:
    using container_t = TContainer;
public:

    /* ... usual iterator member functions and type definitions ...*/

    template<typename TTContainer>
    Iterator( TTContainer* container_, std::size_t position_ = 0 ):
        mContainer( container_ )
        , mIterPosition( position_ )
    {
    }

    value_type operator*() {
        return (*mContainer)[ mIterPosition ];
    }

    private:
    container_t*        mContainer = nullptr;
    std::size_t         mIterPosition = std::numeric_limits<std::size_t>::infinity();
};

Eventually we define our "item" data structure: we make it a decorator of a std::tuple, with some specific member functions (in this case only getters/setters).

template<typename ... T>
struct Item : public std::tuple<T ...>{
    using std::tuple<T...>::tuple;
    auto & myDouble(){return std::get<0>(*this);}
    auto & myChar()  {return std::get<1>(*this);}
    auto & myString(){return std::get<2>(*this);}
};

When we call Item's member functions we have to rely on compiler optimization in order for our abstraction to be "zero-cost": we don't want to call the Item constructor, because we are creating a temporary tuple just to access one of it's members each time and then we thrash it right away.

so eventually we can write the program:

template<typename T>
using MyVector = std::vector<T, std::allocator<T>>;

int main(int argc, char** argv){
using container_t = BaseContainer<MyVector, DataLayout::SoA, Item<double, char, std::string, Pad> >;
container_t container_(1000);

 for(auto&& i : container_){
    i.myDouble()=static_cast<double>(argc);
}

and we can write generic and efficient code regardless of the memory layout underneath. What's left to do is to check that this is a zero cost abstraction. The easiest way for me to check that is using a debugger: compile the example with debug symbols on,

> clang++ -std=c++1z -O3 -g main.cpp -o test

run it with gdb, set a brakpoint in the for loop, and step through the assembly instructions (the layout split command shows the source code and disassembled instructions at the same time)

> gdb test
(gdb) break main.cpp : 10 # set breakpoint inside the loop
(gdb) run # execute until the breakpoint
(gdb) layout split # show assembly and source code in 2 separate frames
(gdb) stepi # execute one instruction

The instructions being executed inside the loop are in case of AoS data layout

0x400b00 <main(int, char**)+192>        movsd  %xmm0,(%rsi)
0x400b04 <main(int, char**)+196>        add    $0x610,%rsi
0x400b0b <main(int, char**)+203>        add    $0xffffffffffffffff,%rcx
0x400b0f <main(int, char**)+207>        jne    0x400b00 <main(int, char**)+192>

Notice in particular that in the second line the offset being add to compute the address is 0x160. This changes depending on the size of the data members in the item object. On the other hand for the SoA data structure we have

0x400b60 <main(int, char**)+224>        movups %xmm1,(%rdi,%rsi,8)
0x400b64 <main(int, char**)+228>        movups %xmm1,0x10(%rdi,%rsi,8)
0x400b69 <main(int, char**)+233>        movups %xmm1,0x20(%rdi,%rsi,8)
0x400b6e <main(int, char**)+238>        movups %xmm1,0x30(%rdi,%rsi,8)
0x400b73 <main(int, char**)+243>        movups %xmm1,0x40(%rdi,%rsi,8)
0x400b78 <main(int, char**)+248>        movups %xmm1,0x50(%rdi,%rsi,8)
0x400b7d <main(int, char**)+253>        movups %xmm1,0x60(%rdi,%rsi,8)
0x400b82 <main(int, char**)+258>        movups %xmm1,0x70(%rdi,%rsi,8)
0x400b87 <main(int, char**)+263>        movups %xmm1,0x80(%rdi,%rsi,8)
0x400b8f <main(int, char**)+271>        movups %xmm1,0x90(%rdi,%rsi,8)
0x400b97 <main(int, char**)+279>        movups %xmm1,0xa0(%rdi,%rsi,8)
0x400b9f <main(int, char**)+287>        movups %xmm1,0xb0(%rdi,%rsi,8)
0x400ba7 <main(int, char**)+295>        movups %xmm1,0xc0(%rdi,%rsi,8)
0x400baf <main(int, char**)+303>        movups %xmm1,0xd0(%rdi,%rsi,8)
0x400bb7 <main(int, char**)+311>        movups %xmm1,0xe0(%rdi,%rsi,8)
0x400bbf <main(int, char**)+319>        movups %xmm1,0xf0(%rdi,%rsi,8)
0x400bc7 <main(int, char**)+327>        add    $0x20,%rsi
0x400bcb <main(int, char**)+331>        add    $0x8,%rbx
0x400bcf <main(int, char**)+335>        jne    0x400b60 <main(int, char**)+224>

We see the loop is unrolled and vectorized by Clang (version 6.0.0), and the increment for the address is 0x20, independent of the number of data members present in the item struct.

这篇关于SoA / AoS内存布局的C ++零成本抽象的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆