Apache Arrow, alignment and padding

Question

I want to use Apache Arrow because it enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing (https://arrow.apache.org/).

From the documentation (https://arrow.apache.org/docs/memory_layout.html), I understand that memory is allocated with 64-byte alignment.

To verify this 64-byte alignment, I take the __array_interface__ data member of a NumPy array, which points to the data area storing the array contents, and compute its address modulo 64. If the result is 0, the memory address is aligned to at least 64 bytes.

When I execute the code below on my system (Fedora), it seems to work (the result of modulo 64 is zero), but when I execute the same code on my colleague's system (also Fedora), it does not: the result of modulo 64 is not zero, so the memory is not aligned on 64 bytes.

Here is my code:

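import pyarrow as pa

# Build an Arrow array of lists and convert it to Pandas/NumPy.
tab = pa.array([[1, 2], [3, 4]])
panda_array = tab.to_pandas()

# Address of the converted array's data area, and its remainder modulo 64.
address = panda_array.__array_interface__['data'][0]
print('numpy address {} modulo 64 => {}'.format(address, address % 64))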

Thank you for your help.

Recommended answer

The memory in Arrow is 64-byte aligned, but in your example code the conversion to Pandas/NumPy makes a copy of the data, because a nested array of lists is represented differently in Arrow and in NumPy. Arrow uses one buffer that holds the data of all lists and another buffer that holds the offsets of each list in that array. As NumPy has no native list type, the result is a NumPy array that contains other NumPy arrays as elements. These are stored in the outer NumPy array as Python objects.

Thus, using the NumPy functions, you see the memory as allocated by NumPy, not by Arrow. If your memory address happens to lie on a 64-byte boundary, it is only by chance.

In the next version (0.9) of pyarrow there will be a buffers property to access the underlying memory addresses. You should then be able to check directly whether the Arrow memory is allocated on a 64-byte-aligned address (it always should be).
