NumPy: view the contiguous part of a non-contiguous array as a dtype of bigger size
Question
I was trying to generate an array of trigrams (i.e. combinations of three consecutive letters) from a very long char array:
import numpy as np
# data is actually loaded from a source file
a = np.random.randint(0, 256, 2**28, 'B').view('c')
Since making a copy is not efficient (and it creates problems such as cache misses), I generated the trigrams directly using stride tricks:
tri = np.lib.stride_tricks.as_strided(a, (len(a)-2,3), a.strides*2)
This generates a trigram array with shape (2**28-2, 3), where each row is a trigram. Now I want to convert the trigrams to an array of strings (i.e. dtype S3) so that numpy displays them more "reasonably" (instead of as individual chars).
tri = tri.view('S3')
It raises an exception:
ValueError: To change to a dtype of a different size, the array must be C-contiguous
I understand that data should generally be contiguous in order to create a meaningful view, but this data is contiguous "where it should be": the three elements of each row are contiguous in memory.
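To make the layout concrete, here is a minimal sketch on a 7-byte stand-in for the real data (the byte string and the variable names are illustrative assumptions). It shows that the strided array as a whole is not C-contiguous, even though the three bytes of each row sit next to each other in memory.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Tiny stand-in for the 2**28-byte array (assumption: any byte data works).
a = np.frombuffer(b"abcdefg", dtype="S1")

# Same construction as in the question: overlapping rows of 3 bytes.
tri = as_strided(a, shape=(len(a) - 2, 3), strides=a.strides * 2)

print(tri.shape)                  # (5, 3)
print(tri.strides)                # (1, 1): each row starts 1 byte after the previous one
print(tri.flags["C_CONTIGUOUS"])  # False: a C-contiguous (5, 3) array would need strides (3, 1)
print(tri[1].tobytes())           # b'bcd': the bytes within one row are adjacent
```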
So I'm wondering how to view the contiguous part of a non-contiguous np.ndarray as a dtype of bigger size. A more "standard" way would be better, but hackish ways are also welcome. It seems that I can set shape and strides freely with np.lib.stride_tricks.as_strided, but I can't force the dtype to be something else, which is the problem here.
Edit
Non-contiguous arrays can be made by simple slicing. For example:
np.empty((8, 4), 'uint32')[:, :2].view('uint64')
will throw the same exception as above (while from a memory point of view I should be able to do this). This case is much more common than my example above.
Recommended answer
If you have access to a contiguous array from which your non-contiguous one is derived, it should typically be possible to work around this limitation.
For example, your trigrams can be obtained like so:
>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b')', b'\xf2', b'\xf7', ..., b'\xf4', b'\xf1', b'z'], dtype='|S1')
>>> np.lib.stride_tricks.as_strided(a[:0].view('S3'), ((2**28)-2,), (1,))
array([b')\xf2\xf7', b'\xf2\xf7\x14', b'\xf7\x14\x1b', ...,
b'\xc9\x14\xf4', b'\x14\xf4\xf1', b'\xf4\xf1z'], dtype='|S3')
In fact, this example demonstrates that all we need for view casting is a contiguous "stub" at the base of the memory buffer. After that, because as_strided does not do many checks, we are essentially free to do whatever we like.
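The same steps can be checked on a small array whose contents are known. This is a minimal sketch, assuming a 7-byte stand-in for the real data; the names stub and tri are illustrative. Because as_strided does no bounds checking, the shape and strides must be chosen so that every item stays inside the original buffer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Tiny stand-in for the 2**28-byte array (assumption: any byte data works).
a = np.frombuffer(b"abcdefg", dtype="S1")

# A zero-length slice still points at the start of the buffer and counts as
# contiguous, so changing its dtype to the 3-byte string type is accepted.
stub = a[:0].view("S3")

# Re-expand the stub: len(a) - 2 items of 3 bytes each, advancing 1 byte per
# item. The last item covers bytes 4..6, so nothing reads past the buffer.
tri = as_strided(stub, shape=(len(a) - 2,), strides=(1,))

print(tri)  # [b'abc' b'bcd' b'cde' b'def' b'efg']
```

No data is copied here: tri reads directly from the memory of a.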
It seems we can always get such a stub by slicing to a size-0 array. For your second example:
>>> X = np.empty((8, 4), 'uint32')[:, :2]
>>> np.lib.stride_tricks.as_strided(X[:0].view(np.uint64), (8, 1), X.strides)
array([[140133325248280],
[ 32],
[ 32083728],
[ 31978800],
[ 0],
[ 29686448],
[ 32],
[ 32362720]], dtype=uint64)
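The output above comes from uninitialised memory, so it is hard to verify by eye. The following sketch repeats the idea with known values and compares the result with an explicit copy; the arange-based input and the names base, stub, Y and expected are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Known contents instead of np.empty, so the result can be checked.
base = np.arange(32, dtype=np.uint32).reshape(8, 4)
X = base[:, :2]  # non-contiguous: rows are 16 bytes apart, only 8 bytes of each are used

# Size-0 stub at the start of X's memory, reinterpreted as uint64.
stub = X[:0].view(np.uint64)

# One uint64 per row, keeping X's row stride of 16 bytes.
Y = as_strided(stub, shape=(8, 1), strides=X.strides)

# Reference result: copy X into contiguous memory first, then view it.
expected = np.ascontiguousarray(X).view(np.uint64)
print(np.array_equal(Y, expected))  # True
```

Comparing against the copied version avoids hard-coding values that depend on the machine's byte order.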