Cython specify numpy array of fixed length strings
Question
I have a function that I'd like to use Cython with that involves processing large numbers of fixed-length strings. For a standard cython function, I can declare the types of arrays like so:
    import numpy as np

    cpdef double[:] g(double[:] in_arr):
        cdef double[:] out_arr = np.zeros(in_arr.shape, dtype='float64')
        cdef i
        for i in range(len(in_arr)):
            out_arr[i] = in_arr[i]
        return out_arr
This compiles and works as expected when the dtype is something simple like int32, float, double, etc. However, I cannot figure out how to create a typed memoryview of fixed-length strings, i.e. the equivalent of np.dtype('a5'), for example.
If I use this:
    cpdef str[:] f(str[:] in_arr):
        # arr should be a numpy array of 5-character strings
        cdef str[:] out_arr = np.zeros(in_arr.shape, dtype='a5')
        cdef i
        for i in range(len(in_arr)):
            out_arr[i] = in_arr[i]
        return out_arr
The function compiles, but this:
    in_arr = np.array(['12345', '67890', '22343'], dtype='a5')
    f(in_arr)
raises the following error:
    ---> 16 cpdef str[:] f(str[:] in_arr):
         17     # arr should be a numpy array of 5-character strings
         18     cdef str[:] out_arr = np.zeros(in_arr.shape, dtype='a5')

    ValueError: Buffer dtype mismatch, expected 'unicode object' but got a string
Similarly, if I use bytes[:], it gives the error "Buffer dtype mismatch, expected 'bytes object' but got a string", and this doesn't even get to the issue that nowhere am I specifying that these strings have length 5.
Interestingly, I can include fixed-length strings in a structured type as in this question, but I don't think that's the right way to declare the types.
Recommended answer
In a Python 3 session, your a5 array contains bytestrings.
    In [165]: np.array(['12345', '67890', '22343'], dtype='a5')
    Out[165]:
    array([b'12345', b'67890', b'22343'],
          dtype='|S5')
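To see why neither str[:] nor bytes[:] matches, it can help to inspect the two fixed-length string dtypes in plain NumPy. The following is a minimal sketch; the variable names are illustrative, and 'S5' is used because that is how the output above reports the 'a5' dtype.

```python
import numpy as np

# Fixed-length bytestrings: 5 raw bytes per element.
byte_arr = np.array(['12345', '67890', '22343'], dtype='S5')
# Fixed-length unicode strings: 5 code points of 4 bytes each per element.
uni_arr = np.array(['12345', '67890', '22343'], dtype='U5')

print(byte_arr.dtype.kind, byte_arr.dtype.itemsize)
print(uni_arr.dtype.kind, uni_arr.dtype.itemsize)
print(type(byte_arr[0]), type(uni_arr[0]))
```

In both cases the array buffer holds the character data inline rather than references to Python objects, which is presumably why a memoryview declared over Python str or bytes objects rejects it.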
http://cython.readthedocs.io/en/latest/src/tutorial/strings.html says that str is the unicode string type when compiled with Python 3.
I suspect that np.array(['12345', '67890', '22343'], dtype='U5') would be accepted as the input array for your function. But copying to the a5 out_arr would have problems.
An object version of this loop works:
    cpdef str[:] objcopy(str[:] in_arr):
        cdef str[:] out_arr = np.zeros(in_arr.shape[0], dtype=object)
        cdef int N
        N = in_arr.shape[0]
        for i in range(N):
            out_arr[i] = in_arr[i]
        return out_arr

    narr = np.array(['one','two','three'], dtype=object)
    cpy = objcopy(narr)
    print(cpy)
    print(np.array(cpy))
    print(np.array(objcopy(np.array([None,'one', 23.4]))))
These functions return a memoryview, which has to be converted to array to print.
Single byte memoryview copy:
    cpdef char[:] chrcopy(char[:] in_arr):
        cdef char[:] out_arr = np.zeros(in_arr.shape[0], dtype='uint8')
        cdef int N
        N = in_arr.shape[0]
        for i in range(N):
            out_arr[i] = in_arr[i]
        return out_arr

    print(np.array(chrcopy(np.array([b'one',b'two',b'three']).view('S1'))).view('S5'))
Uses view to convert strings to single bytes and back.
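The view round trip can be checked in plain NumPy, independent of the Cython function. This is a sketch with illustrative variable names, and the .copy() only stands in for whatever the compiled function would return.

```python
import numpy as np

words = np.array([b'one', b'two', b'three'])      # dtype becomes S5 (longest item)

# Reinterpret the same buffer as 15 single bytes; no data is copied.
as_bytes = words.view('S1')

# The same bytes as integers, one row of 5 per original string.
as_uint8 = words.view(np.uint8).reshape(len(words), -1)

# Stand-in for the byte-wise copy, then reinterpret as 5-byte strings again.
restored = as_bytes.copy().view('S5')

print(as_bytes.shape, as_uint8.shape)
print(as_uint8[0])
print(restored)
```

Shorter strings are padded with null bytes up to the fixed length, which is why the first row ends in zeros.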
I looked into this issue last year: Cython: storing unicode in numpy array
This processes unicode strings as though they were rows of a 2d int array; reshape is needed before and after.
    cpdef int[:,:] int2dcopy(int[:,:] in_arr):
        # np.intc matches the C int element type of the memoryview
        cdef int[:,:] out_arr = np.zeros((in_arr.shape[0], in_arr.shape[1]), dtype=np.intc)
        cdef int N
        N = in_arr.shape[0]
        for i in range(N):
            out_arr[i,:] = in_arr[i,:]
        return out_arr

    narr = np.array(['one','two','three', 'four', 'five'], dtype='U5')
    # U5 stores 4 bytes per character, so view it as 32-bit C ints
    cpy = int2dcopy(narr.view(np.intc).reshape(-1,5))
    print(cpy)
    print(np.array(cpy))
    print(np.array(cpy).view(narr.dtype)) # .reshape(-1)
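The reshape before and after can be tried in plain NumPy as well. This sketch assumes the usual 4-bytes-per-character layout of a 'U' dtype; the names codes and restored are illustrative, and .copy() stands in for the Cython copy.

```python
import numpy as np

narr = np.array(['one', 'two', 'three', 'four', 'five'], dtype='U5')

# Each U5 element is 5 code points of 4 bytes, so a 32-bit integer view
# gives 25 values; reshape to one row of 5 code points per string.
codes = narr.view(np.uint32).reshape(-1, 5)

# Stand-in for the 2d copy, then view each row of 5 ints as one U5 string.
# The view has shape (5, 1), so flatten it back to 1d.
restored = codes.copy().view('U5').reshape(-1)

print(codes[0])
print(restored)
```

Using an explicitly 32-bit integer type matters here: a 64-bit view would not divide the 20-byte elements evenly.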
For bytestrings a similar 2d char version should work.
    byte5 = cython.struct(x=cython.char[5])

    cpdef byte5[:] byte5copy(byte5[:] in_arr):
        cdef byte5[:] out_arr = np.zeros(in_arr.shape[0], dtype='|S5')
        cdef int N
        N = in_arr.shape[0]
        for i in range(N):
            out_arr[i] = in_arr[i]
        return out_arr

    narr = np.array(['one','four','six'], dtype='|S5')
    cpy = byte5copy(narr)
    print(cpy)
    print(repr(np.array(cpy)))
    # array([b'one', b'four', b'six'], dtype='|S5')
The C struct is creating a memoryview with 5-byte elements, which map onto the array's S5 elements.
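The same idea, one 5-byte record per string, can be illustrated on the NumPy side with a structured dtype. This is only a sketch of the memory layout, not the Cython declaration itself; the field name x mirrors the struct above and the variable names are illustrative.

```python
import numpy as np

narr = np.array(['one', 'four', 'six'], dtype='S5')

# A structured dtype with a single 5-byte field has the same item size as S5,
# so the array can be reinterpreted without copying.
rec_dtype = np.dtype([('x', 'S5')])
rec = narr.view(rec_dtype)

print(rec.dtype.itemsize, rec.shape)
print(rec['x'])
```

Because both layouts are 5 bytes per element, viewing in either direction leaves the underlying buffer untouched.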
https://github.com/cython/cython/blob/master/tests/memoryview/numpy_memoryview.pyx also has a structured array example with bytestrings.