Cython: storing unicode in numpy array


Question

I'm new to Cython, and I've been having a recurring problem involving encoding unicode inside of a numpy array.

Here's an example of the problem:

import numpy as np
cimport numpy as np

cpdef pass_array(np.ndarray[ndim=1,dtype=np.unicode] a):
    pass

cpdef access_unicode_item(np.ndarray a):
    cdef unicode item = a[0]

Example errors:

In [3]: unicode_array = np.array([u"array",u"of",u"unicode"],dtype=np.unicode)

In [4]: pass_array(unicode_array)
ValueError: Does not understand character buffer dtype format string ('w')

In [5]: access_unicode_item(unicode_array)
TypeError: Expected unicode, got numpy.unicode_

The problem seems to be that the values are not real unicode objects but numpy.unicode_ instead. Is there a way to encode the values in the array as proper unicode (so that I can type individual items for use in Cython code)?

Answer

In Python 2.7:

In [375]: arr=np.array([u"array",u"of",u"unicode"],dtype=np.unicode)

In [376]: arr
Out[376]: 
array([u'array', u'of', u'unicode'], 
      dtype='<U7')

In [377]: arr.dtype
Out[377]: dtype('<U7')

In [378]: type(arr[0])
Out[378]: numpy.unicode_

In [379]: type(arr[0].item())
Out[379]: unicode

In general, x[0] returns an element of x wrapped in a NumPy scalar type. In this case np.unicode_ is a subclass of unicode.

In [384]: isinstance(arr[0],np.unicode_)
Out[384]: True

In [385]: isinstance(arr[0],unicode)
Out[385]: True

I think you'd encounter the same sort of issues between np.int32 and int, but I haven't worked enough with Cython to be sure.

Where have you seen Cython code that specifies a string (unicode or byte) dtype?

http://docs.cython.org/src/tutorial/numpy.html has expressions like:

# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.int
# "ctypedef" assigns a corresponding compile-time type to DTYPE_t. For
# every type in the numpy module there's a corresponding compile-time
# type with a _t-suffix.
ctypedef np.int_t DTYPE_t
....
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f):

The purpose of the [] part is to improve indexing efficiency.

What we need to do then is to type the contents of the ndarray objects. We do this with a special "buffer" syntax which must be told the datatype (first argument) and number of dimensions ("ndim" keyword-only argument, if not provided then one-dimensional is assumed).

I don't think np.unicode will help because it doesn't specify the character length. The full string dtype has to include the number of characters, e.g. <U7 in my example.
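
To make the point about character length concrete, here is a small sketch, assuming Python 3 and NumPy, that compares the generic unicode dtype with the fixed-width one NumPy infers from the data. The names generic, fixed and short are hypothetical.

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])

# NumPy infers a fixed width from the longest string ("unicode", 7 chars).
print(arr.dtype)           # a U7 dtype, shown with a byte-order prefix
print(arr.dtype.itemsize)  # bytes per element: 7 chars * 4 bytes

# The generic unicode dtype carries no length, so its itemsize is zero.
generic = np.dtype(np.str_)
print(generic.itemsize)

# A complete string dtype names the width explicitly.
fixed = np.dtype("U7")

# A width that is too small silently truncates the data.
short = np.array(["unicode"], dtype="U3")
print(short[0])
```

Because every element occupies the same number of bytes, a compile-time declaration would need to know that width, which the bare np.unicode type does not provide.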

We need to find working examples that pass string arrays, either in the Cython documentation or in other SO Cython questions.

For some operations, you could treat the unicode array as an array of int32.

In [397]: arr.nbytes
Out[397]: 84

3 strings x 7 chars/string x 4 bytes/char = 84 bytes

In [398]: arr.view(np.int32).reshape(-1,7)
Out[398]: 
array([[ 97, 114, 114,  97, 121,   0,   0],
       [111, 102,   0,   0,   0,   0,   0],
       [117, 110, 105,  99, 111, 100, 101]])
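
The view above can be wrapped in a pair of helpers that go to code points and back. This is only a sketch under stated assumptions: Python 3, a one-dimensional array with a fixed-width unicode dtype, and hypothetical helper names to_codepoints and from_codepoints. It uses uint32 rather than int32 so that code points are never read as negative numbers.

```python
import numpy as np

def to_codepoints(arr):
    """View a 1-D fixed-width unicode array as an (n, width) uint32 matrix."""
    arr = np.ascontiguousarray(arr)
    width = arr.dtype.itemsize // 4   # NumPy stores 4 bytes per character
    return arr.view(np.uint32).reshape(-1, width)

def from_codepoints(codes):
    """Rebuild the 1-D unicode array from an (n, width) code point matrix."""
    codes = np.ascontiguousarray(codes, dtype=np.uint32)
    width = codes.shape[1]
    return codes.view(f"U{width}").reshape(-1)

arr = np.array(["array", "of", "unicode"])
codes = to_codepoints(arr)
restored = from_codepoints(codes)
print(codes)
print(restored)
```

A two-dimensional integer matrix like codes is something Cython's buffer syntax can type directly, which is the reason for considering this representation.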

Cython gives you the greatest speed improvement when you can bypass Python functions and methods. That would include bypassing much of the Python string and unicode functionality.
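
As an illustration of working without Python string methods, the following sketch computes string lengths and a first-character test purely with integer operations on the code points. It assumes Python 3, a contiguous one-dimensional unicode array, and strings that contain no embedded NUL characters (NumPy pads short strings with zeros). The names lengths and starts_with_u are hypothetical.

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])

# Reinterpret the fixed-width unicode buffer as one row of code points per string.
width = arr.dtype.itemsize // 4
codes = arr.view(np.uint32).reshape(-1, width)

# Padding is zero-filled, so counting non-zero entries gives each string's length.
lengths = np.count_nonzero(codes, axis=1)

# Compare the first code point of every string against ord("u").
starts_with_u = codes[:, 0] == ord("u")

print(lengths)
print(starts_with_u)
```

The same loops over codes could be written in typed Cython code, since they only touch integers.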
