Converting numpy arrays of code points to and from strings
Question
I have a long unicode string:
import random
import re

alphabet = range(0x0FFF)
mystr = ''.join(chr(random.choice(alphabet)) for _ in range(100))
mystr = re.sub(r'\W', '', mystr)
I would like to view it as a series of code points, so at the moment, I am doing the following:
import numpy as np
arr = np.array(list(mystr), dtype='U1')
I would like to be able to manipulate the string as numbers, and eventually get some different code points back. Now I'd like to invert the transformation:
mystr = ''.join(arr.tolist())
These transformations are reasonably fast and invertible, but the intermediate list takes up an unnecessary amount of space.
Is there a way to convert a numpy array of unicode characters to and from a Python string without converting to a list first?
Afterthought
I can get arr to appear as a single string with something like
buf = arr.view(dtype='U' + str(arr.size))
This results in a 1-element array containing the entire original string. The inverse is possible as well:
buf.view(dtype='U1')
The only issue is that the type of the result is np.str_, not str.
Recommended answer
fromiter works, but is really slow, since it goes through the iterator protocol. It's much faster to encode your data to UTF-32 (in system byte order) and use numpy.frombuffer:
In [56]: x = ''.join(chr(random.randrange(0x0fff)) for i in range(1000))
In [57]: codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'
In [58]: %timeit numpy.frombuffer(bytearray(x, codec), dtype='U1')
2.79 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [59]: %timeit numpy.fromiter(x, dtype='U1', count=len(x))
122 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [60]: numpy.array_equal(numpy.fromiter(x, dtype='U1', count=len(x)), numpy.frombuffer(bytearray(x, codec), dtype='U1'))
Out[60]: True
I've used sys.byteorder to determine whether to encode in utf-32-le or utf-32-be. Also, using bytearray instead of encode gets a mutable bytearray instead of an immutable bytes object, so the resulting array is writable.
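The difference between the two buffer types can be checked directly. The following is a minimal sketch, not part of the original answer: str_to_array is a hypothetical helper name, and the uint32 view assumes the array uses the native byte order chosen by CODEC.

```python
import sys
import numpy as np

# UTF-32 without a BOM, in the machine's native byte order.
CODEC = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

def str_to_array(s):
    """Hypothetical helper: return a writable 'U1' array of the characters in s."""
    return np.frombuffer(bytearray(s, CODEC), dtype='U1')

text = 'héllo'
arr = str_to_array(text)
readonly = np.frombuffer(text.encode(CODEC), dtype='U1')

print(arr.flags.writeable)       # buffer is a mutable bytearray
print(readonly.flags.writeable)  # buffer is an immutable bytes object

# Each 'U1' element occupies 4 bytes, so the same memory can be
# reinterpreted as unsigned 32-bit code points without copying.
codes = arr.view(np.uint32)
codes[0] = ord('j')              # writes through to arr
print(arr)
```

Viewing the array as uint32 is one way to get the "string as numbers" the question asks for, since no data is copied and edits to the code points show up in the character array.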
As for the reverse conversion, arr.view(dtype=f'U{arr.size}')[0] works, but using item() is a bit faster and produces an ordinary string object, avoiding possible weird edge cases where numpy.str_ doesn't quite behave like str:
In [72]: a = numpy.frombuffer(bytearray(x, codec), dtype='U1')
In [73]: type(a.view(dtype=f'U{a.size}')[0])
Out[73]: numpy.str_
In [74]: type(a.view(dtype=f'U{a.size}').item())
Out[74]: str
In [75]: %timeit a.view(dtype=f'U{a.size}')[0]
3.63 µs ± 34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [76]: %timeit a.view(dtype=f'U{a.size}').item()
2.14 µs ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
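Wrapped up as a function, the reverse conversion might look like the sketch below. array_to_str is a hypothetical name, and the empty-array guard and the ascontiguousarray call are defensive additions that the original answer does not discuss.

```python
import sys
import numpy as np

CODEC = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

def array_to_str(arr):
    """Hypothetical helper: join a 1-D 'U1' array into a plain Python str."""
    if arr.size == 0:
        return ''
    # Changing the item size through view() needs contiguous memory.
    arr = np.ascontiguousarray(arr)
    return arr.view(dtype=f'U{arr.size}').item()

original = 'naïve string'
arr = np.frombuffer(bytearray(original, CODEC), dtype='U1')
result = array_to_str(arr)
print(type(result), result == original)
```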
Finally, be aware that NumPy doesn't handle nulls like normal Python string objects do. Its fixed-width string dtypes treat trailing nulls as padding, so NumPy can't distinguish between 'asdf\x00\x00\x00' and 'asdf'. Using NumPy arrays for string operations is therefore not safe if your data may contain null code points.
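A short sketch of the null-handling caveat, using the same two strings as the paragraph above (the variable names are illustrative only):

```python
import numpy as np

padded = 'asdf\x00\x00\x00'
stored = np.array([padded])      # fixed-width unicode array
round_tripped = stored[0]        # trailing nulls are dropped on read

print(len(padded))               # length of the original Python str
print(len(round_tripped))        # length after the round trip
print(round_tripped == 'asdf')
```

The three trailing null code points do not survive the round trip, so a string that ends in nulls cannot be recovered exactly from the array.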