Converting numpy arrays of code points to and from strings


Question

I have a long unicode string:

import random
import re
import numpy as np
alphabet = range(0x0FFF)
mystr = ''.join(chr(random.choice(alphabet)) for _ in range(100))
mystr = re.sub(r'\W', '', mystr)  # raw string, so \W is not treated as a string escape

I would like to view it as a series of code points, so at the moment I am doing the following:

arr = np.array(list(mystr), dtype='U1')

I would like to be able to manipulate the string as numbers, and eventually get some different code points back. Now I'd like to invert the transformation:

mystr = ''.join(arr.tolist())

These transformations are reasonably fast and invertible, but the intermediate list takes up an unnecessary amount of space.

Is there a way to convert a numpy array of unicode characters to and from a Python string without converting to a list first?

Afterthought

I can get arr to appear as a single string with something like

buf = arr.view(dtype='U' + str(arr.size))

This results in a 1-element array containing the entire original string. The inverse is possible as well:

buf.view(dtype='U1')

The only issue is that the type of the result is np.str_, not str.

Recommended answer

fromiter works, but it is really slow, since it goes through the iterator protocol. It's much faster to encode your data to UTF-32 (in system byte order) and use numpy.frombuffer:

In [56]: x = ''.join(chr(random.randrange(0x0fff)) for i in range(1000))

In [57]: codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

In [58]: %timeit numpy.frombuffer(bytearray(x, codec), dtype='U1')
2.79 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [59]: %timeit numpy.fromiter(x, dtype='U1', count=len(x))
122 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [60]: numpy.array_equal(numpy.fromiter(x, dtype='U1', count=len(x)), numpy.frombuffer(bytearray(x, codec), dtype='U1'))
Out[60]: True

I've used sys.byteorder to determine whether to encode in utf-32-le or utf-32-be. Also, passing the string to bytearray instead of calling encode produces a mutable bytearray rather than an immutable bytes object, so the resulting array is writable.
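The forward conversion can be wrapped in a small helper. This is only a sketch: the names CODEC and str_to_chars are made up for illustration, and it assumes the string contains no lone surrogates (which UTF-32 cannot encode). It also shows the same memory viewed as uint32, which is one way to manipulate the characters as numbers.

```python
import sys
import numpy as np

# UTF-32 in the machine's native byte order (the explicit -le/-be codecs add no BOM)
CODEC = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

def str_to_chars(s):
    """Hypothetical helper: writable 1-D 'U1' array, one element per code point."""
    return np.frombuffer(bytearray(s, CODEC), dtype='U1')

chars = str_to_chars('héllo')
codes = chars.view(np.uint32)   # same memory, seen as integer code points
codes[0] -= 32                  # 'h' (0x68) becomes 'H' (0x48), in place

# For comparison: an immutable bytes object yields a read-only array
read_only = np.frombuffer('abc'.encode(CODEC), dtype='U1')
```

Because codes is a view rather than a copy, the change to codes[0] is visible through chars as well.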

As for the reverse conversion, arr.view(dtype=f'U{arr.size}')[0] works, but using item() is a bit faster and produces an ordinary string object, avoiding possible weird edge cases where numpy.str_ doesn't quite behave like str:

In [72]: a = numpy.frombuffer(bytearray(x, codec), dtype='U1')

In [73]: type(a.view(dtype=f'U{a.size}')[0])
Out[73]: numpy.str_

In [74]: type(a.view(dtype=f'U{a.size}').item())
Out[74]: str

In [75]: %timeit a.view(dtype=f'U{a.size}')[0]
3.63 µs ± 34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [76]: %timeit a.view(dtype=f'U{a.size}').item()
2.14 µs ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
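Putting both directions together, a round trip might look like the sketch below. The helper names are hypothetical, and two guards are my own additions rather than part of the answer: an early return for empty arrays, and np.ascontiguousarray so that strided inputs (for example a reversed slice) can still be reinterpreted as one long string.

```python
import sys
import numpy as np

CODEC = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

def str_to_chars(s):
    """Hypothetical helper: str -> writable 'U1' array."""
    return np.frombuffer(bytearray(s, CODEC), dtype='U1')

def chars_to_str(arr):
    """Hypothetical helper: 1-D 'U1' array -> plain Python str."""
    if arr.size == 0:
        return ''                      # assumption: sidestep a zero-width dtype
    arr = np.ascontiguousarray(arr)    # copies only if the input is strided
    # Reinterpret the n 4-byte characters as a single n-character string,
    # then unwrap the one element as a builtin str via item()
    return arr.view(f'U{arr.size}').item()

original = 'naïve Ωmega'
restored = chars_to_str(str_to_chars(original))
backwards = chars_to_str(str_to_chars('abc')[::-1])
```

Note that item() drops trailing null characters, so this round trip is only lossless for strings that do not end in '\x00'.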


Finally, be aware that NumPy doesn't handle nulls like normal Python string objects do. NumPy can't distinguish between 'asdf\x00\x00\x00' and 'asdf', so using NumPy arrays for string operations is not safe if your data may contain null code points.
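A short illustration of the null problem, using made-up variable names. The string goes into a NumPy array with three trailing nulls and comes back without them:

```python
import numpy as np

padded = 'asdf\x00\x00\x00'      # 7 characters as a Python str
stored = np.array([padded])      # fixed-width unicode array
recovered = stored[0]            # trailing nulls are dropped on the way out

lengths = (len(padded), len(recovered))
```

Here lengths holds the length before and after the trip through NumPy, and recovered compares equal to plain 'asdf'.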
