Vectorized string operations in Numpy: why are they rather slow?


Question

This is one of those "mostly asked out of pure curiosity (in possibly futile hope I will learn something)" questions.

I was investigating ways of saving memory on operations on massive numbers of strings, and for some scenarios it seems like string operations in numpy could be useful. However, I got somewhat surprising results:

import random
import string

import numpy as np

milstr = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]

npmstr = np.array(milstr, dtype=np.dtype(np.unicode_, 1000000))

Memory consumption, measured with memory_profiler:

%memit [x.upper() for x in milstr]
peak memory: 420.96 MiB, increment: 61.02 MiB

%memit np.core.defchararray.upper(npmstr)
peak memory: 391.48 MiB, increment: 31.52 MiB

So far, so good; however, timing results are surprising for me:

%timeit [x.upper() for x in milstr]
129 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.core.defchararray.upper(npmstr)
373 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why is that? I expected that since Numpy uses contiguous chunks of memory for its arrays AND its operations are vectorized (as the above numpy doc page says) AND numpy string arrays apparently use less memory so operating on them should at least potentially be more on-CPU cache-friendly, performance on arrays of strings would be at least similar to those in pure Python?

Environment:

Python 3.6.3 x64, Linux

numpy==1.14.1

Answer

Vectorized is used in two ways when talking about numpy, and it's not always clear which is meant:

  1. Operations that are applied to every element of an array at once
  2. Operations that internally call optimized (and in many cases multithreaded) numerical code

The second point is what makes vectorized operations much faster than a for loop in python, and the multithreaded part is what makes them faster than a list comprehension. When commenters here state that vectorized code is faster, they're referring to the second case as well. However, in the numpy documentation, vectorized only refers to the first case. It means you can use a function directly on an array, without having to loop through all the elements and call it on each element. In this sense it makes code more concise, but isn't necessarily any faster. Some vectorized operations do call multithreaded code, but as far as I know this is limited to linear algebra routines. Personally, I prefer using vectorized operations since I think it is more readable than list comprehensions, even if performance is identical.
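A minimal sketch of the two senses (the function names here are just for illustration):

import numpy as np

a = np.arange(1_000_000)

# Vectorized in sense 1 only: np.vectorize gives array-in, array-out
# syntax, but under the hood it still calls the Python function once
# per element, so it runs at roughly Python-loop speed.
py_square = np.vectorize(lambda x: x * x)
b = py_square(a)

# Vectorized in sense 2 as well: np.square is a ufunc whose inner loop
# is compiled C running over the contiguous buffer.
c = np.square(a)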

Now, for the code in question, the documentation for np.char (which is an alias for np.core.defchararray) states:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.

So there are four ways (one not recommended) to handle strings in numpy. Some testing is in order, since certainly each way will have different advantages and disadvantages. Using arrays defined as follows:

npob = np.array(milstr, dtype=np.object_)
npuni = np.array(milstr, dtype=np.unicode_)
npstr = np.array(milstr, dtype=np.string_)
npchar = npstr.view(np.chararray)
npcharU = npuni.view(np.chararray)

This creates arrays (or chararrays for the last two) with the following datatypes:

In [68]: npob.dtype
Out[68]: dtype('O')

In [69]: npuni.dtype
Out[69]: dtype('<U10')

In [70]: npstr.dtype
Out[70]: dtype('S10')

In [71]: npchar.dtype
Out[71]: dtype('S10')

In [72]: npcharU.dtype
Out[72]: dtype('<U10')

The benchmarks give quite a range of performance across these datatypes:

%timeit [x.upper() for x in test]
%timeit np.char.upper(test)

# test = milstr
103 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
377 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npob
110 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
<error on second test, vectorized operations don't work with object arrays>

# test = npuni
295 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npstr
125 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npchar
663 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
127 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npcharU
887 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
325 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Surprisingly, using a plain old list of strings is still the fastest. Numpy is competitive when the datatype is string_ or object_, but once unicode is included performance becomes much worse. The chararray is by far the slowest, whether handling unicode or not. It should be clear why it's not recommended for use.
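As an aside, the missing npob entry in the benchmark above is because the np.char functions check the input dtype and refuse object arrays; a quick reproduction (the exact error message may vary across numpy versions):

import numpy as np

npob = np.array(['abc', 'def'], dtype=np.object_)
np.char.upper(npob)
# TypeError: string operation on non-string array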

Using unicode strings has a significant performance penalty. The docs state the following about the differences between these types:

For backward compatibility with Python 2 the S and a typestrings remain zero-terminated bytes and np.string_ continues to map to np.bytes_. To use actual strings in Python 3 use U or np.unicode_. For signed bytes that do not need zero-termination b or i1 can be used.
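One concrete way to see the cost behind that quote is the per-element storage of the arrays defined earlier (values assume 64-bit Python; object arrays store only a pointer per element):

print(npuni.itemsize)  # 40: '<U10' stores 4 bytes per character
print(npstr.itemsize)  # 10: 'S10' stores 1 byte per character
print(npob.itemsize)   # 8:  just a pointer to the Python str object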

In this case, where the character set does not require unicode, it would make sense to use the faster string_ type. If unicode is needed, you may get better performance by using a list, or a numpy array of type object_ if other numpy functionality is needed. Another good example of when a list may be better is appending lots of data, as sketched below.
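A short sketch of that appending point, under the assumption that data arrives one item at a time:

import numpy as np

# Fast: Python lists resize geometrically, so append is amortized O(1);
# convert to an array once at the end if numpy functionality is needed.
items = []
for i in range(10_000):
    items.append(str(i))
arr = np.array(items, dtype=np.object_)

# Slow: np.append allocates a fresh array and copies every existing
# element on each call, making the whole loop quadratic.
# arr = np.array([], dtype=np.object_)
# for i in range(10_000):
#     arr = np.append(arr, str(i))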

So, the takeaways are:

  1. Python, while generally regarded as slow, excels at some common things. Numpy is generally quite fast, but is not optimized for everything.
  2. Read the docs. If there's more than one way of doing things (and there usually is), odds are one way is better suited to what you're trying to do.
  3. Don't blindly assume that vectorized code will be faster: always profile when you care about performance (this applies to any "optimization" tips).
