Converting a series of ints to strings - Why is apply much faster than astype?
Question
I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a Series object:
import numpy as np
import pandas as pd
x = pd.Series(np.random.randint(0, 100, 1000000))
On StackOverflow and other websites, I've seen most people argue that the best way to do this is:
%%timeit
x = x.astype(str)
This takes about 2 seconds.
When I use x = x.apply(str), it only takes 0.2 seconds.
Why is x.astype(str) so slow? Should the recommended way be x.apply(str)?
I'm mainly interested in Python 3's behavior for this.
Recommended answer
Performance
It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x)) appears to be slower than x.apply(str).
import pandas as pd, numpy as np
### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
x = pd.Series(np.random.randint(0, 100, 100000))
%timeit x.apply(str) # 42ms (1)
%timeit x.map(str) # 42ms (2)
%timeit x.astype(str) # 559ms (3)
%timeit [str(i) for i in x] # 566ms (4)
%timeit list(map(str, x)) # 536ms (5)
%timeit x.values.astype(str) # 25ms (6)
Points worth noting:
- (5) is marginally quicker than (3) / (4), which we expect as more work is moved into C [assuming no lambda function is used].
- (6) is by far the fastest.
- (1) / (2) are similar.
- (3) / (4) are similar.
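The timings above come from IPython's %timeit on specific library versions, so the absolute numbers (and possibly the ranking) may differ on other setups. A minimal sketch for re-running the same six candidates with the standard timeit module follows; the smaller Series size, the `candidates` dict, and the repeat counts are assumptions chosen to keep the run short, not part of the original benchmark.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the original 100000 elements so the sketch runs quickly.
x = pd.Series(np.random.randint(0, 100, 10000))

# The six conversions compared in the answer, wrapped as zero-argument callables.
candidates = {
    "x.apply(str)": lambda: x.apply(str),
    "x.map(str)": lambda: x.map(str),
    "x.astype(str)": lambda: x.astype(str),
    "[str(i) for i in x]": lambda: [str(i) for i in x],
    "list(map(str, x))": lambda: list(map(str, x)),
    "x.values.astype(str)": lambda: x.values.astype(str),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats, 5 calls each, reported in milliseconds per call.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    timings[label] = best * 1000
    print(f"{label:<22} {timings[label]:8.2f} ms")
```

Taking the minimum over repeats reduces noise from other processes; the results are only meaningful relative to one another on the same machine.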
Why is x.map / x.apply fast?
This appears to be because it uses fast compiled Cython code:
cpdef ndarray[object] astype_str(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, str(arr[i]))

    return result
Why is x.astype(str) slow?
Pandas applies str to each item in the series, not using the above Cython.
Hence performance is comparable to [str(i) for i in x] / list(map(str, x)).
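Whatever the speed difference, the three routes should agree on the result. The sketch below is an illustrative check, not part of the original answer; the variable names are assumptions, and it compares only the element values because the resulting dtype can differ between pandas versions.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 1000))

via_astype = x.astype(str)
via_apply = x.apply(str)
via_loop = [str(i) for i in x]

# All three routes produce the same Python strings, element for element.
same_values = list(via_astype) == list(via_apply) == via_loop
print(same_values)

# Each element of the converted Series is a plain Python str.
print(type(via_astype.iloc[0]))
```

Since the outputs match, choosing between these approaches is a question of speed and readability rather than correctness.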
Why is x.values.astype(str) so fast?
Numpy does not apply a function on each element of the array. One description of this I found:
If you did s.values.astype(str) what you get back is an object holding int. This is numpy doing the conversion, whereas pandas iterates over each item and calls str(item) on it. So if you do s.astype(str) you have an object holding str.
There is a technical reason why the numpy version hasn't been implemented in the case of no-nulls.
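To see what the NumPy route actually returns, the array's dtype can be inspected directly. This is a small sketch under the assumption of an integer Series like the one in the question: NumPy produces a fixed-width unicode array (dtype kind "U") instead of an array of Python objects, while the contents match the strings built one element at a time.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 1000))

arr = x.values.astype(str)

# NumPy returns a fixed-width unicode array (dtype kind "U"),
# not an array of Python objects.
print(arr.dtype, arr.dtype.kind)

# The contents still match the strings built one element at a time.
matches = arr.tolist() == [str(i) for i in x]
print(matches)
```

If downstream tools expect a pandas Series, the array can be wrapped again with pd.Series(arr, index=x.index), at the cost of an extra conversion step.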