Converting a series of ints to strings - Why is apply much faster than astype?


Question

I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a Series object:

import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 1000000))

On StackOverflow and other websites, I've seen most people argue that the best way to do this is:

%%timeit
x.astype(str)

This takes about 2 seconds.

When I use x = x.apply(str), it only takes 0.2 seconds.

Why is x.astype(str) so slow? Should the recommended way be x.apply(str)?

I'm mainly interested in Python 3's behavior for this.

Recommended answer

Performance

It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x)) appears to be slower than x.apply(str).

import pandas as pd, numpy as np

### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###

x = pd.Series(np.random.randint(0, 100, 100000))

%timeit x.apply(str)          # 42ms   (1)
%timeit x.map(str)            # 42ms   (2)
%timeit x.astype(str)         # 559ms  (3)
%timeit [str(i) for i in x]   # 566ms  (4)
%timeit list(map(str, x))     # 536ms  (5)
%timeit x.values.astype(str)  # 25ms   (6)
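
The %timeit lines above only run inside IPython or Jupyter. As a rough sketch of the same comparison for a plain Python script, the standard-library timeit module can be used instead. The smaller Series size, the repeat counts and the names `candidates` and `timings` are assumptions chosen for illustration, and absolute numbers will differ by machine and by pandas/numpy version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 100,000 elements used above so the script finishes quickly.
x = pd.Series(np.random.randint(0, 100, 10_000))

# The six approaches benchmarked above, wrapped as zero-argument callables.
candidates = {
    "x.apply(str)": lambda: x.apply(str),
    "x.map(str)": lambda: x.map(str),
    "x.astype(str)": lambda: x.astype(str),
    "[str(i) for i in x]": lambda: [str(i) for i in x],
    "list(map(str, x))": lambda: list(map(str, x)),
    "x.values.astype(str)": lambda: x.values.astype(str),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats of 5 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    timings[label] = best * 1000
    print(f"{label:<22} {timings[label]:8.2f} ms")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes. The ranking may not match the figures quoted above on newer library versions.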

Points worth noting:

  1. (5) is marginally quicker than (3) / (4), which we expect as more work is moved into C [assuming no lambda function is used].
  2. (6) is by far the fastest.
  3. (1) / (2) are similar.
  4. (3) / (4) are similar.

Why is x.map / x.apply fast?

This appears to be because it uses fast compiled Cython code:

cpdef ndarray[object] astype_str(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, str(arr[i]))

    return result
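
For readers unfamiliar with Cython, the loop above can be mirrored in plain Python as a minimal sketch. The helper name `astype_str_sketch` is hypothetical and not part of pandas. Because this version runs at interpreter speed, it shows only the logic and not the performance of the compiled routine.

```python
import numpy as np


def astype_str_sketch(arr):
    """Pure-Python analogue of the Cython loop (hypothetical helper)."""
    flat = np.asarray(arr).ravel()
    # Pre-allocate an object array, as the Cython version does with np.empty.
    result = np.empty(flat.size, dtype=object)
    for i in range(flat.size):
        # Each slot ends up holding an ordinary Python str.
        result[i] = str(flat[i])
    return result


out = astype_str_sketch(np.array([1, 22, 333]))
print(out.tolist())
```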

Why is x.astype(str) slow?

Pandas applies str to each item in the series, not using the above Cython.

Hence performance is comparable to [str(i) for i in x] / list(map(str, x)).

Why is x.values.astype(str) so fast?

Numpy does not apply a function on each element of the array. One description of this I found:

If you did s.values.astype(str) what you get back is an object holding int. This is numpy doing the conversion, whereas pandas iterates over each item and calls str(item) on it. So if you do s.astype(str) you have an object holding str.

There is a technical reason why the numpy version hasn't been implemented in the case of no-nulls.
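
One way to see that the two routes produce different kinds of result is to inspect what each returns. The sketch below assumes a small random Series, and the variable names are illustrative. In the numpy versions I am aware of, the numpy route yields a fixed-width unicode array rather than an array of Python objects, which may be part of why it avoids a per-element str call. The exact dtype that pandas reports for the Series can vary by version, so only the element type is checked here.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 100, 1000))

via_numpy = s.values.astype(str)  # conversion performed inside numpy
via_pandas = s.astype(str)        # conversion performed by pandas

# numpy reports a unicode dtype kind ("U") for its result.
print(via_numpy.dtype.kind)

# The pandas result hands back ordinary Python str elements.
print(type(via_pandas.iloc[0]))

# Both routes produce the same text for every element.
same_text = list(via_numpy) == list(via_pandas)
print(same_text)
```

If a downstream tool needs a pandas object, the numpy result can be wrapped again with pd.Series(via_numpy, index=s.index). Whether that wrapping keeps the speed advantage is worth measuring rather than assuming.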
