Pandas: optimizing some python code by getting rid of DataFrame.apply()

Problem description

The following code was produced using Python 2.7 and pandas 0.9.1.

I have a DataFrame with two columns, 'minor' and 'major'. For each row I compute a "critical" value by taking the larger of the two absolute values, and store it in a new column called 'critic':

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {'minor':[-6, -2.3, 19.2], 'major':[2, 3, 7.4]},
...     index=[10,20,30])
>>> print df
    major  minor
10    2.0   -6.0
20    3.0   -2.3
30    7.4   19.2
>>> df['critic'] = df[['minor', 'major']].abs().max(axis=1)
>>> print df
    major  minor  critic
10    2.0   -6.0     6.0
20    3.0   -2.3     3.0
30    7.4   19.2    19.2  

My problem is building another column, say 'critic_vector', that holds the name of the column that produced this value. Until now, I have been using DataFrame.apply() this way:

>>> def get_col_name(row, df, headers):
...     tmp = (abs(df[headers].ix[row.name]) == row['critic'])
...     retval = tmp.index[tmp.argmax()]
...     return retval
...
>>> df['critic_vector'] = df.apply(get_col_name,
...                                axis=1,
...                                args=(df, ['minor', 'major']))
>>> print df
    major  minor  critic critic_vector
10    2.0   -6.0     6.0       minor
20    3.0   -2.3     3.0       major
30    7.4   19.2    19.2       minor

It works correctly; however, with large amounts of data, the df.apply() call is my main bottleneck. Is there a more direct way to do this, without using df.apply()?

Thanks in advance

Solution

Random thoughts: to get the indices, you can use .idxmax instead of max, namely

>>> w = df[['minor','major']].abs().idxmax(axis=1)
>>> w
10    minor
20    major
30    minor
dtype: object

and then you could use lookup (there's probably something simpler, but I'm missing it right now):

>>> df.lookup(df.index, w)
array([ -6. ,   3. ,  19.2])
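
A note for readers on newer versions: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the call above only runs on older releases. One possible replacement, sketched here assuming Python 3 and a recent pandas with NumPy, is to turn the labels in w into column positions and index the underlying array directly. The names cols, row_pos, col_pos and signed are made up for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"minor": [-6, -2.3, 19.2], "major": [2, 3, 7.4]},
    index=[10, 20, 30],
)
cols = ["minor", "major"]

# Label of the column with the largest absolute value in each row.
w = df[cols].abs().idxmax(axis=1)

# Translate those labels into integer column positions, then pick one
# element per row from the underlying 2-D array (NumPy fancy indexing).
col_pos = df.columns.get_indexer(w)
row_pos = np.arange(len(df))
signed = df.to_numpy()[row_pos, col_pos]

print(signed.tolist())  # [-6.0, 3.0, 19.2]
```

This stays fully vectorized, like lookup did; it is only one of several ways to get the same values.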

IOW:

>>> df['critic_vector'] = df[['minor','major']].abs().idxmax(axis=1)
>>> df['critic'] = abs(df.lookup(df.index, df.critic_vector))
>>> df
    major  minor critic_vector  critic
10    2.0   -6.0         minor     6.0
20    3.0   -2.3         major     3.0
30    7.4   19.2         minor    19.2

I'm not super-happy with the lookup line -- you could replace it with your original max call, of course -- but I think the idxmax approach isn't a bad one.
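
For completeness, here is a sketch of that last suggestion: keep the original max call and add idxmax, so that no lookup is needed at all. It is written for Python 3 and a recent pandas, which is an assumption, since the question used Python 2.7 and pandas 0.9.1; abs_vals, tie and tie_winner are names invented for the example. One thing to be aware of: when both columns have the same absolute value, idxmax reports the first of the selected columns.

```python
import pandas as pd

df = pd.DataFrame(
    {"minor": [-6, -2.3, 19.2], "major": [2, 3, 7.4]},
    index=[10, 20, 30],
)
cols = ["minor", "major"]

# Compute the absolute values once and reuse them for both columns.
abs_vals = df[cols].abs()
df["critic"] = abs_vals.max(axis=1)            # largest absolute value per row
df["critic_vector"] = abs_vals.idxmax(axis=1)  # column that produced it
print(df)

# Tie case: |-3.0| == |3.0|, so the first column listed in cols is reported.
tie = pd.DataFrame({"minor": [-3.0], "major": [3.0]})
tie_winner = tie[cols].abs().idxmax(axis=1)
print(tie_winner.tolist())  # ['minor']
```

Both new columns come from whole-column operations, so nothing is called once per row the way df.apply(..., axis=1) is.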
