Specific pandas columns as arguments in new column of df.apply outputs

Question

Given the following pandas DataFrame:

import pandas as pd
from sklearn.metrics import mean_squared_error

df = pd.DataFrame.from_dict(
    {'row': ['a', 'b', 'c', 'd', 'e', 'y'],
     'a': [0, -.8, -.6, -.3, .8, .01],
     'b': [-.8, 0, .5, .7, -.9, .01],
     'c': [-.6, .5, 0, .3, .1, .01],
     'd': [-.3, .7, .3, 0, .2, .01],
     'e': [.8, -.9, .1, .2, 0, .01],
     'y': [.01, .01, .01, .01, .01, 0],
    }).set_index('row')
df.columns.names = ['col']

I want to create a new column of RMSE values (from scikit-learn) using specific columns as the arguments, namely y_true = df[['a','b','c']] vs y_pred = df[['d','e','y']]. This was easy to do using an iterative approach:

for tup in df.itertuples():
    df.at[tup[0], 'rmse']  = mean_squared_error(tup[1:4], tup[4:7])**0.5

This gives the expected result:

col     a     b     c     d     e     y      rmse
row                                              
a    0.00 -0.80 -0.60 -0.30  0.80  0.01  1.003677
b   -0.80  0.00  0.50  0.70 -0.90  0.01  1.048825
c   -0.60  0.50  0.00  0.30  0.10  0.01  0.568653
d   -0.30  0.70  0.30  0.00  0.20  0.01  0.375988
e    0.80 -0.90  0.10  0.20  0.00  0.01  0.626658
y    0.01  0.01  0.01  0.01  0.01  0.00  0.005774

But I want a higher-performance solution, possibly using vectorization, since my dataframe has shape (180000000, 52). I also dislike indexing by tuple position rather than by column name. The attempt below:

df['rmse'] = df.apply(mean_squared_error(df[['a','b','c']], df[['d','e','y']])**0.5, axis=1)

gives the error:

TypeError: ("'numpy.float64' object is not callable", 'occurred at index a')

So what am I doing wrong with my use of df.apply()? And would it even be faster than iterating?
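On the error itself: the expression passed to df.apply is evaluated first and yields a single numpy.float64, which apply then tries to call once per row. df.apply expects a function. The following is a minimal sketch of a working call, assuming a plain NumPy RMSE in place of scikit-learn's mean_squared_error; row_rmse, true_cols and pred_cols are illustrative names, not from the question.

```python
import numpy as np
import pandas as pd

# Two rows from the question's frame, enough to show the call pattern.
df = pd.DataFrame(
    {'a': [0.0, -0.8], 'b': [-0.8, 0.0], 'c': [-0.6, 0.5],
     'd': [-0.3, 0.7], 'e': [0.8, -0.9], 'y': [0.01, 0.01]},
    index=['a', 'b'])

true_cols = ['a', 'b', 'c']   # hypothetical name for the y_true columns
pred_cols = ['d', 'e', 'y']   # hypothetical name for the y_pred columns

def row_rmse(row):
    # row is a Series; convert to arrays so the subtraction is positional
    # rather than aligned on the (different) column labels.
    diff = row[true_cols].to_numpy() - row[pred_cols].to_numpy()
    return np.sqrt(np.mean(diff ** 2))

# Pass the function itself; apply calls it once per row when axis=1.
df['rmse'] = df.apply(row_rmse, axis=1)
print(df['rmse'])
```

This still runs a Python-level function for every row, so it is not expected to scale to a frame with 180 million rows.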

I've tested the wall times for each of the approaches from the first two respondents using the test df below:

# set up test df
import numpy as np
import pandas as pd
from random import shuffle

dim_x, dim_y = 50, 1000000
cols = ["a_"+str(i) for i in range(1,(dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1,(dim_x//2)+1)]
cols.extend(cols_b)
shuffle(cols)  # randomize the column order
df = pd.DataFrame(np.random.uniform(0,10,[dim_y, dim_x]), columns=cols)
a = df.values

# define column samples
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

c0 = [s for s in cols if "a" in s]
c1 = [s for s in cols if "b" in s]
s0 = a[:,column_index(df, c0)]
s1 = a[:,column_index(df, c1)]

The results were as follows:

%%time
# approach 1 - divakar
rmse_out = np.sqrt(((s0 - s1)**2).mean(1))
df['rmse_out'] = rmse_out

Wall time: 393 ms

%%time
# approach 2 - divakar
diffs = s0 - s1
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/diffs.shape[1])  # divide by the number of columns, not a hard-coded 3
df['rmse_out'] = rmse_out

Wall time: 228 ms

%%time
# approach 3 - divakar
diffs = s0 - s1
rmse_out = np.sqrt((np.einsum('ij,ij->i',s0,s0) + \
         np.einsum('ij,ij->i',s1,s1) - \
       2*np.einsum('ij,ij->i',s0,s1))/s0.shape[1])  # divide by the number of columns, not a hard-coded 3
df['rmse_out'] = rmse_out

Wall time: 421 ms

The solution using the apply function is still running after several minutes...

Recommended answer

Approach 1

One approach for performance is to work on the underlying array data with NumPy ufuncs, slicing out the two blocks of columns so that the ufuncs operate on them in a vectorized manner:

a = df.values
rmse_out = np.sqrt(((a[:,0:3] - a[:,3:6])**2).mean(1))
df['rmse_out'] = rmse_out
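As a sketch of the same idea that selects the blocks by column name rather than by position (which the question asks for), the following assumes the frame from the question. The names y_true and y_pred are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'row': ['a', 'b', 'c', 'd', 'e', 'y'],
     'a': [0, -.8, -.6, -.3, .8, .01],
     'b': [-.8, 0, .5, .7, -.9, .01],
     'c': [-.6, .5, 0, .3, .1, .01],
     'd': [-.3, .7, .3, 0, .2, .01],
     'e': [.8, -.9, .1, .2, 0, .01],
     'y': [.01, .01, .01, .01, .01, 0],
    }).set_index('row')

# Select each block by label, then drop to plain NumPy arrays so the
# subtraction is element-wise by position.
y_true = df[['a', 'b', 'c']].to_numpy()
y_pred = df[['d', 'e', 'y']].to_numpy()

# Row-wise RMSE: square the differences, average across columns, take the root.
df['rmse_out'] = np.sqrt(((y_true - y_pred) ** 2).mean(axis=1))
print(df)
```

Selecting with a list of labels copies the data, so on a very large frame the positional slices above may use less memory.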

Approach 2

Use np.einsum to replace the squared summation:

diffs = a[:,0:3] - a[:,3:6]
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)

Approach 3

Another way to compute rmse_out uses the identity:

(a - b)^2 = a^2 + b^2 - 2ab

The slices to extract are:

s0 = a[:,0:3]
s1 = a[:,3:6]

Then rmse_out would be:

np.sqrt(((s0**2).sum(1) + (s1**2).sum(1) - (2*s0*s1).sum(1))/3.0)

which with einsum becomes:

np.sqrt((np.einsum('ij,ij->i',s0,s0) + \
         np.einsum('ij,ij->i',s1,s1) - \
       2*np.einsum('ij,ij->i',s0,s1))/3.0)
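A quick way to check that the three approaches agree is to run them on the same random input. This is a sketch on made-up data, not the question's frame. It divides by the number of columns (n) instead of the literal 3.0, so the same lines work for blocks of any width.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, arbitrary choice
s0 = rng.uniform(0, 10, (1000, 3))      # stand-in for the y_true block
s1 = rng.uniform(0, 10, (1000, 3))      # stand-in for the y_pred block
n = s0.shape[1]                         # number of columns per block

# Approach 1: plain ufuncs
diffs = s0 - s1
r1 = np.sqrt((diffs ** 2).mean(axis=1))

# Approach 2: einsum for the row-wise sum of squares
r2 = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / n)

# Approach 3: expand (a - b)^2 = a^2 + b^2 - 2ab
r3 = np.sqrt((np.einsum('ij,ij->i', s0, s0)
              + np.einsum('ij,ij->i', s1, s1)
              - 2 * np.einsum('ij,ij->i', s0, s1)) / n)

print(np.allclose(r1, r2), np.allclose(r1, r3))
```

One caveat with the expanded form: when the two blocks are nearly equal, floating-point cancellation can leave a tiny negative value under the square root, so Approaches 1 and 2 are the safer choice in that case.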


Getting the respective column indices

If you are not sure whether the columns a, b, ... are in that order, their positions can be found with column_index (the helper shown in the question's benchmark setup).

Thus a[:,0:3] would be replaced by a[:,column_index(df, ['a','b','c'])] and a[:,3:6] by a[:,column_index(df, ['d','e','y'])].
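To illustrate the lookup, here is a small sketch using a one-row frame whose columns are deliberately out of order. The frame and its values are made up for the example. The pandas method Index.get_indexer is shown alongside as an alternative way to obtain the same positions.

```python
import numpy as np
import pandas as pd

def column_index(df, query_cols):
    # Positions of query_cols within df.columns, whatever the column order.
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

# Made-up frame with shuffled column order.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['y', 'c', 'a', 'e', 'b', 'd'])
a = df.values

idx = column_index(df, ['a', 'b', 'c'])
print(idx)            # positions of 'a', 'b', 'c' in the shuffled frame
print(a[:, idx])      # the corresponding block of values

# Alternative lookup with pandas' own Index.get_indexer
idx_alt = df.columns.get_indexer(['a', 'b', 'c'])
```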
