np.select with more than two pandas columns


Question

I am trying to solve a pandas problem. The data frame looks like this:

import numpy as np
np.random.seed(0)
import time
import pandas as pd
dataframe = pd.DataFrame({'operation': ['data_a', 'data_b', 'avg', 'concat', 'sum', 'data_a', 'concat']*100, 
             'data_a': list(np.random.uniform(-1,1,[700,2])), 'data_b': list(np.random.uniform(-1,1,[700,2]))})

The 'operation' column names the merge operation to apply to each row: if it contains 'data_a', take that row's data_a value; if it contains 'avg', take the average of that row's 'data_a' and 'data_b'; and so on.

The expected output is a new column holding the value produced by each row's operation.

I am dealing with many rows, each holding an n-dimensional NumPy array.

I have tried two solutions, but both are quite slow.

The first solution uses a plain Python loop:

# first solution

start = time.time()
dataframe['new_column'] = 'dummy_values'

for i in range(len(dataframe)):
    
    if dataframe['operation'].iloc[i]  == 'data_a':
        dataframe['new_column'].iloc[i] = dataframe['data_a'].iloc[i]
    elif dataframe['operation'].iloc[i] == 'data_b':
        dataframe['new_column'].iloc[i] = dataframe['data_b'].iloc[i]
    elif dataframe['operation'].iloc[i] == 'avg':
        dataframe['new_column'].iloc[i] = dataframe[['data_a','data_b']].iloc[i].mean()
    elif dataframe['operation'].iloc[i] == 'sum':
        dataframe['new_column'].iloc[i] = dataframe[['data_a','data_b']].iloc[i].sum()
    elif dataframe['operation'].iloc[i] == 'concat':
        dataframe['new_column'].iloc[i] = np.concatenate([dataframe['data_a'].iloc[i], dataframe['data_b'].iloc[i]], axis=0)
        
end = time.time()
print(end - start)

# 0.3356964588165283

This is quite slow. The second solution uses the pandas apply method:

# second solution
start = time.time()
def f(x):
    if x['operation']  == 'data_a':
        return x['data_a']
    elif x['operation']  == 'data_b':
        return x['data_b']
    elif x['operation']  == 'avg':
        return x[['data_a','data_b']].mean()
    elif x['operation']  == 'sum':
        return x[['data_a','data_b']].sum()
    elif x['operation']  == 'concat':
        return  np.concatenate([x['data_a'], x['data_b']], axis=0)
        
dataframe['new_column'] = dataframe.apply(f, axis=1)

end = time.time()
print(end - start)

# 0.2401289939880371

This is also quite slow. I am now trying to use NumPy's select method to solve the problem:

# third solution

import numpy as np
con1 = dataframe['operation']  == 'data_a'
con2 = dataframe['operation']  == 'data_b'
con3 = dataframe['operation']  == 'avg'
con4 = dataframe['operation']  == 'sum'
con5 = dataframe['operation']  == 'mul'



val1 = dataframe['data_a']
val2 = dataframe['data_b']
val3 = dataframe[['data_b', 'data_a']].mean()
val4 = dataframe[['data_b', 'data_a']].sum()
val5 = dataframe[['data_b']]* dataframe[['data_a']]


dataframe['new_column'] = np.select([con1,con2,con3,con4,con5], [val1,val2,val3,val4,val5])

This raises an error:

~/tfproject/tfenv/lib/python3.7/site-packages/numpy/lib/stride_tricks.py in _broadcast_shape(*args)
    189     # use the old-iterator because np.nditer does not handle size 0 arrays
    190     # consistently
--> 191     b = np.broadcast(*args[:32])
    192     # unfortunately, it cannot handle 32 or more arguments directly
    193     for pos in range(32, len(args), 31):

ValueError: shape mismatch: objects cannot be broadcast to a single shape
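A likely cause of this error is that the choices do not all have one entry per row. With its default axis, `DataFrame.mean()` (like `DataFrame.sum()`) reduces down each column and returns one entry per column, and multiplying `dataframe[['data_b']]` by `dataframe[['data_a']]` aligns on column labels, so the product has two columns. `np.select` needs every condition and every choice to broadcast to one common shape. The sketch below reproduces the same failure with small stand-in arrays; the names `cond`, `per_row`, and `per_column` are illustrative and not part of the question's code.

```python
import numpy as np

# One boolean flag per row, shape (4,)
cond = np.array([True, False, True, False])

# A choice with one entry per row lines up with the condition
per_row = np.arange(4.0)

# A choice with one entry per column, like the result of a column-wise reduction
per_column = np.array([10.0, 20.0])

# Shapes (4,) and (4,) broadcast, so this works
ok = np.select([cond], [per_row], default=-1.0)

# Shapes (4,) and (2,) do not broadcast, so this raises ValueError
try:
    np.select([cond], [per_column])
    broadcast_error = None
except ValueError as exc:
    broadcast_error = str(exc)

print(ok)
print(broadcast_error)
```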

How can I solve this error, and is there a more efficient way to solve this problem?

Thanks!

Recommended answer

You can vectorize this with pandas masking, so that you only perform the operations that are needed while keeping the advantages of vectorization. For brevity, df is your dataframe:

df['new_column'] = None  # object dtype, since each cell will hold an array
mask = df['operation']=='data_a'
df.loc[mask, 'new_column'] = df.loc[mask, 'data_a']
mask = df['operation']=='data_b'
df.loc[mask, 'new_column'] = df.loc[mask, 'data_b']
mask = df['operation']=='avg'
df.loc[mask, 'new_column'] = (df.loc[mask, 'data_a'] + df.loc[mask, 'data_b'])/2
# etc
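For the remaining cases ('sum' and 'concat'), one possible way to cover every operation is sketched below. It is not the answerer's code: it swaps the per-mask `df.loc` assignments for plain NumPy arrays, and it assumes every row's data_a and data_b arrays have the same length so they can be stacked into 2-D matrices. The names `a`, `b`, `op`, `candidates`, and `new_values` are illustrative. The arithmetic runs once per operation over whole arrays. Only the final placement of rows into the column is a Python loop, because the cells hold arrays of different lengths (2 for most operations, 4 for 'concat').

```python
import numpy as np
import pandas as pd

# Same frame as in the question
np.random.seed(0)
df = pd.DataFrame({
    'operation': ['data_a', 'data_b', 'avg', 'concat', 'sum', 'data_a', 'concat'] * 100,
    'data_a': list(np.random.uniform(-1, 1, [700, 2])),
    'data_b': list(np.random.uniform(-1, 1, [700, 2])),
})

# Stack the per-row arrays into (n_rows, 2) matrices once
a = np.stack(df['data_a'].tolist())
b = np.stack(df['data_b'].tolist())
op = df['operation'].to_numpy()

# One whole-array computation per operation
candidates = {
    'data_a': a,
    'data_b': b,
    'avg': (a + b) / 2,
    'sum': a + b,
    'concat': np.concatenate([a, b], axis=1),  # shape (n_rows, 4)
}

# Pick, for each row, the row of the matrix that matches its operation
new_values = [None] * len(df)
for name, values in candidates.items():
    for i in np.flatnonzero(op == name):
        new_values[i] = values[i]

# dtype=object keeps one array per cell even though lengths differ
df['new_column'] = pd.Series(new_values, index=df.index, dtype=object)
```

Rows whose operation is not a key of `candidates` are left as None, which makes an unexpected operation name easy to spot.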
