np.select with more than two pandas columns

Views: 60 | Posted: 2021/5/30 19:15:01 | Tags: python, python-3.x, pandas, list, numpy


Problem description

I am trying to solve a pandas problem. The DataFrame looks like this:

import time

import numpy as np
import pandas as pd

np.random.seed(0)

# 700 rows: an operation name plus two length-2 NumPy arrays per row
dataframe = pd.DataFrame({
    'operation': ['data_a', 'data_b', 'avg', 'concat', 'sum', 'data_a', 'concat'] * 100,
    'data_a': list(np.random.uniform(-1, 1, [700, 2])),
    'data_b': list(np.random.uniform(-1, 1, [700, 2])),
})

The 'operation' column names the merge rule for each row: 'data_a' means take that row's data_a value, 'avg' means take the average of that row's data_a and data_b, and so on.
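To make the rules above concrete, here is a minimal sketch for a single made-up row. The arrays `a` and `b` are hypothetical values chosen for illustration (they are not taken from the question's data), and 'avg' and 'sum' are assumed to be element-wise.

```python
import numpy as np

# Hypothetical values standing in for one row's 'data_a' and 'data_b'
a = np.array([0.2, -0.4])
b = np.array([0.6, 0.8])

# What each operation name would yield for that row (element-wise assumed)
expected = {
    'data_a': a,
    'data_b': b,
    'avg': (a + b) / 2,
    'sum': a + b,
    'concat': np.concatenate([a, b]),  # twice as long as the other results
}

for name, value in expected.items():
    print(name, value.shape, value)
```

Note that 'concat' yields a longer array than the other operations, so the results cannot all sit in one regular 2-D array.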

The expected output is a new column containing the value produced by the merge rule named in the 'operation' column.

There are many rows, and each cell holds an n-dimensional NumPy array.

I have tried two solutions, but both are quite slow.

The first solution uses a plain Python loop:

# first solution

start = time.time()
dataframe['new_column'] = 'dummy_values'

for i in range(len(dataframe)):
    
    if dataframe['operation'].iloc[i]  == 'data_a':
        dataframe['new_column'].iloc[i] = dataframe['data_a'].iloc[i]
    elif dataframe['operation'].iloc[i] == 'data_b':
        dataframe['new_column'].iloc[i] = dataframe['data_b'].iloc[i]
    elif dataframe['operation'].iloc[i] == 'avg':
        dataframe['new_column'].iloc[i] = dataframe[['data_a','data_b']].iloc[i].mean()
    elif dataframe['operation'].iloc[i] == 'sum':
        dataframe['new_column'].iloc[i] = dataframe[['data_a','data_b']].iloc[i].sum()
    elif dataframe['operation'].iloc[i] == 'concat':
        dataframe['new_column'].iloc[i] = np.concatenate([dataframe['data_a'].iloc[i], dataframe['data_b'].iloc[i]], axis=0)
        
end = time.time()
print(end - start)

# 0.3356964588165283

That is quite slow. The second solution uses the pandas apply method:

# second solution
start = time.time()
def f(x):
    if x['operation']  == 'data_a':
        return x['data_a']
    elif x['operation']  == 'data_b':
        return x['data_b']
    elif x['operation']  == 'avg':
        return x[['data_a','data_b']].mean()
    elif x['operation']  == 'sum':
        return x[['data_a','data_b']].sum()
    elif x['operation']  == 'concat':
        return  np.concatenate([x['data_a'], x['data_b']], axis=0)
        
dataframe['new_column'] = dataframe.apply(f, axis=1)

end = time.time()
print(end - start)

# 0.2401289939880371

That is also quite slow. I am now trying to solve the problem with NumPy's select method:

# third solution

import numpy as np
con1 = dataframe['operation']  == 'data_a'
con2 = dataframe['operation']  == 'data_b'
con3 = dataframe['operation']  == 'avg'
con4 = dataframe['operation']  == 'sum'
con5 = dataframe['operation']  == 'mul'



val1 = dataframe['data_a']
val2 = dataframe['data_b']
val3 = dataframe[['data_b', 'data_a']].mean()
val4 = dataframe[['data_b', 'data_a']].sum()
val5 = dataframe[['data_b']]* dataframe[['data_a']]


dataframe['new_column'] = np.select([con1,con2,con3,con4,con5], [val1,val2,val3,val4,val5])

This raises an error:

~/tfproject/tfenv/lib/python3.7/site-packages/numpy/lib/stride_tricks.py in _broadcast_shape(*args)
    189     # use the old-iterator because np.nditer does not handle size 0 arrays
    190     # consistently
--> 191     b = np.broadcast(*args[:32])
    192     # unfortunately, it cannot handle 32 or more arguments directly
    193     for pos in range(32, len(args), 31):

ValueError: shape mismatch: objects cannot be broadcast to a single shape
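One likely cause, sketched here under assumptions: `DataFrame.mean()` and `DataFrame.sum()` reduce over rows by default, so they return one value per column and not one per row. A result of length 2 cannot broadcast against conditions of length 700. The frame below is a hypothetical numeric stand-in (scalars instead of arrays) used only to show the shapes involved.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with 700 rows of plain floats
df = pd.DataFrame({'data_a': np.arange(700.0), 'data_b': np.arange(700.0) * 2})

cond = (df['data_a'] > 10).to_numpy()                   # shape (700,)
col_mean = df[['data_b', 'data_a']].mean().to_numpy()   # shape (2,): one value per column

raised = False
try:
    np.select([cond], [col_mean])
except ValueError as exc:
    raised = True
    print('np.select failed:', exc)

print(cond.shape, col_mean.shape)
```

Every entry in the choice list needs to broadcast to the shape of the conditions, which a per-column reduction does not.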

How can I fix this error, and is there a more efficient way to solve this problem?

Thank you!
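One possible direction, offered as a sketch and not as a verified answer: stack each column into a 2-D array, let np.select handle the operations whose results share a shape, and treat 'concat' separately because its rows are longer. The names `a`, `b`, `op`, `fixed`, and `wide` are introduced here for illustration. The sketch covers the operations present in the sample data; the third attempt tests for 'mul', which never appears there, while 'concat' does. No timing is claimed.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'operation': ['data_a', 'data_b', 'avg', 'concat', 'sum', 'data_a', 'concat'] * 100,
    'data_a': list(np.random.uniform(-1, 1, [700, 2])),
    'data_b': list(np.random.uniform(-1, 1, [700, 2])),
})

# Turn the object columns into regular 2-D arrays of shape (700, 2)
a = np.stack(df['data_a'].tolist())
b = np.stack(df['data_b'].tolist())
op = df['operation'].to_numpy()

# Operations whose result has the same shape as the inputs
conds = [op == 'data_a', op == 'data_b', op == 'avg', op == 'sum']
choices = [a, b, (a + b) / 2, a + b]

# A trailing axis lets each (700,) condition broadcast against (700, 2) choices
fixed = np.select([c[:, None] for c in conds], choices, default=np.nan)

# 'concat' rows have length 4, so they are computed separately
wide = np.concatenate([a, b], axis=1)
is_concat = op == 'concat'

# Rows differ in length, so the column is built as a list of arrays
df['new_column'] = [wide[i] if is_concat[i] else fixed[i] for i in range(len(df))]
```

The final list comprehension still runs in Python, but it only picks precomputed rows; the arithmetic itself is done on whole arrays.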
