np.select with more than two pandas columns
Problem description
I am trying to solve a pandas problem. The DataFrame looks like this:
import time
import numpy as np
import pandas as pd

np.random.seed(0)
dataframe = pd.DataFrame({
    'operation': ['data_a', 'data_b', 'avg', 'concat', 'sum', 'data_a', 'concat'] * 100,
    'data_a': list(np.random.uniform(-1, 1, [700, 2])),
    'data_b': list(np.random.uniform(-1, 1, [700, 2])),
})
The 'operation' column specifies how to merge the two data columns. If 'operation' is 'data_a', take that row's 'data_a' value; if it is 'avg', take the average of that row's 'data_a' and 'data_b'; and so on.
The expected output is a new column holding the value produced by each row's operation.
I am dealing with many rows, and each cell holds an n-dimensional NumPy array.
I have tried two solutions, but both are quite slow.
The first solution uses a plain Python loop:
# first solution
start = time.time()
dataframe['new_column'] = 'dummy_values'
for i in range(len(dataframe)):
    if dataframe['operation'].iloc[i] == 'data_a':
        dataframe['new_column'].iloc[i] = dataframe['data_a'].iloc[i]
    elif dataframe['operation'].iloc[i] == 'data_b':
        dataframe['new_column'].iloc[i] = dataframe['data_b'].iloc[i]
    elif dataframe['operation'].iloc[i] == 'avg':
        dataframe['new_column'].iloc[i] = dataframe[['data_a','data_b']].iloc[i].mean()
    elif dataframe['operation'].iloc[i] == 'sum':
        dataframe['new_column'].iloc[i] = dataframe[['data_a','data_b']].iloc[i].sum()
    elif dataframe['operation'].iloc[i] == 'concat':
        dataframe['new_column'].iloc[i] = np.concatenate([dataframe['data_a'].iloc[i], dataframe['data_b'].iloc[i]], axis=0)
end = time.time()
print(end - start)
# 0.3356964588165283
This is quite slow. The second solution uses the pandas apply method:
# second solution
start = time.time()
def f(x):
    if x['operation'] == 'data_a':
        return x['data_a']
    elif x['operation'] == 'data_b':
        return x['data_b']
    elif x['operation'] == 'avg':
        return x[['data_a','data_b']].mean()
    elif x['operation'] == 'sum':
        return x[['data_a','data_b']].sum()
    elif x['operation'] == 'concat':
        return np.concatenate([x['data_a'], x['data_b']], axis=0)
dataframe['new_column'] = dataframe.apply(f, axis=1)
end = time.time()
print(end - start)
# 0.2401289939880371
This is also quite slow. I am trying to use NumPy's select function to solve the problem:
# third solution
import numpy as np
con1 = dataframe['operation'] == 'data_a'
con2 = dataframe['operation'] == 'data_b'
con3 = dataframe['operation'] == 'avg'
con4 = dataframe['operation'] == 'sum'
con5 = dataframe['operation'] == 'mul'
val1 = dataframe['data_a']
val2 = dataframe['data_b']
val3 = dataframe[['data_b', 'data_a']].mean()
val4 = dataframe[['data_b', 'data_a']].sum()
val5 = dataframe[['data_b']]* dataframe[['data_a']]
dataframe['new_column'] = np.select([con1,con2,con3,con4,con5], [val1,val2,val3,val4,val5])
This raises an error:
~/tfproject/tfenv/lib/python3.7/site-packages/numpy/lib/stride_tricks.py in _broadcast_shape(*args)
189 # use the old-iterator because np.nditer does not handle size 0 arrays
190 # consistently
--> 191 b = np.broadcast(*args[:32])
192 # unfortunately, it cannot handle 32 or more arguments directly
193 for pos in range(32, len(args), 31):
ValueError: shape mismatch: objects cannot be broadcast to a single shape
How can I fix this error, and is there a faster way to solve this problem?
Thanks!
Recommended answer
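On the error itself: `np.select` broadcasts every entry of the choice list to one common shape. In the attempt above, `.mean()` and `.sum()` are called on the whole two-column frame, so by default they aggregate down the rows rather than within each row, and multiplying two single-column frames with different labels aligns on column names and returns two columns. Those shapes cannot be broadcast against the 700-element `val1`. Below is a minimal sketch of one way to make `np.select` work for the fixed-width operations, assuming every cell holds an array of the same length. The small frame and the names `a`, `b`, and `op` are illustrative and not taken from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 7
df = pd.DataFrame({
    "operation": ["data_a", "data_b", "avg", "sum", "mul", "data_a", "avg"],
    "data_a": list(rng.uniform(-1, 1, (n, 2))),
    "data_b": list(rng.uniform(-1, 1, (n, 2))),
})

# Stack the per-row arrays into regular (n, 2) float arrays.
a = np.stack(df["data_a"].tolist())
b = np.stack(df["data_b"].tolist())

# Give the conditions a trailing axis so (n, 1) broadcasts against (n, 2).
op = df["operation"].to_numpy()[:, None]
conditions = [op == "data_a", op == "data_b", op == "avg", op == "sum", op == "mul"]
choices = [a, b, (a + b) / 2, a + b, a * b]

# Every choice now has the same (n, 2) shape, so broadcasting succeeds.
result = np.select(conditions, choices, default=np.nan)
df["new_column"] = list(result)  # one length-2 array per row
```

`concat` is left out on purpose: it produces rows of length 4, which cannot share a rectangular array with the length-2 results.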
You can vectorize this with pandas masking, so that each operation runs only on the rows that need it while you still get the benefits of vectorization. For brevity, df is your dataframe:
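# start with an object column, since each cell will hold a NumPy array
df['new_column'] = None
# copy data_a where the operation asks for it
mask = df['operation'] == 'data_a'
df.loc[mask, 'new_column'] = df.loc[mask, 'data_a']
# copy data_b where the operation asks for it
mask = df['operation'] == 'data_b'
df.loc[mask, 'new_column'] = df.loc[mask, 'data_b']
# element-wise average of the two arrays in each matching row
mask = df['operation'] == 'avg'
df.loc[mask, 'new_column'] = (df.loc[mask, 'data_a'] + df.loc[mask, 'data_b']) / 2
# etc. for the remaining operations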
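The same masking idea can be sketched for all five operations, including `concat`. The version below is an assumption-laden variant rather than the answer's exact code: it assumes all arrays in a column have the same length, stacks them into 2-D NumPy arrays, and uses a hypothetical `handlers` dictionary to map each operation name to a block-wise function. The arithmetic runs once per operation on a block of rows; only the final scatter into the object array is a Python loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ops = ["data_a", "data_b", "avg", "concat", "sum", "data_a", "concat"]
df = pd.DataFrame({
    "operation": ops,
    "data_a": list(rng.uniform(-1, 1, (len(ops), 2))),
    "data_b": list(rng.uniform(-1, 1, (len(ops), 2))),
})

# Regular 2-D views of the per-row arrays.
a = np.stack(df["data_a"].tolist())
b = np.stack(df["data_b"].tolist())
op = df["operation"].to_numpy()

# Hypothetical mapping from operation name to a function on row blocks.
handlers = {
    "data_a": lambda x, y: x,
    "data_b": lambda x, y: y,
    "avg": lambda x, y: (x + y) / 2,
    "sum": lambda x, y: x + y,
    "concat": lambda x, y: np.concatenate([x, y], axis=1),
}

# Object array, because concat rows are longer than the others.
out = np.empty(len(df), dtype=object)
for name, func in handlers.items():
    idx = np.flatnonzero(op == name)   # positions of rows with this operation
    block = func(a[idx], b[idx])       # one vectorized call per operation
    for i, row in zip(idx, block):
        out[i] = row
df["new_column"] = out
```

Rows whose operation has no entry in `handlers` are left as `None`, which makes unhandled cases easy to spot.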