Applying multiple conditions for multiple columns in pandas dataframe efficiently
Question
I have a DataFrame with dozens of columns.
Therapy area  Procedures1  Procedures2  Procedures3
Oncology              450          450         2345
Oncology              367          367          415
Oncology              152          152         4945
Oncology              876          876          345
Oncology             1098         1098           12
Oncology             1348         1348          234
Nononcology           225          225          345
Nononcology           300          300           44
Nononcology           267          267           45
Nononcology            90           90         4567
I want to convert the numeric values in all the Procedure columns into buckets.
For a single column, it would look something like this:
def hello(x):
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] < 200: return int(1)
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] in range(200, 500): return 2
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] in range(500, 1000): return 3
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] > 1000: return 4
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] < 200: return 11
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] in range(200, 500): return 22
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] in range(500, 1000): return 33
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] > 1000: return 44

test['Procedures1'] = test.apply(hello, axis=1)
What is the most efficient way to do this for dozens of columns with different column names (not just Procedures1, Procedures2, Procedures3, etc.)?
Update
I added a third column, and the code stopped working with this error:
ValueError: bins must increase monotonically.
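Bins do not directly answer my question. I may have different values. I would prefer a solution based on logical operations rather than bins.
The buckets can also be different for Nononcology, e.g. 11, 22, 33, 44.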
Recommended answer
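You can apply pd.cut to the relevant columns. Using np.inf as the last bin edge, rather than col.max(), keeps the edges increasing even when a column's maximum is below 1000, which is the case that raises the ValueError quoted in the update:
import numpy as np
import pandas as pd
cols = ['Procedures1', 'Procedures2']
df[cols] = df[cols].apply(lambda col: pd.cut(col, [0, 200, 500, 1000, np.inf], labels=[1, 2, 3, 4]))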
Output:
Therapy_area Procedures1 Procedures2
0 Oncology 2 2
1 Oncology 2 2
2 Oncology 1 1
3 Oncology 3 3
4 Oncology 4 4
5 Oncology 4 4
6 Nononcology 2 2
7 Nononcology 2 2
8 Nononcology 2 2
9 Nononcology 1 1
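As a small, hedged illustration of how the last bin edge interacts with the data: the sketch below uses a hypothetical column (Procedures4, not part of the question's data) whose maximum is below 1000. Taking the column maximum as the last edge then yields edges that are no longer increasing, while an open-ended np.inf edge does not depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical column whose maximum (900) is below the fixed edge 1000.
small = pd.Series([90, 250, 900], name="Procedures4")

try:
    # The edges become [0, 200, 500, 1000, 900], which is not increasing.
    pd.cut(small, [0, 200, 500, 1000, small.max()], labels=[1, 2, 3, 4])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# An open-ended last edge works regardless of the column's maximum.
buckets = pd.cut(small, [0, 200, 500, 1000, np.inf], labels=[1, 2, 3, 4])
bucket_list = buckets.astype(int).tolist()
```

Note that pd.cut uses right-closed intervals by default, so a value of exactly 200 lands in bucket 1 and a value of 0 falls outside every bin.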
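You can also use np.select, which allows different labels for each therapy area:
import numpy as np
import pandas as pd
cols = ['Procedures1', 'Procedures2', 'Procedures3']
def encoding(col, labels):
    # np.select keeps the first matching condition, so 500 and 1000 fall into the lower bucket
    conditions = [col < 200, col.between(200, 500), col.between(500, 1000), col > 1000]
    return pd.Series(np.select(conditions, labels, 0), index=col.index)
onc_labels = [1, 2, 3, 4]
nonc_labels = [11, 22, 33, 44]
msk = df['Therapy_area'] == 'Oncology'
# ~msk selects the non-oncology rows; sort_index restores the original row order
df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[~msk, cols].apply(encoding, args=(nonc_labels,)))).sort_index()
Output:
Therapy_area Procedures1 Procedures2 Procedures3
0 Oncology 2 2 4
1 Oncology 2 2 2
2 Oncology 1 1 4
3 Oncology 3 3 2
4 Oncology 4 4 1
5 Oncology 4 4 2
6 Nononcology 22 22 22
7 Nononcology 22 22 11
8 Nononcology 22 22 11
9 Nononcology 11 11 44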
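For dozens of columns with arbitrary names, one possible variation is to encode the whole numeric block at once with NumPy instead of going column by column. The sketch below is an illustration under stated assumptions, not the answer's own code: it assumes every column other than Therapy_area should be bucketed, and it uses half-open intervals (below 200, below 500, below 1000, 1000 and above), so a value of exactly 1000 gets the top bucket. The helper name encode is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Therapy_area": ["Oncology"] * 6 + ["Nononcology"] * 4,
    "Procedures1": [450, 367, 152, 876, 1098, 1348, 225, 300, 267, 90],
    "Procedures3": [2345, 415, 4945, 345, 12, 234, 345, 44, 45, 4567],
})

# Assumption: every column except the grouping column is bucketed,
# whatever those columns happen to be called.
cols = [c for c in df.columns if c != "Therapy_area"]


def encode(values, labels):
    """Map a 2-D array of counts to bucket labels (hypothetical helper)."""
    # np.select keeps the first condition that is True for each element.
    conditions = [values < 200, values < 500, values < 1000, values >= 1000]
    return np.select(conditions, labels)


values = df[cols].to_numpy()
# Column vector of booleans, so it broadcasts across all bucketed columns.
is_onc = df["Therapy_area"].eq("Oncology").to_numpy()[:, None]

result = df.copy()
result[cols] = np.where(
    is_onc,
    encode(values, [1, 2, 3, 4]),      # labels for oncology rows
    encode(values, [11, 22, 33, 44]),  # labels for all other rows
)
```

Because the row mask is applied element-wise, this version does not rely on the oncology rows coming first or on the index being a default range.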