Pandas add column with value based on condition based on other columns

Question

I have the following pandas DataFrame:

import pandas as pd
import numpy as np

d = {'age' : [21, 45, 45, 5],
     'salary' : [20, 40, 10, 100]}

df = pd.DataFrame(d)

I would like to add an extra column called "is_rich" which captures whether a person is rich, depending on their salary. I found multiple ways to accomplish this:

# method 1
df['is_rich_method1'] = np.where(df['salary']>=50, 'yes', 'no')

# method 2
df['is_rich_method2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]

# method 3
df['is_rich_method3'] = 'no'
df.loc[df['salary'] >= 50, 'is_rich_method3'] = 'yes'

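All three methods produce the same labels for this data: only the row with a salary of 100 is marked 'yes'.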

However, I don't understand which way is preferred. Are all methods equally good, depending on your application?
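Before comparing the methods on speed, it may be worth confirming that they actually agree. The sketch below rebuilds the three columns on a small frame that adds one hypothetical boundary row (a salary of exactly 50, which is not in the original data) and uses the same `>=` comparison in every method. If one method used `>` instead, that boundary row would be labelled differently. The short column names `m1`, `m2` and `m3` are placeholders chosen for brevity.

```python
import numpy as np
import pandas as pd

# Original data plus one hypothetical boundary row (salary == 50).
df = pd.DataFrame({'age': [21, 45, 45, 5, 30],
                   'salary': [20, 40, 10, 100, 50]})

# method 1: vectorised selection with NumPy
df['m1'] = np.where(df['salary'] >= 50, 'yes', 'no')

# method 2: plain Python list comprehension
df['m2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]

# method 3: default value first, then boolean-mask assignment
df['m3'] = 'no'
df.loc[df['salary'] >= 50, 'm3'] = 'yes'

# True only if every row carries the same label in all three columns.
all_agree = bool((df['m1'] == df['m2']).all() and (df['m2'] == df['m3']).all())
print(all_agree)
```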

Recommended answer

Use timeits, Luke!

Conclusion
List comprehensions perform best on smaller amounts of data because they incur very little overhead, even though they are not vectorised. On larger data, on the other hand, loc and numpy.where perform better: vectorisation wins the day.

Keep in mind that the applicability of a method depends on your data, the number of conditions, and the data type of your columns. My suggestion is to test the various methods on your own data before settling on one.
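As one illustration of how the number of conditions changes the picture: with more than two outcomes, nested `np.where` calls or chained `loc` assignments get harder to read, and `numpy.select` is a commonly used alternative. This is a sketch, not part of the original answer; the salary bands, the labels and the `wealth` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [21, 45, 45, 5],
                   'salary': [20, 40, 10, 100]})

# Conditions are checked in order; the first one that matches wins.
conditions = [df['salary'] >= 50, df['salary'] >= 30]
choices = ['rich', 'comfortable']  # hypothetical labels

# Rows that match none of the conditions get the default label.
df['wealth'] = np.select(conditions, choices, default='poor')
print(df)
```

Because the conditions are evaluated in order, the broader `>= 30` test has to come after the narrower `>= 50` test.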

One sure takeaway from here, however, is that list comprehensions are pretty competitive: CPython's implementation of them is highly optimised for performance.

Benchmarking code, for reference. Here are the functions being timed:

def numpy_where(df):
  return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))

def list_comp(df):
  return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])

def loc(df):
  df = df.assign(is_rich='no')
  df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'
  return df
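
The timing harness itself is not reproduced here, so the following is only a sketch of how such a comparison might be run with the standard-library `timeit` module. The frame sizes, the repeat count, the random salary data and the `time_methods` helper are all assumptions chosen for illustration. The three functions are restated, with the same `>=` threshold throughout, so that the block runs on its own.

```python
import timeit

import numpy as np
import pandas as pd


def numpy_where(df):
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))


def list_comp(df):
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])


def loc(df):
    df = df.assign(is_rich='no')
    df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'
    return df


def time_methods(n_rows, repeats=20):
    """Return {function name: best seconds per call} for a frame of n_rows.

    Hypothetical helper, not taken from the original answer.
    """
    rng = np.random.default_rng(0)  # fixed seed so runs use the same data
    df = pd.DataFrame({'salary': rng.integers(0, 100, size=n_rows)})
    results = {}
    for func in (numpy_where, list_comp, loc):
        # Each timing runs func once; keep the fastest of `repeats` runs.
        timings = timeit.repeat(lambda: func(df), number=1, repeat=repeats)
        results[func.__name__] = min(timings)
    return results


if __name__ == '__main__':
    for n in (10, 1_000, 100_000):  # assumed sizes, adjust to your data
        print(n, time_methods(n))
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; absolute numbers will vary with hardware and library versions, so only the relative ordering at each size is meaningful.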
