Apply Python function to one pandas column and apply the output to multiple columns


Problem description

Hello community,

I have read many answers and blogs, yet I cannot figure out what simple thing I am missing. I am using a 'conditions' function to define all the conditions and apply it to one dataframe column. If a condition is satisfied, it should create or update two new dataframe columns, 'cat' and 'subcat'.

It would be a big help if you could help me out here!

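import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in type.
data = {'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3'],
        'desc': ['Present', 'Present', 'NA', 'Present', 'NA']}

df = pd.DataFrame(data)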

The dataframe looks like this:

          remark       desc
0         NA           Present      
1         NA           Present        
2         Category1    NA                   
3         Category2    Present                   
4         Category3    NA            

I have written a function to define the conditions, as below:

def conditions(s):
    # Map a single remark value to a (cat, subcat) pair.
    if s == 'Category1':
        x = 'insufficient'
        y = 'resolution'
    elif s == 'Category2':
        x = 'insufficient'
        y = 'information'
    elif s == 'Category3':
        x = 'Duplicate'
        y = 'ID repeated'
    else:
        x = 'NA'
        y = 'NA'

    return (x, y)

I have tried several ways of applying the above function to the dataframe column, but with no luck.

df[['cat','subcat']] = df['remark'].apply(lambda x: pd.Series([conditions(df)[0],conditions(df)[1]]))

My expected dataframe should look something like this:

          remark       desc        cat           subcat
0         NA           Present     NA            NA      
1         NA           Present     NA            NA
2         Category1    NA          insufficient  resolution         
3         Category2    Present     insufficient  information              
4         Category3    NA          Duplicate     ID repeated

Thanks a lot.

Recommended answer

One way around this is with a list comprehension:

df[['cat', 'subcat']] = [("insufficient", "resolution")  if word == "Category1" else 
                         ("insufficient", "information") if word == "Category2" else
                         ("Duplicate", "ID repeated")    if word == "Category3" else 
                         ("NA", "NA")
                         for word in df.remark]

          remark       desc        cat           subcat
0         NA           Present     NA            NA
1         NA           Present     NA            NA
2         Category1    NA          insufficient  resolution
3         Category2    Present     insufficient  information
4         Category3    NA          Duplicate     ID repeated

@dm2's answer shows how to pull it off with your function. The first apply(conditions) creates a Series containing tuples; the second apply splits those tuples into individual columns, forming a dataframe that you can then assign to cat and subcat.
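Since @dm2's answer is not reproduced here, the following is only a rough sketch of the two-step idea described above. The first step is the apply(conditions) call mentioned in the text. For the second step, this sketch swaps in a plain DataFrame constructor over the tuples rather than a second apply, which is an assumption on my part; the names `pairs` and `expanded` are illustrative only.

```python
import pandas as pd


def conditions(s):
    """Map a remark value to a (cat, subcat) tuple, as in the question."""
    if s == 'Category1':
        return ('insufficient', 'resolution')
    elif s == 'Category2':
        return ('insufficient', 'information')
    elif s == 'Category3':
        return ('Duplicate', 'ID repeated')
    return ('NA', 'NA')


df = pd.DataFrame({
    'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3'],
    'desc': ['Present', 'Present', 'NA', 'Present', 'NA'],
})

# Step 1: applying the function to the column yields a Series of 2-tuples.
pairs = df['remark'].apply(conditions)

# Step 2 (assumption: a stand-in for the second apply): expand the tuples
# into a two-column frame that shares df's index.
expanded = pd.DataFrame(pairs.tolist(), index=df.index,
                        columns=['cat', 'subcat'])

# Attach the new columns; join aligns on the shared index.
df = df.join(expanded)
print(df)
```

The key point is that the function is called on each value of the column (not on the whole dataframe), and the resulting tuples are split into columns afterwards.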

I suggest a list comprehension for two reasons. First, you are dealing with strings, and in Pandas, working with strings via vanilla Python is more often than not faster. Second, with the list comprehension the processing is done once: you do not need to apply the conditions function and then call pd.Series. That should make it faster, though testing will confirm or debunk this.
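One possible way to run that test is a small timeit harness like the one below. It is a sketch under stated assumptions: the benchmark data (the question's five remarks repeated 200 times), the helper names `via_comprehension` and `via_double_apply`, and the repeat count are all made up for illustration, and the comprehension calls conditions instead of inlining the conditionals. The numbers will vary by machine and pandas version, so no result is claimed here.

```python
import timeit

import pandas as pd


def conditions(s):
    """Map a remark value to a (cat, subcat) tuple, as in the question."""
    if s == 'Category1':
        return ('insufficient', 'resolution')
    elif s == 'Category2':
        return ('insufficient', 'information')
    elif s == 'Category3':
        return ('Duplicate', 'ID repeated')
    return ('NA', 'NA')


# Hypothetical benchmark data: the question's remarks repeated 200 times.
df = pd.DataFrame(
    {'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3'] * 200}
)


def via_comprehension(frame):
    # One pass in plain Python; returns a list of (cat, subcat) tuples.
    return [conditions(word) for word in frame['remark']]


def via_double_apply(frame):
    # apply(conditions) builds the tuples, then apply(pd.Series) is used
    # to split them into columns.
    return frame['remark'].apply(conditions).apply(pd.Series)


t_comp = timeit.timeit(lambda: via_comprehension(df), number=5)
t_apply = timeit.timeit(lambda: via_double_apply(df), number=5)
print(f"list comprehension: {t_comp:.4f}s, double apply: {t_apply:.4f}s")
```

Running this on data shaped like your own is more informative than any general rule, since the gap depends on row count and on how expensive the function is.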
