Pandas populate new dataframe column based on matching columns in another dataframe

Question

I have a dataframe, df, that holds my main data: one million rows and 30 columns. I want to add another column to df called category. The category values come from a column in df2, which has around 700 rows and two other columns that should match two columns in df.

I began by setting an index on both df2 and df so that the frames can be matched on it; however, some index values in df2 do not exist in df.

The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.

The relevant column in df is called AUTHOR_NAME.

Some AUTHOR_NAME values in df do not exist in df2, and vice versa.

The rule I want is: when the index in df matches the index in df2 and title in df matches title in df2, add the category to df; otherwise set category to NaN.

Sample data:

df2
           AUTHOR_NAME              CATEGORY
Index       
Pub1        author1                 main
Pub2        author1                 main
Pub3        author1                 main
Pub1        author2                 sub
Pub3        author2                 sub
Pub2        author4                 sub


df
            AUTHOR_NAME     ...n amount of other columns        
Index       
Pub1        author1                 
Pub2        author1     
Pub1        author2 
Pub1        author3
Pub2        author4 

expected_result
            AUTHOR_NAME             CATEGORY   ...n amount of other columns
Index
Pub1        author1                 main
Pub2        author1                 main
Pub1        author2                 sub
Pub1        author3                 NaN
Pub2        author4                 sub

If I use df2.merge(df,left_index=True,right_index=True,how='left', on=['AUTHOR_NAME']), my df becomes three times bigger than it is supposed to be.
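One plausible cause of the growth, offered as a sketch and not as a diagnosis of the real data: if the merge effectively matches on the index alone, every duplicated index value pairs with every row sharing it in the other frame. The example below rebuilds the sample frames (naming the index "Index" is an assumption) and compares an index-only merge with a merge on both keys.

```python
import pandas as pd

# Sample frames from the question; naming the index "Index" is an assumption.
df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# Index-only merge: each Pub1 row in df pairs with every Pub1 row in df2,
# so the result has more rows than df.
index_only = df.merge(df2, left_index=True, right_index=True, how="left")

# Merge on both keys: move the index into a column so it can be combined
# with AUTHOR_NAME, then restore it afterwards.
both_keys = (
    df.reset_index()
    .merge(df2.reset_index(), on=["Index", "AUTHOR_NAME"], how="left")
    .set_index("Index")
)

print(len(df), len(index_only), len(both_keys))  # 5 10 5
```

With both keys in play, the row count stays at 5 and the CATEGORY column matches expected_result, including NaN for (Pub1, author3).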

So I thought merging might be the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and return type values to df depending on whether certain conditions are met.

def calculate_category(df2, d):
    category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    return str(category_row['CATEGORY'].iat[0])

df.apply(lambda d: calculate_category(df2, d), axis=1)

However, this throws an error:

IndexError: ('index out of bounds', u'occurred at index 7614')
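A likely explanation, stated as an assumption about the real data: for some row of df no row of df2 satisfies both conditions, so the selection is empty and `.iat[0]` has nothing to return. The sketch below keeps the row-wise lookup but returns NaN for an empty selection. It also assumes "Index" is an ordinary column (obtained here by building the frames that way), since `df2["Index"]` would not find a value stored only in the index.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with "Index" held as a regular column (assumption).
df2 = pd.DataFrame({
    "Index": ["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"],
    "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
    "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
})
df = pd.DataFrame({
    "Index": ["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"],
    "AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"],
})

def calculate_category(df2, d):
    # Rows of the lookup table that match this row on both keys.
    category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    if category_row.empty:
        # No match: return NaN instead of indexing into an empty selection.
        return np.nan
    return str(category_row["CATEGORY"].iat[0])

df["CATEGORY"] = df.apply(lambda d: calculate_category(df2, d), axis=1)
print(df)
```

This scans df2 once per row of df, so on a million rows it will probably be much slower than a single merge or join; it is shown only to make the cause of the error visible.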

Recommended answer

Consider the following dataframes df and df2:

import pandas as pd

df = pd.DataFrame(dict(
        AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
        title=      list('zyxwvutsrqponml')
    ))

df2 = pd.DataFrame(dict(
        AUTHOR_NAME=list('AABCCEGG'),
        title      =list('zwvtrpml'),
        CATEGORY   =list('11223344')
    ))

Option 1
merge

df.merge(df2, how='left')
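When `on` is omitted, merge uses the columns the two frames have in common, which here are AUTHOR_NAME and title. As a hedged variation that is not part of the original answer, the keys can be named explicitly, and merge's `validate` and `indicator` parameters can guard against the row multiplication described in the question. The variable names `keys`, `merged`, and `unmatched` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

keys = ['AUTHOR_NAME', 'title']

# validate='many_to_one' raises a MergeError if a key pair is duplicated in df2,
# which is the situation that would multiply rows in the result.
# indicator=True adds a '_merge' column recording where each row found a match.
merged = df.merge(df2, how='left', on=keys, validate='many_to_one', indicator=True)

# Rows of df that found no partner in df2; their CATEGORY is NaN.
unmatched = merged[merged['_merge'] == 'left_only']

print(len(df), len(merged), len(unmatched))  # 15 15 8
```

Dropping the helper column afterwards with `merged.drop(columns='_merge')` leaves the same frame as the plain left merge.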

Option 2
join

cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)

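Both options yield the same result: df keeps all of its rows and gains a CATEGORY column, which is NaN wherever no row in df2 has the same AUTHOR_NAME and title.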
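The output itself did not survive in this copy of the page, so the following is a sketch that rebuilds the two example frames, runs both options, and checks that they agree. The names `merged` and `joined` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

# Option 1: left merge on the shared columns (AUTHOR_NAME and title).
merged = df.merge(df2, how='left')

# Option 2: join against df2 indexed by the same two columns.
cols = ['AUTHOR_NAME', 'title']
joined = df.join(df2.set_index(cols), on=cols)

# Compare the new column after replacing NaN with a placeholder,
# because NaN never compares equal to itself.
same = merged['CATEGORY'].fillna('').tolist() == joined['CATEGORY'].fillna('').tolist()
print(same)
print(merged)
```

Rows 0, 4, 6, 8, 10, 13 and 14 receive the categories 1, 2, 2, 3, 3, 4 and 4; the other eight rows have no matching pair in df2 and get NaN.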
