Pandas populate new dataframe column based on matching columns in another dataframe
Question
I have a df that contains my main data: one million rows and 30 columns. I want to add another column to df called category. The category is a column in df2, which contains around 700 rows and two other columns that will match two columns in df.
I begin by setting an index in df2 and df that will match between the frames. However, some of the index values in df2 don't exist in df.
The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.
The relevant column in df is called AUTHOR_NAME.
Some AUTHOR_NAME values in df don't exist in df2, and vice versa.
The instruction I want is: when the index in df matches the index in df2 and the title in df matches the title in df2, add category to df; otherwise put NaN in category.
Sample data:
df2

        AUTHOR_NAME  CATEGORY
Index
Pub1    author1      main
Pub2    author1      main
Pub3    author1      main
Pub1    author2      sub
Pub3    author2      sub
Pub2    author4      sub

df

        AUTHOR_NAME  ...n amount of other columns
Index
Pub1    author1
Pub2    author1
Pub1    author2
Pub1    author3
Pub2    author4

expected_result

        AUTHOR_NAME  CATEGORY  ...n amount of other columns
Index
Pub1    author1      main
Pub2    author1      main
Pub1    author2      sub
Pub1    author3      NaN
Pub2    author4      sub
If I use df2.merge(df, left_index=True, right_index=True, how='left', on=['AUTHOR_NAME']), my df becomes three times bigger than it is supposed to be.
So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and return the category values to df depending on whether certain conditions are met.
def calculate_category(df2, d):
    category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    return str(category_row['CATEGORY'].iat[0])

df.apply(lambda d: calculate_category(df2, d), axis=1)
However, this throws an error:
IndexError: ('index out of bounds', u'occurred at index 7614')
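The error above is consistent with .iat[0] being called on an empty selection: for a row of df with no matching (Index, AUTHOR_NAME) pair in df2, category_row has zero rows, so there is no position 0 to read. A minimal sketch of a guarded version follows, assuming Index is an ordinary column in both frames (the sample data shows it as the index, so reset_index() would be needed first). The frames here are rebuilt from the sample data for illustration only.

```python
import pandas as pd

# Sample data from the question, with "Index" as a regular column (an assumption).
df2 = pd.DataFrame({
    "Index": ["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"],
    "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
    "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
})
df = pd.DataFrame({
    "Index": ["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"],
    "AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"],
})


def calculate_category(df2, d):
    # Select the df2 rows whose two key columns match the current df row.
    match = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    if match.empty:
        # No matching row: return NaN instead of indexing an empty selection.
        return float("nan")
    return str(match["CATEGORY"].iat[0])


result = df.apply(lambda d: calculate_category(df2, d), axis=1)
print(result)
```

This scans df2 once per row of df, so on a million rows it is likely to be much slower than a single merge or join on the two key columns.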
Recommended answer
Consider the following dataframes df and df2:
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml')
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344')
))
Option 1: merge
df.merge(df2, how='left')
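With no on argument, merge uses the columns common to both frames as keys, so every key column has to match. Applied to the question's sample data, that means matching on both Index and AUTHOR_NAME. Matching on the index alone pairs each df row with every df2 row that shares its index value, which is one plausible source of the row growth the question describes. A minimal sketch, assuming the index is named Index and is turned into an ordinary column with reset_index() before merging:

```python
import pandas as pd

# Sample data from the question; the index name "Index" is taken from the listing.
df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# Index-only merge: each df row is paired with every df2 row that has the
# same index value, so duplicated index values multiply the rows.
index_only = df.merge(df2, left_index=True, right_index=True, how="left")

# Merge on both keys: at most one df2 row matches each df row here,
# so the row count of df is preserved and unmatched rows get NaN.
both_keys = (
    df.reset_index()
    .merge(df2.reset_index(), on=["Index", "AUTHOR_NAME"], how="left")
    .set_index("Index")
)

print(len(df), len(index_only), len(both_keys))
print(both_keys)
```

The row count stays fixed only while each (Index, AUTHOR_NAME) pair appears at most once in df2; duplicated pairs in the lookup frame would still multiply rows.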
Option 2: join
cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)
Both options yield the same result.
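The output listing did not survive on this page, so here is a small sketch that runs both options on the answer's example frames and compares them. The variable names merged and joined are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml')
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344')
))

# Option 1: left merge on the columns shared by both frames
# (AUTHOR_NAME and title).
merged = df.merge(df2, how='left')

# Option 2: join against df2 indexed by the same two key columns.
cols = ['AUTHOR_NAME', 'title']
joined = df.join(df2.set_index(cols), on=cols)

# Both keep the 15 rows of df; rows without a match get NaN in CATEGORY.
print(merged)
print(merged.equals(joined))
```

Rows of df whose (AUTHOR_NAME, title) pair is absent from df2 keep their place and receive NaN in CATEGORY, which is the behavior the question asks for.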