Pandas Compare between Two DataFrames, flag what matches
Question
I have two dataframes, df and df1.
df is below:
Facility Category ID Part Text
Centennial History 11111 A Drain
Centennial History 11111 B Read
Centennial History 11111 C EKG
Centennial History 11111 D Assistant
Centennial History 11111 E Primary
df1 is below (just a small sample for the question; it is actually 50,000 rows):
Facility Category ID Part Text
Centennial History 11111 D Assistant
Basically, I want to compare rows between the two dataframes and, where a row in df also appears in df1, flag it in a new column in the first dataframe, df, with the column header 'MatchingFlag'.
I would like my end result dataframe to look like the one below, as I'm just as concerned about the rows that do not match.
Facility Category ID Part Text MatchingFlag
Centennial History 11111 A Drain No
Centennial History 11111 B Read No
Centennial History 11111 C EKG No
Centennial History 11111 D Assistant Yes
Centennial History 11111 E Primary No
Any help on how to do this? I've tried merging the two dataframes with df = pd.merge(df1, df, how='left', on=['Facility', 'Category', 'ID', 'Part', 'Text']) and then creating a flag based on blank or NaN values, but that doesn't do what I was hoping.
Recommended answer
It might make sense to set an index on the columns you want to match on, and use that index to sort out which rows match:
columns = ['Facility', 'Category', 'ID', 'Part', 'Text']
# It's always a good idea to sort after creating a MultiIndex like this
# (sortlevel() was removed from pandas; sort_index() is the current equivalent)
df = df.set_index(columns).sort_index()
df1 = df1.set_index(columns).sort_index()
# You don't have to use Yes here, anything will do
# The boolean True might be more appropriate
df['MatchingFlag'] = "Yes"
df1['MatchingFlag'] = "Yes"
# Add them together, matching rows will have the value "YesYes"
# Non-matches will be nan
result = df + df1
# If you'd rather not have NaN's
result['MatchingFlag'] = result['MatchingFlag'].replace('YesYes', 'Yes').fillna('No')
# Optionally turn the index levels back into ordinary columns
result = result.reset_index()
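As an alternative to the index-based approach above, a left merge with the indicator option of DataFrame.merge can produce the same flag without touching the index. The following is a minimal sketch and not part of the original answer: the sample frames are rebuilt from the tables in the question, and the names merged and result are illustrative. It assumes the key columns have the same dtypes in both frames.

```python
import pandas as pd

columns = ["Facility", "Category", "ID", "Part", "Text"]

# Sample data rebuilt from the tables in the question
df = pd.DataFrame(
    [
        ["Centennial", "History", 11111, "A", "Drain"],
        ["Centennial", "History", 11111, "B", "Read"],
        ["Centennial", "History", 11111, "C", "EKG"],
        ["Centennial", "History", 11111, "D", "Assistant"],
        ["Centennial", "History", 11111, "E", "Primary"],
    ],
    columns=columns,
)
df1 = pd.DataFrame(
    [["Centennial", "History", 11111, "D", "Assistant"]],
    columns=columns,
)

# A left merge keeps every row of df. indicator=True adds a "_merge" column
# that is "both" when the row was also found in df1 and "left_only" otherwise.
# drop_duplicates() guards against repeated rows in df1, which would
# otherwise duplicate rows of df in the merged result.
merged = df.merge(
    df1[columns].drop_duplicates(), on=columns, how="left", indicator=True
)

# Translate the indicator into the Yes/No flag asked for in the question
merged["MatchingFlag"] = (merged["_merge"] == "both").map({True: "Yes", False: "No"})
result = merged.drop(columns="_merge")
print(result)
```

Unlike adding the two indexed frames together, this keeps only the rows of df, so rows that exist only in df1 do not appear in the result.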