Pandas Compare between Two DataFrames, flag what matches

Problem description

I have two dataframes: df and df1

df is below

Facility    Category  ID     Part  Text
Centennial  History   11111  A     Drain
Centennial  History   11111  B     Read
Centennial  History   11111  C     EKG
Centennial  History   11111  D     Assistant
Centennial  History   11111  E     Primary

df1 is below (just a small sample for the question; it actually has 50,000 rows)

Facility    Category  ID      Part   Text
Centennial  History  11111    D      Assistant 

Basically I want to compare rows between the two dataframes, and if a row matches in both, create another column in the first dataframe df with the column header ['MatchingFlag'].

I would like my end result dataframe to look like the one below, as I'm just as concerned about the rows that do not match.

Facility    Category  ID    Part    Text      MatchingFlag
Centennial  History  11111  A     Drain         No
Centennial  History  11111  B     Read          No
Centennial  History  11111  C     EKG           No
Centennial  History  11111  D     Assistant     Yes
Centennial  History  11111  E     Primary       No

Any help on how to do this? I've tried merging the two dataframes with df = pd.merge(df1, df, how='left', on=['Facility', 'Category', 'ID', 'Part', 'Text']) and then creating a flag based on blank or NaN values, but that doesn't do what I was hoping.

Recommended answer

It might make sense to set an index on the columns you want to match on, and use that index to sort out which rows match

columns = ['Facility', 'Category', 'ID', 'Part', 'Text']

# It's always a good idea to sort after creating a MultiIndex like this.
# sortlevel() was deprecated and later removed from pandas; sort_index() replaces it.
df = df.set_index(columns).sort_index()
df1 = df1.set_index(columns).sort_index()

# You don't have to use Yes here, anything will do
# The boolean True might be more appropriate
df['MatchingFlag'] = "Yes"
df1['MatchingFlag'] = "Yes"

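# Add them together, matching rows will have the value "YesYes"
# Non-matches will be NaN. The result is aligned on the union of both
# indexes, so rows that exist only in df1 also show up (as NaN).
result = df + df1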

# If you'd rather not have NaN's 
result.loc[:,'MatchingFlag'] = result.loc[:,'MatchingFlag'].replace('YesYes','Yes')
result.loc[:,'MatchingFlag'] = result['MatchingFlag'].fillna('No')
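
As an alternative to the index-based answer above, and closer to the merge the question attempted, here is a minimal sketch using the indicator parameter of DataFrame.merge. It is not the answerer's method. The sample frames are rebuilt from the question, ID is assumed to be an integer column, and the names keys, merged and flagged are made up for illustration. The key difference from the attempt in the question is that df is on the left, so every row of df is kept.

```python
import numpy as np
import pandas as pd

columns = ["Facility", "Category", "ID", "Part", "Text"]

# Sample data copied from the question (ID assumed to be an integer column)
df = pd.DataFrame({
    "Facility": ["Centennial"] * 5,
    "Category": ["History"] * 5,
    "ID": [11111] * 5,
    "Part": ["A", "B", "C", "D", "E"],
    "Text": ["Drain", "Read", "EKG", "Assistant", "Primary"],
})
df1 = pd.DataFrame({
    "Facility": ["Centennial"],
    "Category": ["History"],
    "ID": [11111],
    "Part": ["D"],
    "Text": ["Assistant"],
})

# Drop duplicate key rows from df1 so the left merge cannot multiply rows of df
keys = df1[columns].drop_duplicates()

# indicator=True adds a "_merge" column: "both" for rows found in df1,
# "left_only" for rows that exist only in df
merged = df.merge(keys, how="left", on=columns, indicator=True)

# Turn the indicator into the Yes/No flag and drop the helper column
merged["MatchingFlag"] = np.where(merged["_merge"] == "both", "Yes", "No")
flagged = merged.drop(columns="_merge")

print(flagged)
```

Rows only count as a match when all five columns are exactly equal, so stray whitespace in Text or a dtype mismatch in ID (string versus integer) would turn an expected "Yes" into a "No". Stripping strings and aligning dtypes before the merge may be needed on real data.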
