Inequality joins in Pandas?

Question

I usually use DataFrame.merge to combine data frames in pandas. As I understand it, this only supports equality joins. What is the idiomatic way to join two data frames on other kinds of conditions (e.g. an inequality)?

Recommended answer

Pandas merge() supports outer, left, and right joins (not just inner joins) between two data frames, so you can return unmatched records. merge() can also be generalized to produce a cross join (every pairing of rows between the two data frames), and filtering the result afterwards lets you return unmatched records as well. Finally, there is the pandas isin() method.

Consider the following demonstration. Below are two data frames of computer languages, where the first data frame is a subset of the second. An outer join returns the records from both frames, with NaN in the columns of the frame that has no match, so the unmatched rows can be selected by filtering afterwards. A cross join returns every pairing of full rows, which can then be filtered, and isin() tests whether a column's values appear in another column:

import pandas as pd

df1 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP'],
                    'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web'], 
                    'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source', 'Open-Source']})

df2 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP',
                                 'Perl', 'R', 'Ruby', 'VB.NET', 'Javascript', 'Matlab'],
                    'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web',
                            'application', 'computing', 'web', 'application', 'web', 'computing'],
                    'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source',
                            'Open-Source', 'Open-Source', 'Open-Source', 'Open-Source', 'Proprietary',
                            'Open-Source', 'Proprietary']})    

# OUTER JOIN 
mergedf = pd.merge(df1, df2, on=['Languages'], how='outer')
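# KEEP ROWS WHERE THE SMALLER FRAME'S COLUMNS ARE NULL (LANGUAGES ONLY IN df2)
# (ROW ORDER AND INDEX LABELS OF THE OUTER JOIN MAY DIFFER BETWEEN PANDAS VERSIONS)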
mergedf = mergedf[pd.isnull(mergedf['Type_x'])][['Languages', 'Uses_y', 'Type_y']]

#     Languages       Uses_y       Type_y
#6         Perl  application  Open-Source
#7            R    computing  Open-Source
#8         Ruby          web  Open-Source
#9       VB.NET  application  Proprietary
#10  Javascript          web  Open-Source
#11      Matlab    computing  Proprietary


# ISIN COMPARISON, RETURNING RECORDS IN LARGER NOT IN SMALLER
unequaldf = df2[~df2.Languages.isin(df1['Languages'])]

#     Languages         Uses         Type
#6         Perl  application  Open-Source
#7            R    computing  Open-Source
#8         Ruby          web  Open-Source
#9       VB.NET  application  Proprietary
#10  Javascript          web  Open-Source
#11      Matlab    computing  Proprietary


# CROSS JOIN 
df1['key'] = 1                 # (REQUIRES A JOIN KEY OF SAME VALUE)
df2['key'] = 1                    
crossjoindf = pd.merge(df1, df2, on=['key'])
# FILTER FOR LANGUAGES IN LARGER NOT IN SMALLER (ALSO USING ISIN)
crossjoindf = crossjoindf[~crossjoindf['Languages_y'].isin(crossjoindf['Languages_x'])]\
                    [['Languages_y', 'Uses_y', 'Type_y']].drop_duplicates()

#   Languages_y       Uses_y       Type_y
#6         Perl  application  Open-Source
#7            R    computing  Open-Source
#8         Ruby          web  Open-Source
#9       VB.NET  application  Proprietary
#10  Javascript          web  Open-Source
#11      Matlab    computing  Proprietary
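
As a variant of the outer-join approach, merge() also accepts an indicator parameter that labels where each row came from, which avoids testing a suffixed column for NaN. The sketch below is only illustrative: the frames small and large are hypothetical, trimmed-down stand-ins for df1 and df2, not the data above.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-ins for the two frames in the demonstration.
small = pd.DataFrame({'Languages': ['C', 'Java', 'Python']})
large = pd.DataFrame({'Languages': ['C', 'Java', 'Python', 'Perl', 'R'],
                      'Uses': ['computing', 'application', 'application',
                               'application', 'computing']})

# indicator=True adds a '_merge' column whose values are
# 'left_only', 'right_only', or 'both'.
merged = pd.merge(small, large, on='Languages', how='outer', indicator=True)

# Keep the rows that exist only in the larger (right-hand) frame.
only_in_large = (merged[merged['_merge'] == 'right_only']
                 .drop(columns='_merge')
                 .reset_index(drop=True))

print(only_in_large)
```

The index is reset here because the row order of an outer join is not something this sketch relies on.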

Admittedly, the cross join may be redundant and verbose here, but if finding your unmatched records requires comparing every pairing of rows across the data frames, it can be handy.
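
To connect the cross join back to the inequality in the question, here is a minimal sketch of one possible approach: build every pairing of rows, then keep the pairs that satisfy the inequality. The orders and tiers frames and their column names are made up for illustration, and how='cross' assumes pandas 1.2 or later (on older versions the constant-key trick from the demonstration gives the same pairing).

```python
import pandas as pd

# Hypothetical data: orders with an amount, and discount tiers with a minimum spend.
orders = pd.DataFrame({'order_id': [1, 2, 3],
                       'amount': [50, 150, 300]})
tiers = pd.DataFrame({'tier': ['bronze', 'silver', 'gold'],
                      'min_amount': [0, 100, 250]})

# Pair every order with every tier (assumes pandas >= 1.2 for how='cross').
pairs = orders.merge(tiers, how='cross')

# Inequality condition: keep the tiers whose minimum the order amount meets.
qualifying = pairs[pairs['amount'] >= pairs['min_amount']].reset_index(drop=True)

print(qualifying)
```

The intermediate frame holds len(orders) * len(tiers) rows, so this pattern is best kept to inputs where that product fits comfortably in memory.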
