Inequality joins in Pandas?
Question
I usually use DataFrame.merge to combine dataframes in pandas. From my understanding, this only works for equality joins. What is the idiomatic way to join two dataframes using other kinds of conditions (e.g. inequality)?
Recommended answer
Pandas merge() allows for outer, left, and right joins (not just inner joins) between two data frames, so you can return unmatched records. Additionally, merge() can be generalized to return a cross join (every pairwise combination of rows from the two data frames), and by filtering afterwards you can return unmatched records. There is also the pandas isin() method.
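As a minimal sketch of the outer-join idea, merge() accepts an indicator flag that records which side each row came from, so unmatched rows can be selected without checking for NaN. The frames and column names below (left, right, key, left_val, right_val) are hypothetical and exist only for this illustration:

```python
import pandas as pd

# Hypothetical toy frames, used only for this illustration.
left = pd.DataFrame({"key": ["a", "b"], "left_val": [1, 2]})
right = pd.DataFrame({"key": ["a", "b", "c", "d"], "right_val": [10, 20, 30, 40]})

# indicator=True adds a "_merge" column whose values are
# "left_only", "right_only", or "both".
merged = pd.merge(left, right, on="key", how="outer", indicator=True)

# Keep the rows whose key appears only in the right-hand frame.
right_only = merged.loc[merged["_merge"] == "right_only", ["key", "right_val"]]
right_only_keys = sorted(right_only["key"])
```

Filtering on "_merge" avoids relying on NaN, which can be ambiguous when the original data already contains missing values.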
Consider the following demonstration. Below are two data frames of something we have come to enjoy: computer languages. As seen, the first data frame is a subset of the second. An outer join returns the records of both frames, with NaN in the columns of the side that has no match, and those NaN values can then be used as a filter. A cross join returns every combination of complete rows, which can then be filtered, and isin() searches for values within a column:
import pandas as pd
df1 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP'],
                    'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web'],
                    'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source', 'Open-Source']})
df2 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP',
                                  'Perl', 'R', 'Ruby', 'VB.NET', 'Javascript', 'Matlab'],
                    'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web',
                             'application', 'computing', 'web', 'application', 'web', 'computing'],
                    'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source',
                             'Open-Source', 'Open-Source', 'Open-Source', 'Open-Source', 'Proprietary',
                             'Open-Source', 'Proprietary']})
# OUTER JOIN
mergedf = pd.merge(df1, df2, on=['Languages'], how='outer')
# KEEP ROWS WHERE THE SMALLER FRAME'S COLUMNS ARE NULL (LANGUAGES ONLY IN df2)
mergedf = mergedf[pd.isnull(mergedf['Type_x'])][['Languages', 'Uses_y', 'Type_y']]
# (row order and index labels can differ between pandas versions)
# Languages Uses_y Type_y
#6 Perl application Open-Source
#7 R computing Open-Source
#8 Ruby web Open-Source
#9 VB.NET application Proprietary
#10 Javascript web Open-Source
#11 Matlab computing Proprietary
# ISIN COMPARISON, RETURNING RECORDS IN LARGER NOT IN SMALLER
unequaldf = df2[~df2.Languages.isin(df1['Languages'])]
# Languages Uses Type
#6 Perl application Open-Source
#7 R computing Open-Source
#8 Ruby web Open-Source
#9 VB.NET application Proprietary
#10 Javascript web Open-Source
#11 Matlab computing Proprietary
# CROSS JOIN
df1['key'] = 1 # (REQUIRES A JOIN KEY OF SAME VALUE)
df2['key'] = 1
crossjoindf = pd.merge(df1, df2, on=['key'])
# FILTER FOR LANGUAGES IN LARGER NOT IN SMALLER (ALSO USING ISIN)
crossjoindf = crossjoindf[~crossjoindf['Languages_y'].isin(crossjoindf['Languages_x'])]\
                         [['Languages_y', 'Uses_y', 'Type_y']].drop_duplicates()
# Languages_y Uses_y Type_y
#6 Perl application Open-Source
#7 R computing Open-Source
#8 Ruby web Open-Source
#9 VB.NET application Proprietary
#10 Javascript web Open-Source
#11 Matlab computing Proprietary
Admittedly, the cross join may be redundant and verbose here, but if finding your unmatched records requires comparing every combination of rows across the data frames, it can be handy.
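For the inequality case raised in the question, one possible sketch is a cross join followed by a boolean filter on the inequality condition. The frames and columns here (events, windows, time, start, end) are hypothetical, and how="cross" assumes pandas 1.2 or later; on older versions the constant-key merge shown above produces the same pairing:

```python
import pandas as pd

# Hypothetical data, used only for this illustration.
events = pd.DataFrame({"event": ["e1", "e2", "e3"], "time": [5, 12, 25]})
windows = pd.DataFrame({"window": ["w1", "w2"], "start": [0, 10], "end": [10, 20]})

# Cross join: every event is paired with every window.
# how="cross" requires pandas >= 1.2.
pairs = pd.merge(events, windows, how="cross")

# Inequality condition: start <= time < end.
in_window = (pairs["start"] <= pairs["time"]) & (pairs["time"] < pairs["end"])
joined = pairs[in_window].reset_index(drop=True)
```

Because the cross join materializes len(events) * len(windows) rows before filtering, this approach is best suited to small or moderately sized frames.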