如何在非简单条件下使用 pandas 执行DataFrames的内部或外部联接 [英] how to perform an inner or outer join of DataFrames with Pandas on non-simplistic criterion

查看:70
本文介绍了如何在非简单条件下使用 pandas 执行DataFrames的内部或外部联接的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出两个数据框,如下所示:

Given two dataframes as below:

>>> import pandas as pd

>>> df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
>>> df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])
>>> df_a
   a  b
0  1  4
1  2  5
2  3  6

>>> df_b
   c  d
0  2  7
1  3  8

我们想使用非简单化的标准(例如,"df_b.c> df_a.a")为两个数据框生成SQL样式的连接.据我所知,虽然merge()当然是解决方案的一部分,但我不能直接使用它,因为它不接受"ON"条件的任意表达式(除非我遗漏了什么?).

we would like to produce a SQL-style join of both dataframes using a non-simplistic criteria, let's say "df_b.c > df_a.a". From what I can tell, while merge() is certainly part of the solution, I can't use it directly since it doesn't accept arbitrary expressions for "ON" criteria (unless I'm missing something?).

在SQL中,结果如下所示:

In SQL, the results look like this:

# inner join
sqlite> select * from df_a join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8

# outer join
sqlite> select * from df_a left outer join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8
3|6||

我目前用于内部联接的方法是产生笛卡尔积 df_a和df_b的值,方法是在两者中都添加一列"1",然后使用 在"1"列上进行merge(),然后应用"c> a"条件.

my current approach for inner join is to produce a cartesian product of df_a and df_b, by adding a column of "1"s to both, then using merge() on the "1"s column, then applying the "c > a" criteria.

>>> import numpy as np
>>> df_a['ones'] = np.ones(3)
>>> df_b['ones'] = np.ones(2)
>>> cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
>>> cartesian
   a  b  ones  c  d
0  1  4     1  2  7
1  1  4     1  3  8
2  2  5     1  2  7
3  2  5     1  3  8
4  3  6     1  2  7
5  3  6     1  3  8
>>> cartesian[cartesian.c > cartesian.a]
   a  b  ones  c  d
0  1  4     1  2  7
1  1  4     1  3  8
3  2  5     1  3  8

对于外部联接,到目前为止,我不确定最好的方法 我一直在玩弄内在的联系,然后运用否定 获取所有其他行的条件,然后尝试对其进行编辑 否定"设置在原件上,但实际上并没有用.

for outer join, I'm not sure of the best way to go, so far I've been playing with getting the inner join, then applying the negation of the criteria to get all the other rows, then trying to edit that "negation" set onto the original, but it doesn't really work.

修改. HYRY在这里回答了具体问题,但我需要在Pandas API中更通用,更通用的东西,因为我的加入条件可以是任何东西,而不仅仅是一个比较.对于外部联接,首先我要在左侧"添加一个额外的索引,该索引将在进行内部联接后保持自身:

Edit. HYRY answered the specific question here but I needed something more generic and more within the Pandas API, as my join criterion could be anything, not just that one comparison. For outerjoin, first I'm adding an extra index to the "left" side that will maintain itself after I do the inner join:

df_a['_left_index'] = df_a.index

然后我们进行笛卡尔运算并获得内部联接:

then we do the cartesian and get the inner join:

cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
innerjoin = cartesian[cartesian.c > cartesian.a]

然后在"df_a"中获取我们需要的其他索引ID,并从"df_a"中获取行:

then I get the additional index ids in "df_a" that we'll need, and get the rows from "df_a":

remaining_left_ids = set(df_a['_left_index']).\
                    difference(innerjoin['_left_index'])
remaining = df_a.ix[remaining_left_ids]

然后,我们使用一个直接的concat(),它将左侧的缺失列替换为"NaN"(我以为以前没有这样做,但我想是的):

then we use a straight concat(), which replaces missing columns with "NaN" for left (I thought it wasn't doing this earlier but I guess it does):

outerjoin = pd.concat([innerjoin, remaining]).reset_index()

HYRY的想法是只对需要比较的cols进行笛卡尔运算,这基本上是正确的答案,尽管在我的特定情况下,实现(一般化和全部化)可能会有些棘手.

HYRY's idea to do the cartesian on just those cols that we need to compare on is basically the right answer, though in my specific case it might be a little tricky to implement (generalized and all).

问题:

  1. 如何在"c> a"上产生df_1和df_2的"join"?将 您使用相同的笛卡尔积,过滤器"方法,还是有更好的方法 方式?

  1. How would you produce a "join" of df_1 and df_2 on "c > a"? Would you do the same "cartesian product, filter" approach or is there some better way?

您将如何产生相同的左外部联接"?

How would you produce the "left outer join" of same?

推荐答案

我使用ufunc的外部方法来计算结果,这是示例:

I use the outer method of ufunc to calculate the result, here is the example:

首先,一些数据:

import pandas as pd
import numpy as np
df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}, {"a": 4, "b": 8}, {"a": 1, "b": 7}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}, {"c": 2, "d": 10}])
print "df_a"
print df_a
print "df_b"
print df_b

输出:

df_a
   a  b
0  1  4
1  2  5
2  3  6
3  4  8
4  1  7
df_b
   c   d
0  2   7
1  3   8
2  2  10

内部联接,因为这仅计算c&的笛卡尔积. a,内存使用率小于整个DataFrame的笛卡尔积:

Inner join, because this only calculate the cartesian product of c & a, memory useage is less than cartesian product of the whole DataFrame:

ia, ib = np.where(np.less.outer(df_a.a, df_b.c))
print pd.concat((df_a.take(ia).reset_index(drop=True), 
                 df_b.take(ib).reset_index(drop=True)), axis=1)

输出:

   a  b  c   d
0  1  4  2   7
1  1  4  3   8
2  1  4  2  10
3  2  5  3   8
4  1  7  2   7
5  1  7  3   8
6  1  7  2  10

要计算左外部联接,请使用numpy.setdiff1d()查找所有不在内部联接中的df_a行:

to calculate the left outer join, use numpy.setdiff1d() to find all the rows of df_a that not in the inner join:

na = np.setdiff1d(np.arange(len(df_a)), ia)
nb = -1 * np.ones_like(na)
oa = np.concatenate((ia, na))
ob = np.concatenate((ib, nb))
print pd.concat([df_a.take(oa).reset_index(drop=True), 
                 df_b.take(ob).reset_index(drop=True)], axis=1)

输出:

   a  b   c   d
0  1  4   2   7
1  1  4   3   8
2  1  4   2  10
3  2  5   3   8
4  1  7   2   7
5  1  7   3   8
6  1  7   2  10
7  3  6 NaN NaN
8  4  8 NaN NaN

这篇关于如何在非简单条件下使用 pandas 执行DataFrames的内部或外部联接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆