Pandas analogue to SQL MINUS / EXCEPT operator, using multiple columns
Question
I am looking for a Pandas analogue to the SQL MINUS (a.k.a. EXCEPT) operator.
Here is what I mean: given two Pandas DataFrames as follows:
In [77]: d1
Out[77]:
   a  b  c
0  0  0  1
1  0  1  2
2  1  0  3
3  1  1  4
4  0  0  5
5  1  1  6
6  2  2  7
In [78]: d2
Out[78]:
   a  b   c
0  1  1  10
1  0  0  11
2  1  1  12
How can I find the result of d1 MINUS d2, taking into account only columns "a" and "b", in order to get the following result:
In [62]: res
Out[62]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
MVCE:
import pandas as pd

d1 = pd.DataFrame({
    'a': [0, 0, 1, 1, 0, 1, 2],
    'b': [0, 1, 0, 1, 0, 1, 2],
    'c': [1, 2, 3, 4, 5, 6, 7]
})
d2 = pd.DataFrame({
    'a': [1, 0, 1],
    'b': [1, 0, 1],
    'c': [10, 11, 12]
})
What I have tried:
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
In [68]: res
Out[68]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
It gives me the correct result, but I have a feeling that there must be a more idiomatic and cleaner way to achieve this.
PS: the DataFrame.isin() method won't help in this case, as it will produce a wrong result set.
Recommended answer
We can use pandas.concat with drop_duplicates here, passing keep=False so that all duplicates are dropped:
pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
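Edit after comment by OP
If you want to make sure that rows which appear only in d2 are not taken into account, we can duplicate that DataFrame. Every ("a", "b") pair from d2 then occurs at least twice in the concatenated frame, so keep=False always drops it: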
pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
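One caveat on the concat approach: drop_duplicates(..., keep=False) drops every row whose ("a", "b") pair occurs more than once, so a pair that is repeated within d1 but absent from d2 would be lost as well. A possible alternative, sketched below, is an anti-join built on DataFrame.merge with indicator=True. The helper name minus is hypothetical (it is not part of pandas), and the sketch assumes d1 has no column already called "_merge".

```python
import pandas as pd

d1 = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 1, 2],
    "b": [0, 1, 0, 1, 0, 1, 2],
    "c": [1, 2, 3, 4, 5, 6, 7],
})
d2 = pd.DataFrame({
    "a": [1, 0, 1],
    "b": [1, 0, 1],
    "c": [10, 11, 12],
})


def minus(left, right, keys):
    """Hypothetical helper: rows of `left` whose `keys` values do not occur in `right`."""
    # De-duplicate the right-hand keys so the left merge yields exactly one
    # output row per row of `left`, in the same order.
    right_keys = right[keys].drop_duplicates()
    # indicator=True adds a "_merge" column: "both" for matched rows,
    # "left_only" for rows of `left` without a match in `right`.
    marked = left.merge(right_keys, on=keys, how="left", indicator=True)
    # The merge result has a fresh RangeIndex, so apply the mask positionally
    # to keep the original index labels of `left`.
    return left[(marked["_merge"] == "left_only").to_numpy()]


res = minus(d1, d2, ["a", "b"])
print(res)  # rows with index labels 1, 2 and 6
```

Because the mask is computed per row of d1, key pairs that repeat inside d1 but never appear in d2 are all kept, and rows that exist only in d2 never enter the result.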