How to remove rows in DataFrame on column based on another DataFrame?


Problem Description

I'm trying to use DataFrame.subtract() in Spark 1.6.1 to remove rows from a DataFrame based on a column from another DataFrame. Let's use an example:

from pyspark.sql import Row
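# sqlContext is assumed to be an existing SQLContext (e.g. the one provided by the Spark 1.6 shell)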

df1 = sqlContext.createDataFrame([
    Row(name='Alice', age=2),
    Row(name='Bob', age=1),
]).alias('df1')

df2 = sqlContext.createDataFrame([
    Row(name='Bob'),
])

df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)

Since I want all rows from df1 that don't have name='Bob', I expect Row(age=2, name='Alice'). But I also get Bob back:

print(df1_without_df2.collect())
# [Row(age=1, name='Bob'), Row(age=2, name='Alice')]

After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:

df1_noage = sqlContext.createDataFrame([
    Row(name='Alice'),
    Row(name='Bob'),
]).alias('df1_noage')

df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]

Then I only get Alice, as expected. The weirdest observation I made is that it's possible to add keys, as long as they come after (in lexicographical order) the key I use in the join:

df1_zage = sqlContext.createDataFrame([
    Row(zage=2, name='Alice'),
    Row(zage=1, name='Bob'),
]).alias('df1_zage')

df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]

I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.

Recommended Answer

Well, there are some bugs here (the first issue looks like it's related to the same problem as SPARK-6231), and filing a JIRA ticket seems like a good idea, but SUBTRACT / EXCEPT is not the right choice for partial matches.

Instead, as of Spark 2.0, you can use an anti-join:

df1.join(df1_with_df2, ["name"], "leftanti").show()
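
For this particular example the intermediate df1_with_df2 isn't strictly required: since the goal is simply to drop every name that appears in df2, the anti-join can also reference df2 directly. A minimal sketch under that assumption:

df1.join(df2, ["name"], "leftanti").collect()
# expected: [Row(age=2, name='Alice')]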

In 1.6 you can do pretty much the same thing with a standard outer join:

import pyspark.sql.functions as F

ref = df1_with_df2.select("name").alias("ref")
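# ref holds the names that matched df2; after the left outer join below,
# a null ref.name marks a df1 row with no match, which filter() keeps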

(df1
    .join(ref, ref.name == df1.name, "leftouter")
    .filter(F.isnull("ref.name"))
    .drop(F.col("ref.name")))
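
With the example data this again leaves only Alice. A quick check (a sketch reusing the names defined above; the expected output is shown as a comment):

print(df1
    .join(ref, ref.name == df1.name, "leftouter")
    .filter(F.isnull("ref.name"))
    .drop(F.col("ref.name"))
    .collect())
# expected: [Row(age=2, name='Alice')]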
