How to remove rows in a DataFrame based on a column from another DataFrame?


Question

I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:

from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(name='Alice', age=2),
    Row(name='Bob', age=1),
]).alias('df1')

df2 = sqlContext.createDataFrame([
    Row(name='Bob'),
])

df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)

Since I want all rows from df1 which don't include name='Bob', I expect Row(age=2, name='Alice'). But I also retrieve Bob:

print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
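
One way to see what subtract is actually comparing here (a diagnostic sketch, not from the original post) is to inspect both sides' schemas. Spark's subtract, like SQL's EXCEPT, matches rows by column position, so any difference in column order or types between the two DataFrames makes otherwise identical rows compare as different:

# Diagnostic sketch, assuming the same session as the snippets above.
# subtract/EXCEPT matches rows positionally, so a changed column order
# or type on either side breaks the comparison.
df1.printSchema()
df1_with_df2.printSchema()
print(df1.columns)
print(df1_with_df2.columns)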

After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:

df1_noage = sqlContext.createDataFrame([
    Row(name='Alice'),
    Row(name='Bob'),
]).alias('df1_noage')

df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]

Then I only get Alice, as expected. The weirdest observation I made is that it's possible to add keys, as long as they come after (in the lexicographical sense) the key I use in the join:

df1_zage = sqlContext.createDataFrame([
    Row(zage=2, name='Alice'),
    Row(zage=1, name='Bob'),
]).alias('df1_zage')

df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]

I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.

Answer

Well, there are some bugs here (the first issue looks like it's related to the same problem as SPARK-6231), and filing a JIRA looks like a good idea, but SUBTRACT / EXCEPT is not the right choice for partial matches.

Instead, as of Spark 2.0, you can use an anti-join:

df1.join(df1_with_df2, ["name"], "leftanti").show()
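
Since a left anti-join keeps exactly the left-side rows with no match on the right, you can also join df1 directly against df2 instead of building df1_with_df2 first (a small variation on the snippet above; the output shown assumes the example data):

# Variation sketch: anti-join df1 directly against df2 on the name column.
df1.join(df2, ["name"], "leftanti").show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  2|
# +-----+---+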

In 1.6 you can do pretty much the same thing with a standard outer join:

import pyspark.sql.functions as F

ref = df1_with_df2.select("name").alias("ref")

(df1
    .join(ref, ref.name == df1.name, "leftouter")  # keep every df1 row
    .filter(F.isnull("ref.name"))                  # keep only rows with no match in ref
    .drop(F.col("ref.name")))                      # drop the helper join column
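
For the example data, the filter keeps only the df1 rows whose ref.name comes back null from the outer join, i.e. the rows with no match in ref (a usage sketch; the variable name df1_without_bob is mine):

# Usage sketch: the same pipeline bound to a name so it can be collected.
df1_without_bob = (df1
    .join(ref, ref.name == df1.name, "leftouter")
    .filter(F.isnull("ref.name"))
    .drop(F.col("ref.name")))
print(df1_without_bob.collect())
# Expected for the example data: [Row(age=2, name='Alice')]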
