How to remove rows in a DataFrame based on a column from another DataFrame?
Question
I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
    Row(name='Alice', age=2),
    Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
    Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't include name='Bob', I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
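For reference, subtract() is meant to behave like a set difference on whole rows. A minimal plain-Python sketch of the expected semantics (illustration only, not Spark; row values taken from the example above):

```python
# Set-difference semantics subtract() is supposed to have, in plain Python.
def subtract_rows(left, right):
    """Return rows of `left` that do not appear (as whole rows) in `right`."""
    return [row for row in left if row not in right]

df1_rows = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 1}]
matched = [{"name": "Bob", "age": 1}]  # rows of df1 that joined with df2

print(subtract_rows(df1_rows, matched))
# [{'name': 'Alice', 'age': 2}]
```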
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
    Row(name='Alice'),
    Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice, as expected. The weirdest observation I made is that it's possible to add keys, as long as they come after (in lexicographical order) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
    Row(zage=2, name='Alice'),
    Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that come after name.
Answer
Well, there are some bugs here (the first issue looks like it's related to the same problem as SPARK-6231), and filing a JIRA looks like a good idea, but subtract / EXCEPT is not the right choice for partial matches.
Instead, as of Spark 2.0, you can use an anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
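To see what the anti-join computes, here is a hedged plain-Python sketch of its semantics (illustration only, not how Spark implements it; column names follow the example above):

```python
# What a left anti-join does: keep rows of the left table whose join key
# has no match in the right table.
def left_anti_join(left, right, key):
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]

df1_rows = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 1}]
df2_rows = [{"name": "Bob"}]

print(left_anti_join(df1_rows, df2_rows, "name"))
# [{'name': 'Alice', 'age': 2}]
```

Unlike subtract(), only the join key has to match, so the remaining columns (age here) don't interfere.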
In 1.6 you can do pretty much the same thing with a standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
    .join(ref, ref.name == df1.name, "leftouter")
    .filter(F.isnull("ref.name"))
    .drop(F.col("ref.name")))
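The workaround above can be sketched in plain Python to show why it is equivalent to the anti-join (illustration only; the real Spark version keeps the joined column until drop() removes it):

```python
# The 1.6 workaround: left outer join, then keep only rows where the
# right-hand side is null (i.e. the key found no match).
def left_outer_then_null_filter(left, right, key):
    right_keys = {row[key] for row in right}
    out = []
    for row in left:
        # Joined column from `right`; None when there is no match.
        ref_name = row[key] if row[key] in right_keys else None
        if ref_name is None:  # filter(F.isnull("ref.name"))
            out.append(row)   # drop(): we never attach ref_name to the output
    return out

df1_rows = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 1}]
ref_rows = [{"name": "Bob"}]

print(left_outer_then_null_filter(df1_rows, ref_rows, "name"))
# [{'name': 'Alice', 'age': 2}]
```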