Difference between two DataFrames columns in pyspark
Question
I am looking for a way to find the difference in values between the columns of two DataFrames. For example:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)
df_a = sql_context.createDataFrame([("a", 3), ("b", 5), ("c", 7)], ["name", "id"])
df_b = sql_context.createDataFrame([("a", 3), ("b", 10), ("c", 13)], ["name", "id"])
DataFrame A:
+----+---+
|name| id|
+----+---+
| a| 3|
| b| 5|
| c| 7|
+----+---+
DataFrame B:
+----+---+
|name| id|
+----+---+
| a| 3|
| b| 10|
| c| 13|
+----+---+
My goal is a list of id column elements that are in A but not in B, e.g. [5, 7]. I was thinking of doing a join on id, but I don't see a good way to do it.
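One way to express that join, assuming Spark 2.0+ where the DataFrame API supports anti joins (this sketch is not part of the original question), is a left_anti join on id:

# Sketch: keep the rows of df_a whose id has no match in df_b (left anti join)
# Assumes the df_a / df_b DataFrames created above and Spark 2.0+
missing_in_b = df_a.join(df_b, on="id", how="left_anti")
missing_in_b.select("id").show()  # the remaining ids are 5 and 7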
A naive solution could be:
list_a = df_a.select("id").rdd.map(lambda x: x.asDict()["id"]).collect()
list_b = df_b.select("id").rdd.map(lambda x: x.asDict()["id"]).collect()
result = list(set(list_a).difference(list_b))
But is there a simple solution that can be obtained with just DataFrame operations, except perhaps the final collect?
Answer
Use the subtract function:
df_a.select('id').subtract(df_b.select('id')).collect()
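If you need the plain Python list from the question (e.g. [5, 7]), the collected Rows can be flattened; a minimal sketch:

# collect() returns Row objects; pull out the id field from each one
rows = df_a.select('id').subtract(df_b.select('id')).collect()
ids = [r.id for r in rows]  # e.g. [5, 7]; subtract does not guarantee order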