Difference between two DataFrames columns in pyspark


Question

I am looking for a way to find the difference in values between the columns of two DataFrames. For example:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)

df_a = sql_context.createDataFrame([("a", 3), ("b", 5), ("c", 7)], ["name", "id"])

df_b = sql_context.createDataFrame([("a", 3), ("b", 10), ("c", 13)], ["name", "id"])

DataFrame A:

+----+---+
|name| id|
+----+---+
|   a|  3|
|   b|  5|
|   c|  7|
+----+---+

DataFrame B:

+----+---+
|name| id|
+----+---+
|   a|  3|
|   b| 10|
|   c| 13|
+----+---+

My goal is a list of the id column elements that are in A but not in B, e.g. [5, 7]. I was thinking of doing a join on id, but I don't see a good way to do it.
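(For reference, the join idea can be expressed directly: a left anti join keeps the rows of the left DataFrame that have no match on the join key in the right one. A minimal sketch against the DataFrames above, using PySpark's left_anti join type:)

# Keep the ids of df_a that have no matching id in df_b.
df_a.select("id").join(df_b.select("id"), on="id", how="left_anti").collect()
# -> [Row(id=5), Row(id=7)]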

A naive solution could be:

# Collect the id columns into plain Python lists, then take the set difference.
list_a = df_a.select("id").rdd.map(lambda x: x.asDict()["id"]).collect()
list_b = df_b.select("id").rdd.map(lambda x: x.asDict()["id"]).collect()

result = list(set(list_a).difference(list_b))

But is there a simple solution that can be obtained with just DataFrame operations, apart perhaps from the final collect?

Answer

Use the subtract function:

df_a.select('id').subtract(df_b.select('id')).collect()
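Against the DataFrames above this returns Row(id=5) and Row(id=7) (order not guaranteed). Note that subtract is a set difference, so duplicate rows are dropped. A minimal sketch of unpacking the Row objects into the plain list of ids you asked for:

# collect() returns Row objects; pull out the id field from each.
rows = df_a.select('id').subtract(df_b.select('id')).collect()
result = [row.id for row in rows]  # e.g. [5, 7]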

