DataFrame equality in Apache Spark


Problem Description



Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API.

Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows & columns?

The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness and hence the need to check for the equivalence/equality on a meaningful test data set.

Solution

There are some standard approaches in the Apache Spark test suites; however, most of these involve collecting the data locally, so if you want to do equality testing on large DataFrames they are likely not suitable.

Check the schemas first. Then you could take the intersection of df1 and df2 to produce df3, and verify that the counts of df1, df2, and df3 are all equal (however, this only works if there are no duplicate rows; if there are differing duplicate rows, this method could still return true).
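A Spark-free sketch of that check, with plain Python tuples standing in for DataFrame rows and set intersection standing in for `DataFrame.intersect` (both deduplicate their result); the function name is illustrative, not Spark API:

```python
# Sketch of the intersect-and-count check. In Spark this would be:
#   df3 = df1.intersect(df2)
#   df1.count() == df2.count() == df3.count()
# Here, tuples stand in for Rows and a set stands in for the distinct
# result that intersect() produces.

def equal_by_intersection(rows1, rows2):
    df3 = set(rows1) & set(rows2)  # intersect() has set (distinct) semantics
    return len(rows1) == len(rows2) == len(df3)

a = [(1, "x"), (2, "y")]
b = [(2, "y"), (1, "x")]
print(equal_by_intersection(a, b))  # True: row order does not matter

c = [(1, "x"), (1, "x"), (2, "y")]
print(equal_by_intersection(c, c))  # False: duplicates defeat the check
```

The second call shows the duplicate-row caveat: the deduplicated intersection has fewer rows than either input, so the count comparison breaks down as soon as duplicates appear.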

Another option would be to get the underlying RDDs of both DataFrames, map each row to (Row, 1), do a reduceByKey to count the occurrences of each Row, then cogroup the two resulting RDDs and do a regular aggregate, returning false if any of the iterators are not equal.
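That multiset comparison can be sketched without Spark using `collections.Counter`: the Counter plays the role of the `(Row, 1)` → `reduceByKey` counts, and iterating over the union of keys plays the role of the cogroup. The function name is illustrative, not Spark API:

```python
from collections import Counter

def multiset_equal(rows1, rows2):
    # reduceByKey step: count occurrences of each row on each side
    counts1, counts2 = Counter(rows1), Counter(rows2)
    # cogroup step: compare counts for every row seen on either side
    # (a Counter returns 0 for missing keys, so one-sided rows fail here)
    return all(counts1[row] == counts2[row]
               for row in set(counts1) | set(counts2))

x = [(1, "a"), (1, "a"), (2, "b")]
y = [(2, "b"), (1, "a"), (1, "a")]
print(multiset_equal(x, y))  # True: same rows, same multiplicities

z = [(1, "a"), (2, "b"), (2, "b")]
print(multiset_equal(x, z))  # False: multiplicities differ
```

Unlike the intersect-based check, this handles duplicate rows correctly, because row multiplicities are compared explicitly rather than lost to deduplication.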
