Efficient PairRDD operations on DataFrame with Spark SQL GROUP BY
Question
This question is about the duality between DataFrame and RDD when it comes to aggregation operations. In Spark SQL one can use table-generating UDFs for custom aggregations, but creating one of those is typically noticeably less user-friendly than using the aggregation functions available for RDDs, especially if table output is not required.
Is there an efficient way to apply pair RDD operations such as aggregateByKey to a DataFrame which has been grouped using GROUP BY or ordered using ORDER BY?
Normally, one would need an explicit map step to create key-value tuples, e.g., dataFrame.rdd.map(row => (row.getString(row.fieldIndex("category")), row)).aggregateByKey(...). Can this be avoided?
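For reference, the explicit map step described above might look like the following sketch (the "category"/"value" columns and the per-key (sum, count) aggregation are illustrative, not from the original question):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.6-era setup, as in the rest of this page.
val conf = new SparkConf().setAppName("pairrdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val dataFrame = sc.parallelize(Seq(
  ("foo", 1.0), ("foo", 2.0), ("bar", 3.0)
)).toDF("category", "value")

// Explicit map to (key, value) pairs, then aggregateByKey to build
// a per-key (sum, count) accumulator.
val sums = dataFrame.rdd
  .map(row => (row.getString(row.fieldIndex("category")),
               row.getDouble(row.fieldIndex("value"))))
  .aggregateByKey((0.0, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),   // fold one value into an accumulator
    (a, b)   => (a._1 + b._1, a._2 + b._2)) // merge two partition accumulators
```

This works, but it drops back to untyped Row handling and loses the DataFrame schema and optimizer, which is exactly what the question hopes to avoid.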
Answer
Not really. While DataFrames can be converted to RDDs and vice versa, this is a relatively complex operation, and methods like DataFrame.groupBy don't have the same semantics as their counterparts on RDD.
The closest thing you can get is the new Dataset API introduced in Spark 1.6.0. It provides a much closer integration with DataFrames and a GroupedDataset class with its own set of methods including reduce, cogroup or mapGroups:
import sqlContext.implicits._  // required for toDF, $"..." and the Record encoder

case class Record(id: Long, key: String, value: Double)

val df = sc.parallelize(Seq(
  (1L, "foo", 3.0), (2L, "bar", 5.6),
  (3L, "foo", -1.0), (4L, "bar", 10.0)
)).toDF("id", "key", "value")

val ds = df.as[Record]

ds.groupBy($"key").reduce((x, y) => if (x.id < y.id) x else y).show
// +-----+-----------+
// | _1| _2|
// +-----+-----------+
// |[bar]|[2,bar,5.6]|
// |[foo]|[1,foo,3.0]|
// +-----+-----------+
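The same per-key selection can also be written with mapGroups, keying by a typed function instead of a column. A minimal sketch, assuming the ds defined above (note that in later Spark versions the typed groupBy was renamed groupByKey on KeyValueGroupedDataset):

```scala
// Pick the record with the smallest id in each group.
// minBy is available because Iterator is a TraversableOnce.
val firstPerKey = ds.groupBy(_.key).mapGroups {
  (key, records) => records.minBy(_.id)
}
firstPerKey.show
```

Unlike reduce, mapGroups gives you the whole group as an Iterator, so arbitrary per-group logic is possible, at the cost of materializing each group on one executor.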
In some specific cases it is possible to leverage Orderable semantics to group and process data using structs or arrays. You'll find an example in SPARK DataFrame: select the first row of each group.
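A minimal sketch of that struct-based approach, assuming the df defined earlier: structs compare field by field in order, so min(struct(id, value)) returns the (id, value) pair with the smallest id for each key, all without leaving the DataFrame API.

```scala
import org.apache.spark.sql.functions.{min, struct}

// Orderable struct trick: min over struct(id, value) selects, per key,
// the struct whose first field (id) is smallest.
val firstRows = df
  .groupBy($"key")
  .agg(min(struct($"id", $"value")).as("row"))
  .select($"key", $"row.id", $"row.value")
```

This stays inside Catalyst-optimized code, so it is usually preferable to dropping down to RDDs when the "pick one row per group" pattern is all you need.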