Efficient PairRDD operations on DataFrame with Spark SQL GROUP BY
Question
This question is about the duality between DataFrame and RDD when it comes to aggregation operations. In Spark SQL one can use table-generating UDFs for custom aggregations, but creating one of those is typically noticeably less user-friendly than using the aggregation functions available for RDDs, especially if table output is not required.
Is there an efficient way to apply pair RDD operations such as aggregateByKey to a DataFrame which has been grouped using GROUP BY or ordered using ORDER BY?
Normally, one would need an explicit map step to create key-value tuples, e.g., dataFrame.rdd.map(row => (row.getString(row.fieldIndex("category")), row)).aggregateByKey(...). Can this be avoided?
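For reference, the explicit map step described above might look like the following sketch (the "category"/"value" columns and the per-key (sum, count) aggregation are illustrative, not from the original question):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Spark 1.6-era setup, as in the rest of this page.
val conf = new SparkConf().setAppName("pairrdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val dataFrame = sc.parallelize(Seq(
  ("foo", 1.0), ("foo", 2.0), ("bar", 3.0)
)).toDF("category", "value")

// Explicit map to (key, value) pairs, then aggregateByKey to build
// a per-key (sum, count) accumulator.
val sums = dataFrame.rdd
  .map(row => (row.getString(row.fieldIndex("category")),
               row.getDouble(row.fieldIndex("value"))))
  .aggregateByKey((0.0, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),   // fold one value into an accumulator
    (a, b)   => (a._1 + b._1, a._2 + b._2)) // merge two partition accumulators
```

This works, but it drops back to untyped Row handling and loses the DataFrame schema and optimizer, which is exactly what the question hopes to avoid.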
Answer
Not really. While DataFrames can be converted to RDDs and vice versa, this is a relatively complex operation, and methods like DataFrame.groupBy don't have the same semantics as their counterparts on RDD.
The closest thing you can get is the new Dataset API introduced in Spark 1.6.0. It provides a much closer integration with DataFrames and a GroupedDataset class with its own set of methods including reduce, cogroup or mapGroups:
import sqlContext.implicits._  // required for toDF, $"..." and the Record encoder

case class Record(id: Long, key: String, value: Double)

val df = sc.parallelize(Seq(
  (1L, "foo", 3.0), (2L, "bar", 5.6),
  (3L, "foo", -1.0), (4L, "bar", 10.0)
)).toDF("id", "key", "value")

val ds = df.as[Record]

ds.groupBy($"key").reduce((x, y) => if (x.id < y.id) x else y).show
// +-----+-----------+
// | _1| _2|
// +-----+-----------+
// |[bar]|[2,bar,5.6]|
// |[foo]|[1,foo,3.0]|
// +-----+-----------+
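The same per-key selection can also be written with mapGroups, keying by a typed function instead of a column. A minimal sketch, assuming the ds defined above (note that in later Spark versions the typed groupBy was renamed groupByKey on KeyValueGroupedDataset):

```scala
// Pick the record with the smallest id in each group.
// minBy is available because Iterator is a TraversableOnce.
val firstPerKey = ds.groupBy(_.key).mapGroups {
  (key, records) => records.minBy(_.id)
}
firstPerKey.show
```

Unlike reduce, mapGroups gives you the whole group as an Iterator, so arbitrary per-group logic is possible, at the cost of materializing each group on one executor.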
In some specific cases it is possible to leverage Orderable semantics to group and process data using structs or arrays. You'll find an example in SPARK DataFrame: select the first row of each group.
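A minimal sketch of that struct-based approach, assuming the df defined earlier: structs compare field by field in order, so min(struct(id, value)) returns the (id, value) pair with the smallest id for each key, all without leaving the DataFrame API.

```scala
import org.apache.spark.sql.functions.{min, struct}

// Orderable struct trick: min over struct(id, value) selects, per key,
// the struct whose first field (id) is smallest.
val firstRows = df
  .groupBy($"key")
  .agg(min(struct($"id", $"value")).as("row"))
  .select($"key", $"row.id", $"row.value")
```

This stays inside Catalyst-optimized code, so it is usually preferable to dropping down to RDDs when the "pick one row per group" pattern is all you need.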