Spark DataFrame: operate on groups
Question
I've got a DataFrame I'm operating on, and I want to group by a set of columns and operate per-group on the rest of the columns. In regular RDD-land I think it would look something like this:
rdd.map( tup => ((tup._1, tup._2, tup._3), tup) )
  .groupByKey()
  .foreachPartition( iter => doSomeJob(iter) )
In DataFrame-land I'd start like this:
df.groupBy("col1", "col2", "col3") // Reference by name
For example, I want to build a single MongoDB document per ("col1", "col2", "col3") group (by iterating through the associated Rows in the group), scale down to N partitions, then insert the docs into a MongoDB database. The N limit is the max number of simultaneous connections I want.
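Concretely, a minimal sketch of what I have in mind, continuing from the groupByKey above (buildDocument and the actual insert are hypothetical placeholders, not a real driver API):

// Sketch only; buildDocument and the Mongo insert are placeholders.
val N = 8 // assumed max number of simultaneous MongoDB connections

rdd.map( tup => ((tup._1, tup._2, tup._3), tup) )
  .groupByKey()
  .coalesce(N) // at most N partitions => at most N connections open at once
  .foreachPartition { groups =>
    // open one MongoDB connection per partition here
    groups.foreach { case (key, rows) =>
      val doc = buildDocument(key, rows) // hypothetical: one document per group
      // ... insert doc via the driver of your choice ...
    }
  }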
Any suggestions?
Answer
You can do a self-join. First get the groups:
val groups = df.groupBy($"col1", $"col2", $"col3").agg($"col1", $"col2", $"col3")
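Equivalently, you can pull out the distinct grouping keys with a plain select/distinct, which avoids the somewhat unusual agg call:

// Same result: one row per distinct (col1, col2, col3) key
val groups = df.select($"col1", $"col2", $"col3").distinct()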
Then you can join this back to the original DataFrame:
val joinedDF = groups
  .select($"col1" as "l_col1", $"col2" as "l_col2", $"col3" as "l_col3")
  .join(df, $"col1" <=> $"l_col1" and $"col2" <=> $"l_col2" and $"col3" <=> $"l_col3")
While this gets you exactly the same data you had originally (plus 3 additional, redundant key columns), you could do another join to add a column with the MongoDB document ID for the (col1, col2, col3) group associated with each row.
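For instance, here's a sketch of that second join, assuming you generate the document IDs with monotonically_increasing_id from org.apache.spark.sql.functions (the doc_id column name is just illustrative):

import org.apache.spark.sql.functions.monotonically_increasing_id

// One synthetic, unique (though non-contiguous) ID per distinct group
val groupsWithId = groups.withColumn("doc_id", monotonically_increasing_id())

// Every row of df now carries the doc_id of its (col1, col2, col3) group
val docIdDF = groupsWithId
  .select($"col1" as "l_col1", $"col2" as "l_col2", $"col3" as "l_col3", $"doc_id")
  .join(df, $"col1" <=> $"l_col1" and $"col2" <=> $"l_col2" and $"col3" <=> $"l_col3")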
At any rate, in my experience joins and self-joins are the way you handle complicated stuff in DataFrames.