How to use Dataset to groupBy
Problem description
I have a requirement that I currently solve with an RDD, like so:
val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
)

// Group the pairs by city and print each city's list of names.
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)
The result is:
(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))
How can I do this with a Dataset in Spark 2.0? I have a way using a custom function, but it feels so complicated. Isn't there a simple, built-in method?
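For context, such a custom-function route might look roughly like the typed groupByKey/mapGroups sketch below; the SparkSession name spark is an assumption, as in the Spark 2.0 shell:

import spark.implicits._ // assumes a SparkSession named `spark`

// Typed route: group by a key function, then fold each group by hand.
// .toList materializes each group before its iterator is exhausted.
test.toDS()
  .groupByKey(_._1)
  .mapGroups((city, rows) => (city, rows.map(_._2).toList: Seq[String]))
  .show(false)

This works, but it is noticeably more verbose than a built-in aggregate.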
Recommended answer
I would suggest you start by creating a case class:
case class Monkey(city: String, firstName: String)
This case class should be defined outside the main class. Then you can just use the toDS function, call groupBy, and apply the aggregation function collect_list, as below:
import sqlContext.implicits._ // in Spark 2.0, sqlContext is available as spark.sqlContext
import org.apache.spark.sql.functions._

val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
)

sc.parallelize(test)
  .map(row => Monkey(row._1, row._2)) // wrap each tuple in the case class
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list") // collect each group's names into an array column
  .show(false)
The output will be:
+-----------+------------------------+
|city |list |
+-----------+------------------------+
|Los Angeles|[Tom] |
|Detroit |[Michael, Peter, George]|
|Chicago |[David, Andrew] |
|Houston |[John] |
|New York |[Jack] |
+-----------+------------------------+
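If you want the rows back as typed Scala values instead of a DataFrame, one option is to view the result with as and collect it. A minimal sketch, reusing the pipeline above; the name groupedDs is just for illustration:

// Sketch: view the aggregated DataFrame as a typed Dataset.
// The two columns (city, list) map positionally onto the tuple.
val groupedDs = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")
  .as[(String, Seq[String])]

val rows: Array[(String, Seq[String])] = groupedDs.collect()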
You can always convert back to an RDD by just calling the .rdd function.