How to use Spark SQL DataFrame with flatMap?


Question

I am using the Spark Scala API. I have a Spark SQL DataFrame (read from an Avro file) with the following schema:

root
|-- ids: array (nullable = true)
|    |-- element: map (containsNull = true)
|    |    |-- key: integer
|    |    |-- value: string (valueContainsNull = true)
|-- match: array (nullable = true)
|    |-- element: integer (containsNull = true)

Essentially 2 columns [ ids: List[Map[Int, String]], match: List[Int] ]. Sample data that looks like:

[List(Map(1 -> a), Map(2 -> b), Map(3 -> c), Map(4 -> d)),List(0, 0, 1, 0)]
[List(Map(5 -> c), Map(6 -> a), Map(7 -> e), Map(8 -> d)),List(1, 0, 1, 0)]
...

What I would like to do is flatMap() each row to produce 3 columns [id, property, match]. Using the above 2 rows as the input data we would get:

[1,a,0]
[2,b,0]
[3,c,1]
[4,d,0]
[5,c,1]
[6,a,0]
[7,e,1]
[8,d,0]
...

and then groupBy the String property (e.g. a, b, ...) to produce count("property") and sum("match"):

 a    2    0
 b    1    0
 c    2    2
 d    2    0
 e    1    1

I would like to do something like:

val result = myDataFrame.select("ids", "match").flatMap(
    (row: Row) => row.getList[Map[Int, String]](0).toArray)
result.groupBy("property").agg(Map(
    "property" -> "count",
    "match" -> "sum"))

The problem is that the flatMap converts DataFrame to RDD. Is there a good way to do a flatMap type operation followed by groupBy using DataFrames?

Answer

What does flatMap do that you want? It converts each input row into 0 or more rows. It can filter rows out, or it can add new ones. In SQL, you get the same functionality with a join. Can you do what you want to do with a join?

Alternatively, you could also look at DataFrame.explode, which is just a specific kind of join (you can easily craft your own explode by joining a DataFrame to a UDF). explode takes a single column as input, lets you split it or convert it into multiple values, and then joins the original row back onto the new rows. So:

user      groups
griffin   mkt,it,admin

might become:

user      group
griffin   mkt
griffin   it
griffin   admin
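
In the Spark 1.x Scala API that transformation could look roughly like this (a sketch only; the `users` DataFrame name and its column names are assumed from the table above):

```scala
// Sketch: split the comma-separated "groups" column into one row
// per group using the single-column form of DataFrame.explode
// (Spark 1.x API; `users` has columns "user" and "groups").
val perGroup = users.explode("groups", "group") {
  (groups: String) => groups.split(",").toSeq
}
// perGroup now has columns "user", "groups", and the new "group";
// select("user", "group") gives the second table shown above.
```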

So I would say take a look at DataFrame.explode and if that doesn't get you there easily, try joins with UDFs.
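
Applied to the question's schema, a sketch using the multi-column overload of explode might look like the following (Spark 1.x; the `df` name and the case class are assumptions, and the flag column is called `matched` because `match` is a reserved word in Scala):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{count, sum}

// One output row per (id -> property) entry, carrying the match
// flag along from the parallel "match" array.
case class IdProp(id: Int, property: String, matched: Int)

val flattened = df.explode(df("ids"), df("match")) { row: Row =>
  val ids     = row.getSeq[Map[Int, String]](0)
  val matches = row.getSeq[Int](1)
  ids.zip(matches).flatMap { case (idMap, m) =>
    idMap.map { case (id, prop) => IdProp(id, prop, m) }
  }
}

val result = flattened
  .groupBy("property")
  .agg(count("property"), sum("matched"))
```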
