How to use Spark SQL DataFrame with flatMap?
Question
I am using the Spark Scala API. I have a Spark SQL DataFrame (read from an Avro file) with the following schema:
root
 |-- ids: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: integer
 |    |    |-- value: string (valueContainsNull = true)
 |-- match: array (nullable = true)
 |    |-- element: integer (containsNull = true)
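For reference, such a DataFrame is typically produced with the spark-avro package; a minimal sketch, assuming a Spark 1.x sqlContext and a hypothetical file path:

val myDataFrame = sqlContext.read
  .format("com.databricks.spark.avro")  // spark-avro data source
  .load("/path/to/records.avro")        // hypothetical path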
Essentially 2 columns [ ids: List[Map[Int, String]], match: List[Int] ]. Sample data looks like:
[List(Map(1 -> a), Map(2 -> b), Map(3 -> c), Map(4 -> d)),List(0, 0, 1, 0)]
[List(Map(5 -> c), Map(6 -> a), Map(7 -> e), Map(8 -> d)),List(1, 0, 1, 0)]
...
What I would like to do is flatMap() each row to produce 3 columns [id, property, match]. Using the above 2 rows as the input data we would get:
[1,a,0]
[2,b,0]
[3,c,1]
[4,d,0]
[5,c,1]
[6,a,0]
[7,e,1]
[8,d,0]
...
and then groupBy the String property (e.g. a, b, ...) to produce count("property") and sum("match"):
a 2 0
b 1 0
c 2 2
d 2 0
e 1 1
I was thinking of doing something like:
val result = myDataFrame.select("ids", "match").flatMap(
  (row: Row) => row.getList[Map[Int, String]](0).toArray() )

result.groupBy("property").agg(Map(
  "property" -> "count",
  "match" -> "sum" ) )
The problem is that flatMap converts the DataFrame to an RDD. Is there a good way to do a flatMap-type operation followed by a groupBy using DataFrames?
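For concreteness, a minimal untested sketch of that RDD detour, with a hypothetical case class (matched rather than match, which is a Scala keyword) and toDF via import sqlContext.implicits._:

case class IdProp(id: Int, property: String, matched: Int)

val flat = myDataFrame.select("ids", "match").rdd.flatMap { row =>
  val ids   = row.getSeq[Map[Int, String]](0)
  val flags = row.getSeq[Int](1)
  ids.zip(flags).flatMap { case (m, flag) =>
    m.map { case (id, prop) => IdProp(id, prop, flag) }  // one output row per map entry
  }
}.toDF()

flat.groupBy("property").agg(Map("property" -> "count", "matched" -> "sum"))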
Answer
What does flatMap do that you want? It converts each input row into 0 or more rows. It can filter them out, or it can add new ones. In SQL, to get the same functionality you use join. Can you do what you want to do with a join?
Alternatively, you could also look at DataFrame.explode, which is just a specific kind of join (you can easily craft your own explode by joining a DataFrame to a UDF). explode takes a single column as input and lets you split it or convert it into multiple values, and then joins the original row back onto the new rows. So:
user groups
griffin mkt,it,admin
could become:
user group
griffin mkt
griffin it
griffin admin
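In Spark 1.x that might look like the following sketch (hypothetical df with a comma-separated groups string column; explode("groups", "group") emits one output row per returned element and joins the original columns back on):

val exploded = df.explode("groups", "group") { groups: String =>
  groups.split(",").toSeq  // one new row per group
}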
So I would say take a look at DataFrame.explode, and if that doesn't get you there easily, try joins with UDFs.
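Applied back to the question's schema, a similar untested sketch (reusing the hypothetical IdProp case class from the earlier sketch; assumes Spark 1.x, where DataFrame.explode accepts the input columns plus a Row => TraversableOnce[A] function):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{count, sum}

val flattened = myDataFrame.explode(myDataFrame("ids"), myDataFrame("match")) { row: Row =>
  val ids   = row.getSeq[Map[Int, String]](0)  // input columns arrive in the given order
  val flags = row.getSeq[Int](1)
  ids.zip(flags).flatMap { case (m, flag) =>
    m.map { case (id, prop) => IdProp(id, prop, flag) }
  }
}

flattened.groupBy("property").agg(count("property"), sum("matched"))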