Spark Scala: convert arbitrary N columns into Map


Question

I have the following data structure representing movie IDs (first column) and ratings from different users for that movie in the remaining columns, something like this:

+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|movieId|   1|   2|   3|   4|   5|   6|   7|   8|   9|  10|  11|  12|  13|  14|  15|
+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|   1580|null|null| 3.5| 5.0|null|null|null|null|null|null|null|null|null|null|null|
|   3175|null|null|null|null|null|null|null|null|null|null|null|null|null| 5.0|null|
|   3794|null|null|null|null|null|null|null|null|null|null|null| 3.0|null|null|null|
|   2659|null|null|null| 3.0|null|null|null|null|null|null|null|null|null|null|null|
+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
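
For experimentation, here is one way to build a small DataFrame of this shape. This is a minimal sketch: the column count and values are illustrative (four user columns instead of the fifteen shown above), and the app name is made up.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("movie-ratings").getOrCreate()
import spark.implicits._

// Option[Double] produces a nullable double column; None shows up as null.
val rows: Seq[(Long, Option[Double], Option[Double], Option[Double], Option[Double])] = Seq(
  (1580L, None, None, Some(3.5), Some(5.0)),
  (3175L, None, None, None, None),
  (3794L, None, None, Some(3.0), None),
  (2659L, None, Some(3.0), None, None)
)
val df = rows.toDF("movieId", "1", "2", "3", "4")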

I want to convert this DataFrame into a Dataset of

final case class MovieRatings(movie_id: Long, rating: Map[Long, Double])

so that each row would look like this:

[1580, [1 -> null, 2 -> null, 3 -> 3.5, 4 -> 5.0, 5 -> null, 6 -> null, 7 -> null,...]]

and so on.

How can this be done?

The thing here is that the number of user columns is arbitrary, and I want to zip them into a single map column, leaving the first column untouched.

Answer

First, you have to transform your DataFrame into one whose schema matches your case class; then you can use .as[MovieRatings] to convert the DataFrame into a Dataset[MovieRatings]:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

// Build a MapType column using `functions.map`, which takes a flattened list of
// alternating keys and values: each column name parsed into a Long literal,
// followed by that column's value. This assumes every column after movieId
// has a numeric name (a user id), so `name.toLong` succeeds.
val mapColumn: Column = map(df.columns.tail.flatMap(name => Seq(lit(name.toLong), $"$name")): _*)

// Rename the columns to match the case class fields ("movie_id" and "rating"),
// then convert to a Dataset:
df.select($"movieId" as "movie_id", mapColumn as "rating")
  .as[MovieRatings]
  .show(false)
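
One caveat: show() works fine with null ratings, but if you later collect() the Dataset, a null map value cannot be decoded into a scala.Double and Spark will raise an error. A hedged sketch of one way around that, assuming you are on Spark 3.0+ (where map_filter is available) and are happy to drop missing ratings, reusing df and mapColumn from above:

import org.apache.spark.sql.functions.map_filter

// Keep only map entries whose rating is non-null, so every value
// in the resulting Map[Long, Double] is a real rating.
val ratingsOnly = df
  .select($"movieId" as "movie_id",
          map_filter(mapColumn, (_, v) => v.isNotNull) as "rating")
  .as[MovieRatings]

ratingsOnly.show(false)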
