Spark Scala: convert arbitrary N columns into Map


Question

I have the following data structure representing movie ids (first column) and ratings for different users for that movie in the rest of columns - something like that:

+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|movieId|   1|   2|   3|   4|   5|   6|   7|   8|   9|  10|  11|  12|  13|  14|  15|
+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|   1580|null|null| 3.5| 5.0|null|null|null|null|null|null|null|null|null|null|null|
|   3175|null|null|null|null|null|null|null|null|null|null|null|null|null| 5.0|null|
|   3794|null|null|null|null|null|null|null|null|null|null|null| 3.0|null|null|null|
|   2659|null|null|null| 3.0|null|null|null|null|null|null|null|null|null|null|null|

I want to convert this DataFrame to a DataSet of

final case class MovieRatings(movie_id: Long, ratings: Map[Long, Double])

so that the result looks like this:

[1580, [1 -> null, 2 -> null, 3 -> 3.5, 4 -> 5.0, 5 -> null, 6 -> null, 7 -> null,...]]

and so on.

How can this be done?

The thing here is that number of users is arbitrary. And I want to zip those into a single column leaving the first column untouched.

Answer

First, you have to transform your DataFrame into one with a schema matching your case class; then you can use .as[MovieRatings] to convert the DataFrame into a Dataset[MovieRatings]:

import org.apache.spark.sql.functions._
import spark.implicits._

// define a new MapType column using `functions.map`, passing a flattened-list of
// column name (as a Long column) and column value
val mapColumn: Column = map(df.columns.tail.flatMap(name => Seq(lit(name.toLong), $"$name")): _*)

// select movie id and map column with names matching the case class, and convert to Dataset:
df.select($"movieId" as "movie_id", mapColumn as "ratings")
  .as[MovieRatings]
  .show(false)
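The core idea of the map(...) call above can be illustrated without a Spark cluster. The plain-Scala sketch below builds the same per-movie Map[Long, Double] from rows of optional ratings; the sample user ids and rows are made-up illustrations, and unlike the Spark version (whose map keeps null entries, as in the example output above), this sketch simply drops missing ratings:

```scala
// A plain-Scala sketch of the same transformation, assuming a header of
// user ids plus rows of (movieId, one Option[Double] rating per user).
final case class MovieRatings(movie_id: Long, ratings: Map[Long, Double])

val userIds: Seq[Long] = Seq(1L, 2L, 3L, 4L)
val rows: Seq[(Long, Seq[Option[Double]])] = Seq(
  (1580L, Seq(None, None, Some(3.5), Some(5.0))),
  (2659L, Seq(None, None, None, Some(3.0)))
)

// Pair each rating with its user id and keep only the defined ones,
// mirroring what the flattened (lit(name), $"name") pairs do in Spark.
val result: Seq[MovieRatings] = rows.map { case (movieId, ratings) =>
  MovieRatings(
    movieId,
    userIds.zip(ratings).collect { case (user, Some(r)) => user -> r }.toMap
  )
}
```

For movie 1580 this yields MovieRatings(1580, Map(3 -> 3.5, 4 -> 5.0)), which matches the non-null entries of the first row in the table above.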

