如何从SparkSQL DataFrame中的MapType列获取键和值 [英] How to get keys and values from MapType column in SparkSQL DataFrame

查看:1080
本文介绍了如何从SparkSQL DataFrame中的MapType列获取键和值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在一个实木复合地板文件中有数据,该文件包含2个字段:object_id: Stringalpha: Map<>.

I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>.

将其读入sparkSQL的数据帧中,其模式如下所示:

It is read into a data frame in sparkSQL and the schema looks like this:

scala> alphaDF.printSchema()
root
 |-- object_id: string (nullable = true)
 |-- ALPHA: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)

我正在使用Spark 2.0,并且尝试创建一个新的数据框,其中的列必须为object_id以及ALPHA映射的键,如object_id, key1, key2, key2, ...

I am using Spark 2.0 and I am trying to create a new data frame in which columns need to be object_id plus keys of the ALPHA map as in object_id, key1, key2, key2, ...

我首先试图查看是否至少可以这样访问地图:

I was first trying to see if I could at least access the map like this:

scala> alphaDF.map(a => a(0)).collect()
<console>:32: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other
types will be added in future releases.
   alphaDF.map(a => a(0)).collect()

但不幸的是,我似乎无法弄清楚如何访问地图的键.

but unfortunately I can't seem to be able to figure out how to access the keys of the map.

有人可以告诉我一种在新数据框中获取object_id加映射键作为列名和映射值作为各自值的方法吗?

Can someone please show me a way to get the object_id plus map keys as column names and map values as respective values in a new dataframe?

推荐答案

火花> = 2.3

您可以使用map_keys函数来简化过程:

You can simplify the process using map_keys function:

import org.apache.spark.sql.functions.map_keys

还有map_values功能,但是在这里不会直接有用.

There is also map_values function, but it won't be directly useful here.

火花< 2.3

一般方法可以用几个步骤来表示.首先需要进口:

General method can be expressed in a few steps. First required imports:

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Row

和示例数据:

val ds = Seq(
  (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
  (2, Map("foo" -> (3, "c"))),
  (3, Map("bar" -> (4, "d")))
).toDF("id", "alpha")

要提取密钥,我们可以使用UDF(Spark< 2.3)

To extract keys we can use UDF (Spark < 2.3)

val map_keys = udf[Seq[String], Map[String, Row]](_.keys.toSeq)

或内置函数

import org.apache.spark.sql.functions.map_keys

val keysDF = df.select(map_keys($"alpha"))

找到不同的人

val distinctKeys = keysDF.as[Seq[String]].flatMap(identity).distinct
  .collect.sorted

您还可以使用explode概括keys提取:

You can also generalize keys extraction with explode:

import org.apache.spark.sql.functions.explode

val distinctKeys = df
  // Flatten the column into key, value columns
 .select(explode($"alpha"))
 .select($"key")
 .as[String].distinct
 .collect.sorted

select:

ds.select($"id" +: distinctKeys.map(x => $"alpha".getItem(x).alias(x)): _*)

这篇关于如何从SparkSQL DataFrame中的MapType列获取键和值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆