如何在 DataFrame 中展开数组(来自 JSON)? [英] How to unwind array in DataFrame (from JSON)?

查看:34
本文介绍了如何在 DataFrame 中展开数组(来自 JSON)?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

RDD 中的每条记录都包含一个 json.我正在使用 SQLContext 从 Json 创建一个 DataFrame,如下所示:

Each record in an RDD contains a json. I'm using SQLContext to create a DataFrame from the Json like this:

val signalsJsonRdd = sqlContext.jsonRDD(signalsJson)

下面是架构.datapayload 是一个项目数组.我想分解项目数组以获得一个数据帧,其中每一行都是来自数据有效负载的一个项目.我试图根据 this 答案做一些事情,但它似乎我需要在 case Row(arr: Array[...]) 语句中对项目的整个结构进行建模.我可能错过了一些东西.

Below is the schema. datapayload is an array of items. I want to explode the array of items to get a dataframe where each row is an item from datapayload. I tried to do something based on this answer, but it seems that I would need to model the entire structure of the item in the case Row(arr: Array[...]) statement. I'm probably missing something.

val payloadDfs = signalsJsonRdd.explode($"data.datapayload"){ 
    case org.apache.spark.sql.Row(arr: Array[String]) =>  arr.map(Tuple1(_)) 
}

上面的代码抛出scala.MatchError,因为实际Row的类型和Row(arr: Array[String])差别很大.可能有一种简单的方法可以做我想做的事,但我找不到.请帮忙.

The above code throws a scala.MatchError, because the type of the actual Row is very different from Row(arr: Array[String]). There is probably a simple way to do what I want, but I can't find it. Please help.

架构如下

signalsJsonRdd.printSchema()

root
 |-- _corrupt_record: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- dataid: string (nullable = true)
 |    |-- datapayload: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Reading: struct (nullable = true)
 |    |    |    |    |-- A2DPActive: boolean (nullable = true)
 |    |    |    |    |-- Accuracy: double (nullable = true)
 |    |    |    |    |-- Active: boolean (nullable = true)
 |    |    |    |    |-- Address: string (nullable = true)
 |    |    |    |    |-- Charging: boolean (nullable = true)
 |    |    |    |    |-- Connected: boolean (nullable = true)
 |    |    |    |    |-- DeviceName: string (nullable = true)
 |    |    |    |    |-- Guid: string (nullable = true)
 |    |    |    |    |-- HandsFree: boolean (nullable = true)
 |    |    |    |    |-- Header: double (nullable = true)
 |    |    |    |    |-- Heading: double (nullable = true)
 |    |    |    |    |-- Latitude: double (nullable = true)
 |    |    |    |    |-- Longitude: double (nullable = true)
 |    |    |    |    |-- PositionSource: long (nullable = true)
 |    |    |    |    |-- Present: boolean (nullable = true)
 |    |    |    |    |-- Radius: double (nullable = true)
 |    |    |    |    |-- SSID: string (nullable = true)
 |    |    |    |    |-- SSIDLength: long (nullable = true)
 |    |    |    |    |-- SpeedInKmh: double (nullable = true)
 |    |    |    |    |-- State: string (nullable = true)
 |    |    |    |    |-- Time: string (nullable = true)
 |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Time: string (nullable = true)
 |    |    |    |-- Type: string (nullable = true)

推荐答案

tl;dr explode 函数是你的朋友(或者我最喜欢的flatMap).

tl;dr explode function is your friend (or my favorite flatMap).

explode 函数为给定数组或映射列中的每个元素创建一个新行.

explode function creates a new row for each element in the given array or map column.

类似下面的内容应该可以工作:

Something like the following should work:

signalsJsonRdd.withColumn("element", explode($"data.datapayload"))

参见函数 对象.

这篇关于如何在 DataFrame 中展开数组(来自 JSON)?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆