如何在DataFrame中展开数组(从JSON)? [英] How to unwind array in DataFrame (from JSON)?
问题描述
RDD中的每个记录都包含一个json.我正在使用SQLContext这样从Json创建DataFrame:
Each record in an RDD contains a json. I'm using SQLContext to create a DataFrame from the Json like this:
val signalsJsonRdd = sqlContext.jsonRDD(signalsJson)
下面是模式. datapayload是一组项目.我想分解项目数组以获取一个数据框,其中每一行都是datapayload中的一个项目.我试图根据此答案来做某事,但是它似乎我需要在 case Row(arr:Array [...])语句中为项目的整个结构建模.我可能错过了一些东西.
Below is the schema. datapayload is an array of items. I want to explode the array of items to get a dataframe where each row is an item from datapayload. I tried to do something based on this answer, but it seems that I would need to model the entire structure of the item in the case Row(arr: Array[...]) statement. I'm probably missing something.
val payloadDfs = signalsJsonRdd.explode($"data.datapayload"){
case org.apache.spark.sql.Row(arr: Array[String]) => arr.map(Tuple1(_))
}
上面的代码引发了scala.MatchError,因为实际Row的类型与Row(arr:Array [String])完全不同.可能有一种简单的方法可以执行我想要的操作,但是我找不到它.请帮忙.
The above code throws a scala.MatchError, because the type of the actual Row is very different from Row(arr: Array[String]). There is probably a simple way to do what I want, but I can't find it. Please help.
下面给出的模式
signalsJsonRdd.printSchema()
root
|-- _corrupt_record: string (nullable = true)
|-- data: struct (nullable = true)
| |-- dataid: string (nullable = true)
| |-- datapayload: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Reading: struct (nullable = true)
| | | | |-- A2DPActive: boolean (nullable = true)
| | | | |-- Accuracy: double (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Address: string (nullable = true)
| | | | |-- Charging: boolean (nullable = true)
| | | | |-- Connected: boolean (nullable = true)
| | | | |-- DeviceName: string (nullable = true)
| | | | |-- Guid: string (nullable = true)
| | | | |-- HandsFree: boolean (nullable = true)
| | | | |-- Header: double (nullable = true)
| | | | |-- Heading: double (nullable = true)
| | | | |-- Latitude: double (nullable = true)
| | | | |-- Longitude: double (nullable = true)
| | | | |-- PositionSource: long (nullable = true)
| | | | |-- Present: boolean (nullable = true)
| | | | |-- Radius: double (nullable = true)
| | | | |-- SSID: string (nullable = true)
| | | | |-- SSIDLength: long (nullable = true)
| | | | |-- SpeedInKmh: double (nullable = true)
| | | | |-- State: string (nullable = true)
| | | | |-- Time: string (nullable = true)
| | | | |-- Type: string (nullable = true)
| | | |-- Time: string (nullable = true)
| | | |-- Type: string (nullable = true)
推荐答案
tl; dr explode
函数是您的朋友(或我最喜欢的flatMap
).
tl;dr explode
function is your friend (or my favorite flatMap
).
explode
函数为给定数组或映射列中的每个元素创建一个新行.
explode
function creates a new row for each element in the given array or map column.
类似以下的方法应该起作用:
Something like the following should work:
signalsJsonRdd.withColumn("element", explode($"data.datapayload"))
请参见功能对象.
这篇关于如何在DataFrame中展开数组(从JSON)?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!