Nested JSON in Spark

Views: 213. Published: 2016/5/22 16:03:28. Tags: scala, apache-spark, apache-spark-sql, spark-dataframe


Question

I have the following JSON loaded as a DataFrame:

root
 |-- data: struct (nullable = true)
 |    |-- field1: string (nullable = true)
 |    |-- field2: string (nullable = true)
 |-- moreData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- more1: string (nullable = true)
 |    |    |-- more2: string (nullable = true)
 |    |    |-- more3: string (nullable = true)

I want to get the following RDD from this DataFrame:

RDD[(more1, more2, more3, field1, field2)]

How can I do this? I assume I have to use flatMap on the nested JSON?

Recommended answer

A combination of explode and dot syntax should do the trick:

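// toDF and the $"..." column syntax need the SQL implicits. spark-shell imports
// them automatically; in a standalone application, import sqlContext.implicits._
// (spark.implicits._ on Spark 2.x and later).
import org.apache.spark.sql.functions.explode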

case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)

val df = sc.parallelize(Seq(
  (Data("foo", "bar"), Array(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
)).toDF("data", "moreData")

df.printSchema
// root
//  |-- data: struct (nullable = true)
//  |    |-- field1: string (nullable = true)
//  |    |-- field2: string (nullable = true)
//  |-- moreData: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- more1: string (nullable = true)
//  |    |    |-- more2: string (nullable = true)
//  |    |    |-- more3: string (nullable = true)

val columns = Seq(
  $"moreData.more1", $"moreData.more2", $"moreData.more3",
  $"data.field1", $"data.field2")

val aRDD = df.withColumn("moreData", explode($"moreData"))
  .select(columns: _*)
  .rdd

aRDD.collect
// Array[org.apache.spark.sql.Row] = Array([a,b,c,foo,bar], [d,e,f,foo,bar])
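
Since the question asks about flatMap, it may help to see what explode followed by select amounts to in those terms. The following is a minimal sketch using plain Scala collections only (no Spark), so it illustrates the idea rather than the DataFrame API itself. The Record case class and the records value are hypothetical names introduced for this illustration.

```scala
// Plain Scala collections only; no Spark required.
case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)

// Hypothetical wrapper mirroring the schema: one struct plus an array of structs.
case class Record(data: Data, moreData: Seq[MoreData])

val records = Seq(
  Record(Data("foo", "bar"), Seq(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
)

// flatMap emits one tuple per element of the nested sequence and repeats
// the parent fields each time, much like explode followed by select.
val flattened: Seq[(String, String, String, String, String)] =
  records.flatMap { r =>
    r.moreData.map(m => (m.more1, m.more2, m.more3, r.data.field1, r.data.field2))
  }

flattened.foreach(println)
// (a,b,c,foo,bar)
// (d,e,f,foo,bar)
```

As with explode, a record whose nested sequence is empty contributes no output rows in this sketch.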

Depending on your requirements, you can follow this with map to extract values from the rows:

import org.apache.spark.sql.Row

aRDD.map{case Row(m1: String, m2: String, m3: String, f1: String, f2: String) =>
  (m1, m2, m3, f1, f2)}
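
One caveat with the pattern match above: every field in the schema is nullable, and a typed pattern such as m1: String does not match null, so a row with a missing value would raise a scala.MatchError. A possible null-tolerant variant, assuming Option values are acceptable downstream, is sketched below. The Opt5 alias, the toOptTuple helper, and the sample row are hypothetical names for this illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.Row

// Hypothetical alias for a five-field tuple of optional strings.
type Opt5 = (Option[String], Option[String], Option[String], Option[String], Option[String])

// Hypothetical helper: reads each field by position and wraps it in Option,
// so a null field becomes None instead of failing the match.
def toOptTuple(row: Row): Opt5 = (
  Option(row.getAs[String](0)),
  Option(row.getAs[String](1)),
  Option(row.getAs[String](2)),
  Option(row.getAs[String](3)),
  Option(row.getAs[String](4))
)

// A hand-built row with one null field, used here only for demonstration.
val sample = Row("a", null, "c", "foo", "bar")
val converted = toOptTuple(sample)
```

With the RDD from the answer this could be applied as aRDD.map(toOptTuple). Positional access depends on the column order given in the select, so the indices would need to change if that order does.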

See also: Querying Spark SQL DataFrame with complex types
