Parse JSON root in a column using Spark-Scala

Problem description

I have a problem transforming the root of a JSON document into records in a data frame, for an undetermined number of records.

I have a data frame generated from JSON similar to the following:

import spark.implicits._  // provides the Encoder[String] needed by createDataset

val exampleJson = spark.createDataset(
  """
  {"ITEM1512":
        {"name":"Yin",
         "address":{"city":"Columbus",
                    "state":"Ohio"}
                    },
    "ITEM1518":
        {"name":"Yang",
         "address":{"city":"Working",
                    "state":"Marc"}
                    }
  }""" :: Nil)

When I read it with the following instruction:

val itemsExample = spark.read.json(exampleJson)

The resulting schema and data frame are the following:

+-----------------------+-----------------------+
|ITEM1512               |ITEM1518               |
+-----------------------+-----------------------+
|[[Columbus, Ohio], Yin]|[[Working, Marc], Yang]|
+-----------------------+-----------------------+

root
 |-- ITEM1512: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- ITEM1518: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- name: string (nullable = true)

But I want to generate something like this:

+-----------------------+-----------------------+
|Item                   |Values                 |
+-----------------------+-----------------------+
|ITEM1512               |[[Columbus, Ohio], Yin]|
|ITEM1518               |[[Working, Marc], Yang]|
+-----------------------+-----------------------+

So, in order to parse this JSON data I need to read all the columns and add each one as a record in the data frame, because there are more than the two items I wrote as an example. In fact, there are millions of items that I'd like to add to a data frame.

I'm trying to replicate the solution found here: How to parse the JSON data using Spark-Scala, with this code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

val columns: Array[String] = itemsExample.columns
var arrayOfDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  // explode each top-level column, then keep its name and its value
  val temp = itemsExample.selectExpr("explode(" + col_name + ") as element")
    .select(
      lit(col_name).as("Item"),
      col("element.E").as("Value"))

  arrayOfDFs = arrayOfDFs :+ temp
}

val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)

But I hit a problem: in the example from the other question the root is an array, while in my case the root is a StructType. Therefore the following exception is thrown:

org.apache.spark.sql.AnalysisException: cannot resolve 'explode(ITEM1512)' due to data type mismatch: input to function explode should be array or map type, not struct<address:struct<city:string,state:string>,name:string>
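For reference, explode does accept map columns, so one possible workaround (a sketch, separate from the accepted answer below) is to parse the raw JSON string with from_json and an explicit MapType, turning each top-level key into a map entry; itemSchema below is a hand-written schema assumed to match the example:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Hand-written schema for a single item, mirroring the inferred struct above.
val itemSchema = StructType(Seq(
  StructField("address", StructType(Seq(
    StructField("city", StringType),
    StructField("state", StringType)))),
  StructField("name", StringType)))

// Read the whole document as a map from item id to item struct,
// then explode the map into (key, value) rows.
exampleJson
  .select(from_json(col("value"), MapType(StringType, itemSchema)).as("m"))
  .select(explode(col("m")).as(Seq("Item", "Values")))
  .show(false)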

Recommended answer

You can use the stack function: stack(n, name1, value1, ..., namen, valuen) produces n rows of (name, value) pairs.

Example:

itemsExample.selectExpr("""stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)""")
  .show(false)
//+--------+-----------------------+
//|Item    |Values                 |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+


Update:

A dynamically built stack query (this assumes all top-level columns share the same struct type, which stack requires):

val stack = itemsExample.columns.map(x => s"'${x}',${x}").mkString(s"stack(${itemsExample.columns.size},", ",", ") as (Item,Values)")
//stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)

itemsExample.selectExpr(stack).show(false)
//+--------+-----------------------+
//|Item    |Values                 |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
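If the struct then needs to be flattened into plain columns, a hedged follow-up select (field names taken from the example schema above) could be:

import org.apache.spark.sql.functions.col

itemsExample.selectExpr(stack)
  .select(
    col("Item"),
    col("Values.name").as("name"),
    col("Values.address.city").as("city"),
    col("Values.address.state").as("state"))
  .show(false)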
