How to parse each JSON row into columns of a Spark 2 DataFrame?

Question

In my Spark (2.2) DataFrame, each row is a JSON string:

df.head()
//output
//[{"key":"111","event_name":"page-visited","timestamp":1517814315}]

df.show()
//output
//+--------------+
//|         value|
//+--------------+
//|{"key":"111...|
//|{"key":"222...|

I want to parse each JSON row into columns in order to get this result:

key   event_name     timestamp
111   page-visited   1517814315
...

I tried this approach, but it does not give me the expected result:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("key", StringType, true),
  StructField("event_name", StringType, true),
  StructField("timestamp", IntegerType, true)
))

val result = df.withColumn("value", from_json($"value", schema))

Actual:

result.printSchema()
root
 |-- value: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |    |-- event_name: string (nullable = true)
 |    |-- timestamp: integer (nullable = true)

Expected:

result.printSchema()
root
 |-- key: string (nullable = true)
 |-- event_name: string (nullable = true)
 |-- timestamp: integer (nullable = true)

Recommended answer

from_json returns a single struct column, so the parsed fields stay nested under value. You can append select($"value.*") at the end to expand the fields of the struct column into separate top-level columns:

val result = df.withColumn("value", from_json($"value", schema)).select($"value.*")
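
For reference, here is a minimal end-to-end sketch of the same approach that can be pasted into spark-shell or run as a script. The local SparkSession, the app name, and the sample rows are assumptions made for illustration. In particular, the second row's event_name and timestamp are made up, since the question only shows its key.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-to-columns")
  .getOrCreate()
import spark.implicits._ // enables the $"..." column syntax and toDF

// Assumption: sample data shaped like the question, one JSON string per row
// in a column named "value". The second row's values are invented.
val df = Seq(
  """{"key":"111","event_name":"page-visited","timestamp":1517814315}""",
  """{"key":"222","event_name":"page-visited","timestamp":1517814316}"""
).toDF("value")

val schema = StructType(Seq(
  StructField("key", StringType, nullable = true),
  StructField("event_name", StringType, nullable = true),
  StructField("timestamp", IntegerType, nullable = true)
))

// from_json produces one struct column; "value.*" expands its fields
// into top-level columns.
val result = df
  .withColumn("value", from_json($"value", schema))
  .select($"value.*")

result.printSchema()
result.show()
```

Note that from_json returns null for a string it cannot parse, so a malformed row would show up as null in every column after the select.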
