Spark Read Json: how to read a field that alternates between integer and struct
Question
I am trying to read multiple JSON files into a DataFrame. Both files have a "Value" node, but the type of this node alternates between integer and struct:
File 1:
{
"Value": 123
}
File 2:
{
"Value": {
"Value": "On",
"ValueType": "State",
"IsSystemValue": true
}
}
My goal is to read the files into a DataFrame like this:
|---------------------|------------------|---------------------|------------------|
| File | Value | ValueType | IsSystemValue |
|---------------------|------------------|---------------------|------------------|
| File1.json | 123 | null | null |
|---------------------|------------------|---------------------|------------------|
| File2.json | On | State | true |
|---------------------|------------------|---------------------|------------------|
It is possible that all of the files read are like File1 and none like File2, vice versa, or a mix of both. This is not known ahead of time. Any ideas?
Recommended answer
See whether this helps:
/**
* test/File1.json
* -----
* {
* "Value": 123
* }
*/
/**
* test/File2.json
* ---------
* {
* "Value": {
* "Value": "On",
* "ValueType": "State",
* "IsSystemValue": true
* }
* }
*/
val path = getClass.getResource("/test" ).getPath
val df = spark.read
.option("multiLine", true)
.json(path)
df.show(false)
df.printSchema()
/**
* +-------------------------------------------------------+
* |Value |
* +-------------------------------------------------------+
* |{"Value":"On","ValueType":"State","IsSystemValue":true}|
* |123 |
* +-------------------------------------------------------+
*
* root
* |-- Value: string (nullable = true)
*/
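Transform the JSON string:
import org.apache.spark.sql.functions._

// Value was inferred as a string, so the struct case can be unpacked with JSON paths.
df.withColumn("File", substring_index(input_file_name(), "/", -1)) // keep only the file name
.withColumn("ValueType", get_json_object(col("Value"), "$.ValueType"))
.withColumn("IsSystemValue", get_json_object(col("Value"), "$.IsSystemValue"))
// For the plain integer case "$.Value" yields null, so fall back to the raw value.
.withColumn("Value", coalesce(get_json_object(col("Value"), "$.Value"), col("Value")))
.show(false)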
/**
* +-----+----------+---------+-------------+
* |Value|File |ValueType|IsSystemValue|
* +-----+----------+---------+-------------+
* |On |File2.json|State |true |
* |123 |File1.json|null |null |
* +-----+----------+---------+-------------+
*/
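The transformation above depends on Spark inferring Value as a string, which happens here because both shapes are present in the input. If every file looks like File2, Value would be inferred as a struct and get_json_object would most likely fail analysis. One possible way to cover all three cases from the question is to pass an explicit schema that declares Value as a string, so that, as far as I know, the JSON reader keeps a nested object as its raw JSON text. The following is only a sketch under that assumption; the object name ReadAlternatingValue, the helper readValues, and the default "test" directory are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical wrapper object, not part of the original answer.
object ReadAlternatingValue {

  // Declare Value as a string up front so the schema does not depend on
  // which kinds of files happen to be in the directory.
  private val schema = StructType(Seq(StructField("Value", StringType, nullable = true)))

  // Hypothetical helper: reads every JSON file under `path` and flattens Value.
  def readValues(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(schema)
      .option("multiLine", true)
      .json(path)
      .withColumn("File", substring_index(input_file_name(), "/", -1))
      .withColumn("ValueType", get_json_object(col("Value"), "$.ValueType"))
      // get_json_object returns a string, so cast to get a real boolean column.
      .withColumn("IsSystemValue",
        get_json_object(col("Value"), "$.IsSystemValue").cast("boolean"))
      // For a plain integer "$.Value" is null, so fall back to the raw value.
      .withColumn("Value",
        coalesce(get_json_object(col("Value"), "$.Value"), col("Value")))
      .select("File", "Value", "ValueType", "IsSystemValue")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("alternating-value")
      .getOrCreate()
    // Assumption: the directory with the JSON files is passed as the first argument.
    readValues(spark, args.headOption.getOrElse("test")).show(false)
    spark.stop()
  }
}
```

Value stays a string column in this sketch because it has to hold both 123 and On; cast it downstream only for rows where the content is known to be numeric.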