Query JSON data column using Spark DataFrames, but not sure about its schema
Problem description
I have a particular JSON file which is a log generated by PostgreSQL. I converted this JSON file from multi-line format to single-line format. One of the fields in the DataFrame that I parse is a string column, and this string column is itself in JSON format, as in this example:
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "671.64"
},
"ordering_operation": {
"using_filesort": true,
"cost_info": {
"sort_cost": "100.00"
},
"table": {
"table_name": "test1",
"access_type": "range",
"possible_keys": [
"PRIMARY"
],
"key": "PRIMARY",
"used_key_parts": [
"id"
],
"key_length": "4",
"rows_examined_per_scan": 100,
"rows_produced_per_join": 100,
"filtered": "100.00",
"cost_info": {
"read_cost": "21.64",
"eval_cost": "20.00",
"prefix_cost": "41.64",
"data_read_per_join": "18K"
},
"used_columns": [
"id",
"c"
],
"attached_condition": "(`proxydemo`.`test1`.`id` between 501747 and <cache>((504767 + 99)))"
}
}
}
}
I know that in Spark 2.0+ I can use the

from_json(e: Column, schema: StructType): Column

function from the SparkSQL functions. But I am not sure what the schema for this string should be. I have written many schema and StructType definitions, but this one is kind of hierarchical, and I do not understand how its schema should be defined!
Recommended answer
I found out how nested schemas work.
In this particular example, the schemas go like this:
For the root of the object:
import org.apache.spark.sql.types._

val query_block_schema = (new StructType)
  .add("select_id", LongType)
  .add("cost_info", StringType)
  .add("ordering_operation", StringType)
For the second layer:
val query_plan_schema = (new StructType)
  .add("query_block", StringType)
And so on...
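Instead of keeping the inner objects as StringType and re-parsing them layer by layer, the whole hierarchy can also be expressed directly with nested StructTypes. Here is a sketch for the sample plan above (the fields and their types are read off that one sample, so treat them as assumptions and trim to what you actually need):

import org.apache.spark.sql.types._

// Costs appear as quoted numbers in the sample ("671.64"), so they are kept as strings.
val cost_info_schema = (new StructType)
  .add("query_cost", StringType)

val table_schema = (new StructType)
  .add("table_name", StringType)
  .add("access_type", StringType)
  .add("possible_keys", ArrayType(StringType))
  .add("key", StringType)
  .add("used_key_parts", ArrayType(StringType))
  .add("key_length", StringType)
  .add("rows_examined_per_scan", LongType)
  .add("rows_produced_per_join", LongType)
  .add("filtered", StringType)
  .add("cost_info", (new StructType)
    .add("read_cost", StringType)
    .add("eval_cost", StringType)
    .add("prefix_cost", StringType)
    .add("data_read_per_join", StringType))
  .add("used_columns", ArrayType(StringType))
  .add("attached_condition", StringType)

val ordering_operation_schema = (new StructType)
  .add("using_filesort", BooleanType)
  .add("cost_info", (new StructType).add("sort_cost", StringType))
  .add("table", table_schema)

val nested_query_plan_schema = (new StructType)
  .add("query_block", (new StructType)
    .add("select_id", LongType)
    .add("cost_info", cost_info_schema)
    .add("ordering_operation", ordering_operation_schema))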
So I consider this problem solved. Later on, I merge all of these together, provided they are not null, and basically flatten the whole nested object; a sketch of that last step follows.
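A minimal sketch of that flattening, assuming the raw plan string sits in a column named plan of a DataFrame df (both names are placeholders here) and using the nested schema sketched above:

import org.apache.spark.sql.functions.{col, from_json}

// Parse the string column once against the nested schema, then flatten the
// fields of interest with dot paths; from_json yields null for rows it cannot parse.
val flattened = df
  .withColumn("plan", from_json(col("plan"), nested_query_plan_schema))
  .select(
    col("plan.query_block.select_id").as("select_id"),
    col("plan.query_block.cost_info.query_cost").as("query_cost"),
    col("plan.query_block.ordering_operation.table.table_name").as("table_name"),
    col("plan.query_block.ordering_operation.table.access_type").as("access_type")
  )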