Query JSON data column using Spark DataFrames, but not sure about its schema


Problem Description

I have a particular JSON file which is a log generated by PostgreSQL. I turned this JSON file from multi-line format into single-line format. One of the fields in the DataFrame that I parse is a string column, and this string column is itself in JSON format, for example:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "671.64"
    },
    "ordering_operation": {
      "using_filesort": true,
      "cost_info": {
        "sort_cost": "100.00"
      },
      "table": {
        "table_name": "test1",
        "access_type": "range",
        "possible_keys": [
          "PRIMARY"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "id"
        ],
        "key_length": "4",
        "rows_examined_per_scan": 100,
        "rows_produced_per_join": 100,
        "filtered": "100.00",
        "cost_info": {
          "read_cost": "21.64",
          "eval_cost": "20.00",
          "prefix_cost": "41.64",
          "data_read_per_join": "18K"
        },
        "used_columns": [
          "id",
          "c"
        ],
        "attached_condition": "(`proxydemo`.`test1`.`id` between 501747 and <cache>((504767 + 99)))"
      }
    }
  }
}

I know that in Spark 2.0+ I can use

  from_json(e: Column, schema: StructType): Column

from the SparkSQL functions. But I am not sure what the schema for this string should be. I have written many schema and StructType definitions, but this one is hierarchical, and I do not understand how such a schema should be defined.
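
For reference, a minimal from_json call looks like the sketch below; the DataFrame df, the input column name message, and the one-field placeholder schema are all hypothetical:

  import org.apache.spark.sql.functions.{col, from_json}
  import org.apache.spark.sql.types._

  // Parse the string column "message" against a placeholder schema;
  // rows whose JSON does not parse come back as a null struct
  val placeholderSchema = (new StructType).add("query_block", StringType)
  val parsed = df.withColumn("parsed", from_json(col("message"), placeholderSchema))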

Recommended Answer

I found out how nested schemas work.

In this particular example, the schemas go like this:

For the root of the object:

  import org.apache.spark.sql.types._

  // Root level: each nested object is declared as a plain string,
  // so it can be parsed again one level at a time
  val query_block_schema = (new StructType)
      .add("select_id", LongType)
      .add("cost_info", StringType)
      .add("ordering_operation", StringType)
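
The point of typing cost_info and ordering_operation as StringType is that from_json leaves them as raw JSON strings, which can then be fed to from_json again with their own schemas. Purely as an illustration of the same pattern one level down, a hypothetical schema for the ordering_operation string could look like:

  // Hypothetical: only needed if "ordering_operation" is parsed further
  val ordering_operation_schema = (new StructType)
    .add("using_filesort", BooleanType)
    .add("cost_info", StringType)
    .add("table", StringType)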

For the second level:

  // Second level: the outer wrapper, keeping "query_block" as a raw string
  val query_plan_schema = (new StructType)
    .add("query_block", StringType)

And so on...

So I consider this problem solved. Later on, I merge all of these together, where they are not null, and basically flatten the whole nested object.
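
Put together, the chain could look like the sketch below; the input column name json_col is a hypothetical placeholder, and only the top-level fields are flattened for brevity:

  import org.apache.spark.sql.functions.{col, from_json}

  // 1) Parse the outer wrapper; "query_block" stays a raw JSON string
  val step1 = df.withColumn("plan", from_json(col("json_col"), query_plan_schema))

  // 2) Parse that string one level deeper with the root schema
  val step2 = step1.withColumn("block", from_json(col("plan.query_block"), query_block_schema))

  // 3) Flatten the parsed fields into top-level columns
  val flat = step2.select(
    col("block.select_id").as("select_id"),
    col("block.cost_info").as("cost_info"),
    col("block.ordering_operation").as("ordering_operation")
  )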
