Query JSON data column using Spark DataFrames, but not sure about its schema


Problem Description

I have a particular JSON file which is a log generated by PostgreSQL. I turned this JSON file from multi-line format into single-line format. One of the fields in the DataFrame that I parse is a string column, and this string column is itself in JSON format, for example:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "671.64"
    },
    "ordering_operation": {
      "using_filesort": true,
      "cost_info": {
        "sort_cost": "100.00"
      },
      "table": {
        "table_name": "test1",
        "access_type": "range",
        "possible_keys": [
          "PRIMARY"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "id"
        ],
        "key_length": "4",
        "rows_examined_per_scan": 100,
        "rows_produced_per_join": 100,
        "filtered": "100.00",
        "cost_info": {
          "read_cost": "21.64",
          "eval_cost": "20.00",
          "prefix_cost": "41.64",
          "data_read_per_join": "18K"
        },
        "used_columns": [
          "id",
          "c"
        ],
        "attached_condition": "(`proxydemo`.`test1`.`id` between 501747 and <cache>((504767 + 99)))"
      }
    }
  }
}

I know that in Spark 2.0+ I can use

  from_json(e: Column, schema: StructType): Column

from the SparkSQL functions. But I am not sure what the schema for this string should be. I have written many schema and StructType definitions, but this one is hierarchical, and I do not understand how such a schema should be defined.
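
For reference, a minimal from_json call looks like the sketch below; the DataFrame df, the input column name message, and the one-field placeholder schema are all hypothetical:

  import org.apache.spark.sql.functions.{col, from_json}
  import org.apache.spark.sql.types._

  // Parse the string column "message" against a placeholder schema;
  // rows whose JSON does not parse come back as a null struct
  val placeholderSchema = (new StructType).add("query_block", StringType)
  val parsed = df.withColumn("parsed", from_json(col("message"), placeholderSchema))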

Recommended Answer

I found out how nested schemas work.

In this particular example, the schemas go like this:

For the root of the object:

  import org.apache.spark.sql.types._

  // Root level: each nested object is declared as a plain string,
  // so it can be parsed again one level at a time
  val query_block_schema = (new StructType)
      .add("select_id", LongType)
      .add("cost_info", StringType)
      .add("ordering_operation", StringType)
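
The point of typing cost_info and ordering_operation as StringType is that from_json leaves them as raw JSON strings, which can then be fed to from_json again with their own schemas. Purely as an illustration of the same pattern one level down, a hypothetical schema for the ordering_operation string could look like:

  // Hypothetical: only needed if "ordering_operation" is parsed further
  val ordering_operation_schema = (new StructType)
    .add("using_filesort", BooleanType)
    .add("cost_info", StringType)
    .add("table", StringType)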

For the second level:

  // Second level: the outer wrapper, keeping "query_block" as a raw string
  val query_plan_schema = (new StructType)
    .add("query_block", StringType)

And so on...

So I consider this problem solved. Later on, I merge all of these together, where they are not null, and basically flatten the whole nested object.
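
Put together, the chain could look like the sketch below; the input column name json_col is a hypothetical placeholder, and only the top-level fields are flattened for brevity:

  import org.apache.spark.sql.functions.{col, from_json}

  // 1) Parse the outer wrapper; "query_block" stays a raw JSON string
  val step1 = df.withColumn("plan", from_json(col("json_col"), query_plan_schema))

  // 2) Parse that string one level deeper with the root schema
  val step2 = step1.withColumn("block", from_json(col("plan.query_block"), query_block_schema))

  // 3) Flatten the parsed fields into top-level columns
  val flat = step2.select(
    col("block.select_id").as("select_id"),
    col("block.cost_info").as("cost_info"),
    col("block.ordering_operation").as("ordering_operation")
  )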
