How to convert an array extracted from a JSON string field to a BigQuery repeated field?


Problem description

We have loaded JSON blobs into a STRING field in a BigQuery table. I need to create a view (using standard SQL) over the table that extracts the array field as a BigQuery array/repeated field of type RECORD (which itself includes a repeated field).

Here is a sample record (json_blob):

{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}

I am hoping to end up with a view that has the following layout:

[
    {
        "name": "order_id",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "mode": "NULLABLE",
        "name": "customer_id",
        "type": "STRING"
    },
    {
        "mode": "REPEATED",
        "name": "items",
        "type": "RECORD",
        "fields": [
            {
                "mode": "NULLABLE",
                "name": "line",
                "type": "STRING"
            },
            {
                "mode": "REPEATED",
                "name": "ref_ids",
                "type": "STRING"
            },
            {
                "mode": "NULLABLE",
                "name": "sku",
                "type": "STRING"
            },
            {
                "mode": "NULLABLE",
                "name": "amount",
                "type": "INTEGER"
            }
        ]
    }
]

JSON_EXTRACT(json_blob, '$.items') extracts the items part, but how do I convert that to a BigQuery array of type RECORD that can then be processed like a normal BigQuery array (a repeated STRUCT)?

Appreciate any help.

Solution

At the time of this writing, there is no way to do this using only SQL functions in BigQuery unless you can impose a hard limit on the number of values in the JSON array; see the relevant issue tracker item. Your options are:

  • Process the data differently (e.g. using Cloud Dataflow or another tool) so that you can load it from newline-delimited JSON into BigQuery.
  • Use a JavaScript UDF that takes the input JSON and returns the desired type; this is fairly straightforward but generally uses more CPU (and hence may require a higher billing tier).
  • Use SQL functions with the understanding that the solution breaks down if there are too many elements.

Here is the approach using a JavaScript UDF:

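#standardSQL
-- Parse the whole JSON blob in JavaScript; the returned object is mapped
-- onto the declared STRUCT by field name.
-- order_id is declared STRING to match the schema requested in the question.
CREATE TEMP FUNCTION JsonToItems(input STRING)
RETURNS STRUCT<order_id STRING, customer_id STRING, items ARRAY<STRUCT<line STRING, ref_ids ARRAY<STRING>, sku STRING, amount INT64>>>
LANGUAGE js AS """
return JSON.parse(input);
""";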

WITH Input AS (
  SELECT '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}' AS json
)
SELECT
  JsonToItems(json).*
FROM Input;
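
The UDF body above returns whatever JSON.parse produces, so a blob with a missing items key or a numeric sku is passed through unchanged. The following is only a minimal sketch, assuming you would rather normalize the parsed object to the declared field types before returning it. The names jsonToOrder and toStr are hypothetical helpers, not BigQuery functions, and the snippet runs in plain Node.js so the logic can be checked outside BigQuery.

```javascript
// Hypothetical helper: convert a value to a string, keeping null/undefined as null.
function toStr(value) {
  return value == null ? null : String(value);
}

// Hypothetical normalizer mirroring the STRUCT declared in the UDF signature.
// Assumption: missing arrays become empty arrays instead of raising an error.
function jsonToOrder(input) {
  const order = JSON.parse(input);
  const items = Array.isArray(order.items) ? order.items : [];
  return {
    order_id: toStr(order.order_id),
    customer_id: toStr(order.customer_id),
    items: items.map((item) => ({
      line: toStr(item.line),
      ref_ids: Array.isArray(item.ref_ids) ? item.ref_ids.map(String) : [],
      sku: toStr(item.sku),
      amount: item.amount == null ? null : Number(item.amount),
    })),
  };
}

// Sample record from the question.
const sample =
  '{"order_id":"123456","customer_id":"2abcd", "items":[' +
  '{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },' +
  '{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}';

console.log(JSON.stringify(jsonToOrder(sample), null, 2));
```

If this behavior suits your data, the body of jsonToOrder (together with toStr) could be placed inside the LANGUAGE js block in place of the single return statement, ending with return jsonToOrder(input);.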

If you do want to try the SQL-based approach without JavaScript, here is something of a hack that works until the feature request above is resolved, provided that no array has more than 10 elements:

#standardSQL
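-- Probe fixed indexes 0-9 of ref_ids; a missing index yields NULL and is
-- dropped by IGNORE NULLS, so any element past index 9 is silently lost.
CREATE TEMP FUNCTION JsonExtractRefIds(json STRING) AS (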
  (SELECT ARRAY_AGG(v IGNORE NULLS)
   FROM UNNEST([
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[0]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[1]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[2]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[3]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[4]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[5]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[6]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[7]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[8]'),
     JSON_EXTRACT_SCALAR(json, '$.ref_ids[9]')]) AS v)
);

CREATE TEMP FUNCTION JsonToItem(json STRING)
RETURNS STRUCT<line STRING, ref_ids ARRAY<STRING>, sku STRING, amount INT64>
AS (
  IF(json IS NULL, NULL,
    STRUCT(
      JSON_EXTRACT_SCALAR(json, '$.line'),
      JsonExtractRefIds(json),
      JSON_EXTRACT_SCALAR(json, '$.sku'),
      CAST(JSON_EXTRACT_SCALAR(json, '$.amount') AS INT64)
    )
  )
);

CREATE TEMP FUNCTION JsonToItems(json STRING) AS (
  (SELECT AS STRUCT
    -- Kept as STRING to match the schema requested in the question.
    JSON_EXTRACT_SCALAR(json, '$.order_id') AS order_id,
    JSON_EXTRACT_SCALAR(json, '$.customer_id') AS customer_id,
    (SELECT ARRAY_AGG(v IGNORE NULLS)
     FROM UNNEST([
       JsonToItem(JSON_EXTRACT(json, '$.items[0]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[1]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[2]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[3]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[4]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[5]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[6]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[7]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[8]')),
       JsonToItem(JSON_EXTRACT(json, '$.items[9]'))]) AS v) AS items
  )
);

WITH Input AS (
  SELECT '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}' AS json
)
SELECT
  JsonToItems(json).*
FROM Input;
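
The SQL functions above probe a fixed list of indexes and drop the NULLs, which is why the approach breaks down when an array has more elements than were written out. The sketch below only imitates that pattern in plain JavaScript to make the failure mode visible. The name extractUpTo is hypothetical and stands in for the hand-written list of JSON_EXTRACT_SCALAR calls, which is an analogy rather than BigQuery's actual implementation.

```javascript
// Hypothetical stand-in for the fixed-index SQL pattern: look at indexes
// 0..limit-1 only, then discard the slots that had no value.
function extractUpTo(values, limit) {
  const probed = [];
  for (let i = 0; i < limit; i++) {
    // An index past the end yields undefined, analogous to the NULL that
    // JSON_EXTRACT_SCALAR returns for a path that does not exist.
    probed.push(values[i]);
  }
  // Analogous to ARRAY_AGG(v IGNORE NULLS).
  return probed.filter((v) => v !== undefined && v !== null);
}

// Twelve elements, but only ten indexes are probed.
const refIds = Array.from({ length: 12 }, (_, i) => `ref${i}`);
const kept = extractUpTo(refIds, 10);

console.log(kept.length);                 // 10
console.log(refIds.length - kept.length); // 2 elements dropped without any error
```

Short arrays come back complete, while longer ones are truncated without warning, so it is worth checking the maximum array length in your data before relying on the fixed-index approach.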
