如何将从json字符串字段提取的数组转换为bigquery重复字段? [英] How to convert an array extracted from a json string field to a bigquery Repeated field?
问题描述
我们已经在Bigquery表的String字段中加载了json blob.我需要在表上创建一个视图(使用标准sql),该视图会将数组字段提取为"RECORD"类型的bigquery数组/重复字段(其本身包括重复字段).
We have loaded json blobs in a String field in a Bigquery table. I need to create a view (using standard sql)over the table that would extract the array field as a bigquery array/repeated field of "RECORD" type (which itself includes a repeated field).
这是示例记录(json_blob):
Here is a sample record (json_blob):
{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}
我希望最终得到一个具有以下布局的视图:
I am hoping to end up with a view that has the following layout:
[
{
"name": "order_id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "items",
"type": "RECORD",
"fields": [
{
"mode": "NULLABLE",
"name": "line",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "ref_ids",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "sku",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "amount",
"type": "INTEGER"
}
]
}
]
Json_extract(json_blob,'$ .items')提取项目部分,但是如何将其转换为类型为"RECORD"的bigquery数组,然后可以像正常bigquery数组/重复STRUCT一样对其进行处理?
Json_extract(json_blob, '$.items') extracts the items parts, but how do I convert that to a bigquery array of type "RECORD" which then can be processed like normal bigquery array/repeated of STRUCT?
感谢任何帮助.
推荐答案
在撰写本文时,除非在JSON中对值的数量施加硬限制,否则无法在BigQuery中使用SQL函数来完成此操作大批;请参阅相关的问题跟踪器项.您的选择是:
There is no way to do this using SQL functions in BigQuery at the time of this writing unless you can impose a hard limit on the number of values in the JSON array; see the relevant issue tracker item. Your options are:
- 以不同的方式处理数据(例如,使用Cloud Dataflow或其他工具),以便您可以将以换行符分隔的JSON将其加载到BigQuery中.
- 使用JavaScript UDF,该JavaScript UDF接受输入的JSON并返回所需的类型;这很简单,但是通常使用更多的CPU(因此可能需要更高的计费层).
- 使用SQL函数时应了解,如果元素太多,则解决方案就会失效.
以下是使用JavaScript UDF的方法:
Here is the approach using a JavaScript UDF:
#standardSQL
CREATE TEMP FUNCTION JsonToItems(input STRING)
RETURNS STRUCT<order_id INT64, customer_id STRING, items ARRAY<STRUCT<line STRING, ref_ids ARRAY<STRING>, sku STRING, amount INT64>>>
LANGUAGE js AS """
return JSON.parse(input);
""";
WITH Input AS (
SELECT '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}' AS json
)
SELECT
JsonToItems(json).*
FROM Input;
如果您确实想尝试不带JavaScript的基于SQL的方法,那么在上述功能请求得到解决之前,这里有些骇人听闻,其中数组元素的数量不得超过10:
If you do want to try the SQL-based approach without JavaScript, here's somewhat of a hack until the feature request above is resolved, where the number of array elements must be no more than 10:
#standardSQL
CREATE TEMP FUNCTION JsonExtractRefIds(json STRING) AS (
(SELECT ARRAY_AGG(v IGNORE NULLS)
FROM UNNEST([
JSON_EXTRACT_SCALAR(json, '$.ref_ids[0]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[1]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[2]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[3]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[4]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[5]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[6]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[7]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[8]'),
JSON_EXTRACT_SCALAR(json, '$.ref_ids[9]')]) AS v)
);
CREATE TEMP FUNCTION JsonToItem(json STRING)
RETURNS STRUCT<line STRING, ref_ids ARRAY<STRING>, sku STRING, amount INT64>
AS (
IF(json IS NULL, NULL,
STRUCT(
JSON_EXTRACT_SCALAR(json, '$.line'),
JsonExtractRefIds(json),
JSON_EXTRACT_SCALAR(json, '$.sku'),
CAST(JSON_EXTRACT_SCALAR(json, '$.amount') AS INT64)
)
)
);
CREATE TEMP FUNCTION JsonToItems(json STRING) AS (
(SELECT AS STRUCT
CAST(JSON_EXTRACT_SCALAR(json, '$.order_id') AS INT64) AS order_id,
JSON_EXTRACT_SCALAR(json, '$.customer_id') AS customer_id,
(SELECT ARRAY_AGG(v IGNORE NULLS)
FROM UNNEST([
JsonToItem(JSON_EXTRACT(json, '$.items[0]')),
JsonToItem(JSON_EXTRACT(json, '$.items[1]')),
JsonToItem(JSON_EXTRACT(json, '$.items[2]')),
JsonToItem(JSON_EXTRACT(json, '$.items[3]')),
JsonToItem(JSON_EXTRACT(json, '$.items[4]')),
JsonToItem(JSON_EXTRACT(json, '$.items[5]')),
JsonToItem(JSON_EXTRACT(json, '$.items[6]')),
JsonToItem(JSON_EXTRACT(json, '$.items[7]')),
JsonToItem(JSON_EXTRACT(json, '$.items[8]')),
JsonToItem(JSON_EXTRACT(json, '$.items[9]'))]) AS v) AS items
)
);
WITH Input AS (
SELECT '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}' AS json
)
SELECT
JsonToItems(json).*
FROM Input;
这篇关于如何将从json字符串字段提取的数组转换为bigquery重复字段?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!