如何取消嵌套在以字符串形式存储的bigquery字段中的多个数组? [英] How to unnest multiple arrays in bigquery field that is stored as a string?

查看:34
本文介绍了如何取消嵌套在以字符串形式存储的bigquery字段中的多个数组?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要帮助来嵌套一个具有多个数组并且在嵌套对象中重复同一字段的字段(数量在嵌套对象的内部和外部重复).

I need help to unnest a field that has multiple arrays and same field repeated in nested objects (quantity is repeated outside and inside nested object).

数据集中有2个字段:order_id和cart,其中cart是具有多个列表的字典对象,其中包括列表"items"中的列表,但是cart的数据类型为字符串.我希望输出是每个产品和类别的单独行.

There are 2 fields in the dataset: order_id and cart, where cart is a dictionary object with multiple lists including lists within a list "items" in it, but the datatype of a cart is string. I would like the output to be individual row for each product and category.

使用部分工作的查询对数据进行采样.

Sample data with the partially working query.

#standardSQL
WITH t AS (
    SELECT "order1234" as order_id, '{ "_id" : "cart1234" , "taxRate" : 0.0 , "items" : [{ "quantity" : 1 , "product" : { "_id" : "prod1" , "categoryIds" : [ "cat1", "cat2", "cat3"] , "name" : "Product 1" , "pricing" : { "listPrice" : { "value" : 899} , "salePrice" : { "value" : 725}}, "imagedata" : { "imageLink" : { "_id" : "img1" , "createdOn" : { "$date" : "2019-01-19T19:55:19.782Z"} , "revision" : 1} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var1" , "sku" : { "value" : "sku1" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null }}] , "Shipping" : true}} , { "quantity" : 2 , "product" : { "_id" : "prod2" , "categoryIds" : [ "cat2", "cat4"] , "name" : "Product 2" , "pricing" : { "listPrice" : { "value" : 199} , "salePrice" : { "value" : 150}}, "imagedata" : { "imageLink" : { "_id" : "img2" , "createdOn" : { "$date" : "2019-01-19T19:58:11.484Z"} ,  "revision" : 1} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var2" , "sku" : { "value" : "sku2" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null }}] , "Shipping" : true}} , { "quantity" : 3 , "product" : { "_id" : "prod3" , "categoryIds" : [ "cat2","cat4","cat5"] , "name" : "Product 3" , "pricing" : { "listPrice" : { "value" : 499} , "salePrice" : { "value" : 325}}, "imagedata" : { "imageLink" : { "_id" : "img3" , "createdOn" : { "$date" : "2019-01-15T05:34:17.556Z"} , "revision" : 3} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var3" , "sku" : { "value" : "sku3" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null }}], "Shipping" : true }}]}' as cart
)
select order_id, quantity, product, JSON_EXTRACT_SCALAR(product,'$._id') as product_id, REPLACE(category_id, '"', '') category_id, 
JSON_EXTRACT_SCALAR(product,'$.pricing.listPrice.value') as product_list_price,
JSON_EXTRACT_SCALAR(product,'$.pricing.salePrice.value') as product_sale_price
from t,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"categoryIds" : \[(.+?)]')) categoryIds WITH OFFSET pos1,
UNNEST(SPLIT(categoryIds)) category_id,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"product" : (.*?)\}')) product WITH OFFSET pos2,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"quantity" : (.+?)')) quantity WITH OFFSET pos3
where pos1= pos2 and pos1 = pos3

在上述查询中,数量字段不正确,并且product_list_price广告和product_sale_price未显示.请记住在嵌套元素中重复的数量.我认为我的正则表达式是错误的,因此我需要以某种方式在每个项目"中选择第一个数量",并且就价格而言,我的产品正则表达式没有给我完整的产品词典,这就是为什么它们返回为null的原因.知道产品密钥内部可能有多个{}的情况下,要获得产品密钥的完整价值的正确正则表达式是什么?

In the above query, quantity field is incorrect and product_list_price ad product_sale_price are not showing up. Keep in mind quantity in repeated in nested elements. I am thinking my regex is wrong and somehow I need to pick the first "quantity" within each "items", and for price my regex for product doesnt give me full product dictionary thats why they are returned as null. What is the correct Regex for getting the complete value for the product key knowing there could be several { } inside the product key?

预期结果

order_id  quantity  product_id  category_id  product_list_price   product_sale_price
order1234    1     prod1        cat1             899                 799
order1234    1     prod1        cat2             899                 799
order1234    1     prod1        cat3             899                 799
order1234    2     prod2        cat2             199                 150
order1234    2     prod2        cat4             199                 150
order1234    3     prod3        cat2             499                 399 
order1234    3     prod3        cat4             499                 399
order1234    3     prod3        cat5             499                 399

推荐答案

要知道产品密钥内可能有多个{},要获得产品密钥的完整值的正确正则表达式是什么?

What is the correct Regex for getting the complete value for the product key knowing there could be several { } inside the product key?

理想情况下,应该使用JSON_EXTRACT(而不是REGEXP_EXTRACT-这会使事情变得过于复杂).但是很不幸,BigQuery的JSON_EXTRACT有一些限制,不允许处理JSON数组

Ideally, there should be JSON_EXTRACT used (and not REGEXP_EXTRACT - which overcomplicates things). But unfortunately BigQuery's JSON_EXTRACT has some limitation that does not allow processing JSON arrays

要克服JsonPath的BigQuery限制",您可以使用 https://code下载. google.com/archive/p/jsonpath/downloads 并上传到Google Cloud Storage-gs://your_bucket/jsonpath-0.8.0.js

To overcome BigQuery "limitation" for JsonPath, you can use custom function as below example shows:
It uses jsonpath-0.8.0.js that can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage - gs://your_bucket/jsonpath-0.8.0.js

#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
        return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
    library="gs://your_bucket/jsonpath-0.8.0.js"
);
SELECT order_id, quantity, product_id, category_id
FROM `project.dataset.table`,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].quantity')) quantity WITH OFFSET pos1,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product._id')) product_id WITH OFFSET pos2,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product.categoryIds')) category_ids WITH OFFSET pos3,
UNNEST(SPLIT(category_ids)) category_id
WHERE pos1 = pos2 AND pos1 = pos3

您可以使用您提供的示例数据进行上述测试:

You can test, play with above using sample data you provided:

#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
        return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
    library="gs://your_bucket/jsonpath-0.8.0.js"
);
WITH t AS (
    SELECT "order1234" AS order_id, '''{ "_id" : "cart1234" , "taxRate" : 0.0 , "items" : [
      { "quantity" : 1 , "product" : { "_id" : "prod1" , "categoryIds" : [ "cat1", "cat2", "cat3"] , "name" : "Product 1" , "imagedata" : { "imageLink" : { "_id" : "img1" , "createdOn" : { "$date" : "2019-01-19T19:55:19.782Z"} , "revision" : 1} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var1" , "sku" : { "value" : "sku1" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null }}] , "Shipping" : true}} , 
      { "quantity" : 2 , "product" : { "_id" : "prod2" , "categoryIds" : [ "cat2", "cat4"] , "name" : "Product 2" , "imagedata" : { "imageLink" : { "_id" : "img2" , "createdOn" : { "$date" : "2019-01-19T19:58:11.484Z"} ,  "revision" : 1} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var2" , "sku" : { "value" : "sku2" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null }}] , "Shipping" : true}} , 
      { "quantity" : 3 , "product" : { "_id" : "prod3" , "categoryIds" : [ "cat2","cat4","cat5"] , "name" : "Product 3" , "imagedata" : { "imageLink" : { "_id" : "img3" , "createdOn" : { "$date" : "2019-01-15T05:34:17.556Z"} , "revision" : 3} , "title" : "" , "description" : "" , "altText" : ""} , "variants" : [ ] , "productVariants" : [ { "_id" : "var3" , "sku" : { "value" : "sku3" , "modifier" : 0} , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" : { "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null }}], "Shipping" : true }}
    ]}''' AS cart    
)
SELECT order_id, quantity, product_id, category_id
FROM t,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].quantity')) quantity WITH OFFSET pos1,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product._id')) product_id WITH OFFSET pos2,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product.categoryIds')) category_ids WITH OFFSET pos3,
UNNEST(SPLIT(category_ids)) category_id
WHERE pos1 = pos2 AND pos1 = pos3

有结果

Row order_id    quantity    product_id  category_id  
1   order1234   1           prod1       cat1     
2   order1234   1           prod1       cat2     
3   order1234   1           prod1       cat3     
4   order1234   2           prod2       cat2     
5   order1234   2           prod2       cat4     
6   order1234   3           prod3       cat2     
7   order1234   3           prod3       cat4     
8   order1234   3           prod3       cat5     

注意:样本数据中不存在product_list_priceproduct_sale_price,因此不在上述结果中.但是现在查询非常干净和简单,因此希望您能够轻松添加这些

Note: product_list_price and product_sale_price are not present in your sample daat, thus it is not in the above result. But now query is extremely clean and simple, so hope you will be able to easily add those

这篇关于如何取消嵌套在以字符串形式存储的bigquery字段中的多个数组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆