将任何JSON多次嵌套结构转换为KEY和VALUE字段 [英] Convert any JSON, multiple-times nested structure into the KEY and VALUE fields

查看:83
本文介绍了将任何JSON多次嵌套结构转换为KEY和VALUE字段的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我被要求在Azure中建立ETL管道.该管道应该

I was requested to build an ETL pipeline in Azure. This pipeline should

  1. 读取供应商提交给ADLS的ORC文件
  2. 解析存在于JSON结构中的ORC结构中存在的PARAMS字段,并将其作为两个新字段(KEY,VALUE)添加到输出中
  3. 将输出写入Azure SQL数据库

问题是,不同类型的记录使用不同类型的JSON结构.我不想为每个JSON结构类编写一个自定义表达式(可能有数百个).相反,我正在寻找一种通用机制,该机制将能够将它们解析为输入JSON结构的类型.

The problem is, that there are different types of JSONs structures used by the different types of records. I do not want to write a custom expression per each of the class of JSON struct (there would be like hundreds of them). Rather, I'm looking for a generic mechanism, that will be able to parse them apart of the type of the input JSON structure.

此刻,为了满足此要求,我使用了ADF内置的ORC连接器.当前设计中的过程:

At the moment, to fulfill this requirement, I was using the ADF built-in connector for ORC. The process in its current design:

  1. 使用复制活动读取ORC并将数据移动到Azure SQL数据库
  2. 使用以下TSQL语句作为在1.之后执行的存储过程的一部分,以解析PARAMS字段内容

  1. Use a copy activity that reads ORC and moves data to Azure SQL database
  2. Use the following TSQL statement as part of stored procedure executed after the 1. to parse the PARAMS field content

SELECT uuid, 
       AttrName = a1.[key] + 
                    COALESCE('.' + a2.[key], '') + 
                    COALESCE('.' + a3.[key], '') + 
                    COALESCE('.' + a4.[key], ''), 
       AttrValue = COALESCE(a4.value, a3.value, a2.value, a1.value)
FROM ORC.EventsSnapshot_RawData
     OUTER APPLY OPENJSON(params) a1
                                  OUTER APPLY
(
    SELECT [key], 
           value, 
           type
    FROM OPENJSON(a1.value)
    WHERE ISJSON(a1.value) = 1
) a2
  OUTER APPLY
(
    SELECT [key], 
           value, 
           type
    FROM OPENJSON(a2.value)
    WHERE ISJSON(a2.value) = 1
) a3
  OUTER APPLY
(
    SELECT [key], 
           value, 
           type
    FROM OPENJSON(a3.value)
    WHERE ISJSON(a3.value) = 1
) a4

所需的OUTER APPLY语句数是通过在PARAMS字段值中计算"["的出现次数来确定的,然后用于动态生成通过sp_executesql

The number of required OUTER APPLY statements is determined at the beginning by counting occurrences of "[" in the PARAMS field value and then used to dynamically generate the SQL executed via sp_executesql

不幸的是,这种方法在执行时间方面效率很低,因为大约需要11毫米的记录. 3.5小时完成

Unfortunately, this approach is quite inefficient in terms of execution time, as for 11 MM of records it takes c.a. 3.5 hours to finish

有人建议我使用Data Bricks.好的,所以我:

Someone suggested me to use Data Bricks. Ok, so I:

  1. 使用以下python代码创建笔记本,以从ADLS读取ORC并将其具体化为Data Bricks表

  1. created the notebook with the following python code to read ORC from ADLS and materialize it to Data Bricks table

    orcfile = "/mnt/adls/.../Input/*.orc"
    eventDf = spark.read.orc(orcfile)
    #spark.sql("drop table if exists  ORC.Events_RawData")
    eventDf.write.mode("overwrite").saveAsTable("ORC.Events_Raw")

  1. 现在,我正在尝试查找一个代码,该代码将给出从TSQL OPENJSONs获得的结果.我从使用递归来解析PARAMS属性的Python代码开始,但是,在执行速度方面,它甚至比TSQL效率低.

您能否建议我实现目标的正确方法,即以通用方式将PARAMS属性转换为KEY,VALUE属性?

Can you please suggest me the correct way of achieving the goal, i.e. converting the PARAMS attribute to KEY, VALUE attributes in a generic way?

请在下面找到一个需要标准化为预期结构的示例JSON结构

Please find below a sample JSON structures that needs to be standarized into the expected structure

样品1

    {
    "correlationId": "c3xOeEEQQCCA9sEx7-u6FA",
    "eventCreateTime": "2020-05-12T15:38:23.717Z",
    "time": 1589297903717,
    "owner": {
        "ownergeography": {
            "city": "abc",
            "country": "abc"
        },
        "ownername": {
            "firstname": "abc",
            "lastname": "def"
        },
        "clientApiKey": "xxxxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}

样本2

{
    "correlationIds": [
        {
            "campaignId": "iXyS4z811Rax",
            "correlationId": "b316233807ac68675f37787f5dd83871"
        }
    ],
    "variantId": 1278915,
    "utmCampaign": "",
    "ua.os.major": "8"
    }

示例3

{
    "correlationId": "ls7XmuuiThWzktUeewqgWg",
    "eventCreateTime": "2020-05-12T12:40:20.786Z",
    "time": 1589287220786,
    "modifiedBy": {
        "clientId": null,
        "clientApiKey": "xxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}

样本预期输出 (火花数据帧)

Sample expected output (Spark dataFrame)

推荐答案

好吧,这是您一劳永逸的方法:-)

Well, this is your get all and everything approach :-)

首先,我们创建一个声明的表变量,并用您的样本填充它以模拟您的问题(请尝试在下次提供此变量).

First we create a declared table variable and fill it with your samples to simuate your issue (please try to provide this yourself the next time).

DECLARE @table TABLE(ID INT IDENTITY, AnyJSON NVARCHAR(MAX));
INSERT INTO @table VALUES
(N' {
    "correlationId": "c3xOeEEQQCCA9sEx7-u6FA",
    "eventCreateTime": "2020-05-12T15:38:23.717Z",
    "time": 1589297903717,
    "owner": {
        "ownergeography": {
            "city": "abc",
            "country": "abc"
        },
        "ownername": {
            "firstname": "abc",
            "lastname": "def"
        },
        "clientApiKey": "xxxxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}')
,(N'{
    "correlationIds": [
        {
            "campaignId": "iXyS4z811Rax",
            "correlationId": "b316233807ac68675f37787f5dd83871"
        }
    ],
    "variantId": 1278915,
    "utmCampaign": "",
    "ua.os.major": "8"
    }')
,(N'{
    "correlationId": "ls7XmuuiThWzktUeewqgWg",
    "eventCreateTime": "2020-05-12T12:40:20.786Z",
    "time": 1589287220786,
    "modifiedBy": {
        "clientId": null,
        "clientApiKey": "xxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}');

-查询

WITH recCTE AS
(
    SELECT ID
          ,CAST(1 AS BIGINT) AS ObjectIndex
          ,CAST(N'000' COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) SortString
          ,1 AS NestLevel
          ,CAST(CONCAT(N'Root-',ID,'.') COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonPath
          ,CAST(N'$' COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonKey
          ,CAST(AnyJSON COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonValue 
          ,CAST(CASE WHEN ISJSON(AnyJSON)=1 THEN AnyJSON COLLATE DATABASE_DEFAULT ELSE NULL END AS NVARCHAR(MAX)) AS NestedJSON 
    FROM @table t

    UNION ALL

    SELECT r.ID
          ,ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
          ,CAST(CONCAT(r.SortString,STR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),3)) AS NVARCHAR(MAX))
          ,r.NestLevel+1
          ,CAST(CONCAT(r.JsonPath, A.[key] + N'.') COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
          ,CAST(A.[key] COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
          ,r.JsonValue  COLLATE DATABASE_DEFAULT
          ,CAST(A.[value] COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
    FROM recCTE r
    CROSS APPLY OPENJSON(r.NestedJSON) A
    WHERE ISJSON(r.NestedJSON)=1
)
SELECT ID
      ,JsonPath
      ,JsonKey
      ,NestedJSON AS JsonValue
FROM recCTE 
WHERE ISJSON(NestedJSON)=0
ORDER BY recCTE.ID,SortString;

结果

+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.correlationId.                  | correlationId   | c3xOeEEQQCCA9sEx7-u6FA           |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.eventCreateTime.                | eventCreateTime | 2020-05-12T15:38:23.717Z         |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.time.                           | time            | 1589297903717                    |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownergeography.city.      | city            | abc                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownergeography.country.   | country         | abc                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownername.firstname.      | firstname       | abc                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownername.lastname.       | lastname        | def                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.clientApiKey.             | clientApiKey    | xxxxx                            |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.campaignType.                   | campaignType    | Mobile push                      |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.correlationIds.0.campaignId.    | campaignId      | iXyS4z811Rax                     |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.correlationIds.0.correlationId. | correlationId   | b316233807ac68675f37787f5dd83871 |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.variantId.                      | variantId       | 1278915                          |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.utmCampaign.                    | utmCampaign     |                                  |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.ua.os.major.                    | ua.os.major     | 8                                |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.correlationId.                  | correlationId   | ls7XmuuiThWzktUeewqgWg           |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.eventCreateTime.                | eventCreateTime | 2020-05-12T12:40:20.786Z         |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.time.                           | time            | 1589287220786                    |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.modifiedBy.clientApiKey.        | clientApiKey    | xxx                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.campaignType.                   | campaignType    | Mobile push                      |
+---+----------------------------------------+-----------------+----------------------------------+

简而言之:

  • 我们使用递归CTE来解决这个问题.
  • 查询将测试任何片段(来自OPENJSON[value])是否为有效JSON.
  • 如果片段是有效的,它将走得越来越远.
  • 需要SortString列以获得最终的排序顺序.
  • we use a recursive CTE to walk this down.
  • The query will test any fragment ([value] coming from OPENJSON) for being valid JSON.
  • If the fragment is valid, this walks deeper and deeper.
  • The column SortString is needed to get a final sort order.

如果您有任何未解决的问题,请回来.

Come back, if you have any open questions.

这篇关于将任何JSON多次嵌套结构转换为KEY和VALUE字段的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆