rename invalid keys from JSON

Problem description

I have the following flow in NiFi; the JSON has 1000+ objects in it.

InvokeHTTP -> SplitJson -> PutMongo

The flow works fine until I receive some keys in the JSON with "." in the name, e.g. "spark.databricks.acl.dfAclsEnabled".

My current solution is not optimal: I have jotted down the bad keys and am using multiple ReplaceText processors to replace "." with "_". I am not using regex; I am using string-literal find/replace. So each time I get a failure in the PutMongo processor, I insert a new ReplaceText processor.
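For example, one of those ReplaceText processors is configured like this (one processor per bad key; the values here are illustrative):

ReplaceText (Replacement Strategy = Literal Replace):
  Search Value:       spark.databricks.acl.dfAclsEnabled
  Replacement Value:  spark_databricks_acl_dfAclsEnabled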

This is not maintainable. I am wondering if I can use JOLT for this? A couple of notes regarding the input JSON:

1) No set structure; the only thing that is confirmed is that everything will be in an events array. The event objects themselves are free-form.

2) Maximum list size = 1000.

3) It is 3rd-party JSON, so I can't ask for a change in format.

Also, keys with "." can appear anywhere, so I am looking for a JOLT spec that can cleanse keys at all levels and then rename them.

{
  "events": [
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1531896847915,
      "type": "EDITED",
      "details": {
        "previous_attributes": {
          "cluster_name": "Kylo",
          "spark_version": "4.1.x-scala2.11",
          "spark_conf": {
            "spark.databricks.acl.dfAclsEnabled": "true",
            "spark.databricks.repl.allowedLanguages": "python,sql"
          },
          "node_type_id": "Standard_DS3_v2",
          "driver_node_type_id": "Standard_DS3_v2",
          "autotermination_minutes": 10,
          "enable_elastic_disk": true,
          "cluster_source": "UI"
        },
        "attributes": {
          "cluster_name": "Kylo",
          "spark_version": "4.1.x-scala2.11",
          "node_type_id": "Standard_DS3_v2",
          "driver_node_type_id": "Standard_DS3_v2",
          "autotermination_minutes": 10,
          "enable_elastic_disk": true,
          "cluster_source": "UI"
        },
        "previous_cluster_size": {
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "cluster_size": {
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "user": ""
      }
    },
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1535540053785,
      "type": "TERMINATING",
      "details": {
        "reason": {
          "code": "INACTIVITY",
          "parameters": {
            "inactivity_duration_min": "15"
          }
        }
      }
    },
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1535537117300,
      "type": "EXPANDED_DISK",
      "details": {
        "previous_disk_size": 29454626816,
        "disk_size": 136828809216,
        "free_space": 17151311872,
        "instance_id": "6cea5c332af94d7f85aff23e5d8cea37"
      }
    }
  ]
}
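The desired output is the same structure with "." replaced by "_" in key names only (values untouched), e.g. the spark_conf object above would become:

"spark_conf": {
  "spark_databricks_acl_dfAclsEnabled": "true",
  "spark_databricks_repl_allowedLanguages": "python,sql"
}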

Solution

I created a template using ReplaceText and RouteOnContent to perform this task. The loop is required because the regex only replaces the first . in each JSON key on each pass. You might be able to refine this to perform all substitutions in a single pass, but after fuzzing the regex with look-ahead and look-behind groups for a few minutes, re-routing was faster. I verified this works with the JSON you provided, and also with JSON where the keys and values are on different lines (with the : on either line):

...
"spark_conf": {
                        "spark.databricks.acl.dfAclsEnabled":
 "true",
                        "spark.databricks.repl.allowedLanguages"
: "python,sql"
                    },
...
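The template itself is not reproduced here, but the loop is configured roughly along these lines (illustrative expressions, not the template's exact values):

ReplaceText (Evaluation Mode = Entire text, Replacement Strategy = Regex Replace):
  Search Value:       "([^".]+)\.([^"]+)"(\s*:)
  Replacement Value:  "$1_$2"$3

RouteOnContent (Match Requirement = content must contain match):
  dotted.key (dynamic property): "[^"]*\.[^"]*"\s*:

Route "dotted.key" back into ReplaceText so the next "." is stripped on the following pass, and route "unmatched" on to PutMongo once no dotted keys remain.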

If you want a single processor to do this in a single pass, you could also use an ExecuteScript processor with Groovy to ingest the JSON, quickly filter all JSON keys that contain ., perform a collect operation to do the replacement, and re-insert the keys into the JSON data.
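A minimal sketch of that Groovy approach (illustrative only; cleanKeys is a made-up helper name, and the script assumes ExecuteScript's standard session and REL_SUCCESS bindings):

import groovy.json.JsonOutput
import groovy.json.JsonSlurper
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

// Illustrative sketch, not the exact script: recursively replace "." with "_"
// in every map key, at any depth, leaving values untouched.
def cleanKeys(node) {
    if (node instanceof Map) {
        return node.collectEntries { k, v -> [(k.toString().replace('.', '_')): cleanKeys(v)] }
    }
    if (node instanceof List) {
        return node.collect { cleanKeys(it) }
    }
    return node
}

def flowFile = session.get()
if (flowFile == null) return

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def json = new JsonSlurper().parse(inputStream)
    outputStream.write(JsonOutput.toJson(cleanKeys(json)).getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)

This handles the entire events array in one pass, so no loop back through the flow is needed.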
