从bigquery中的json字符串中提取键和值,其中json文档中没有指定的键 [英] Extract keys and values from json string in bigquery where there is no specified key in the json document
问题描述
我在bigquery中有一个表,其中有一个对象,对于每个对象,我都有一些字符串化的json.在json中,示例行如下所示:
I have a table in bigquery where I have object and for each object I have some stringified json. In json, an example row look like below:
{
"ObjectID": "1984931229",
"indexed_abstract": "{\"IndexLength\":123,\"InvertedIndex\":{\"Twenty-seven\":[0],\"metastatic\":[1,45],\"breast\":[2],\"adenocarcinoma\":[3],\"patients,\":[4]}}"
}
在indexed_abstract
内部,我们有一个InvertedIndex
,其中包含一些关键字以及这些关键字在ObjectID
中出现了多少次.
where inside the indexed_abstract
we have an InvertedIndex
which contains some keywords and how many times these keywords appeared in the ObjectID
.
现在,我想通过使用bigquery解析json来访问字符串化的json,对于每个ObjectID
,我想创建一个嵌套字段,其中包含关键字,对应的数组和对应的数组的长度.
Now I want to access the stringified json by parsing the json using bigquery and for each ObjectID
I want to create a nested field where I have the keyword, the corresponding array and the length of the corresponding array.
例如,在这种情况下,输出将如下所示:
For example in this case the output would look like the following:
+------------+----------------+---------------+-------------------+
| ObjectID | keyword.key | keyword.count | keyword.positions |
+------------+----------------+---------------+-------------------+
| 1984931229 | Twenty-seven | 1 | [0] |
| | metastatic | 2 | [1,45] |
| | breast | 1 | [2] |
| | adenocarcinoma | 1 | [3] |
| | patients | 1 | [4] |
+------------+----------------+---------------+-------------------+
我知道我可以使用JSON_EXTRACT函数,但是我不确定在倒排索引中访问关键字和对应于它们的数组的键是什么.
I understand I could use JSON_EXTRACT function but I am not sure what would be my key inside the inverted index to access the keywords and the arrays corresponding to them.
推荐答案
以下是BigQuery标准SQL
Below is for BigQuery Standard SQL
#standardSQL
SELECT ObjectID,
ARRAY(
SELECT AS STRUCT
key,
ARRAY_LENGTH(SPLIT(value)) `count`,
value positions
FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(indexed_abstract, '$.InvertedIndex'), r'"[^"]+":\[[\d,]*?]')) pair,
UNNEST([STRUCT(REPLACE(SPLIT(pair, ':')[OFFSET(0)], '"', '') AS key, SPLIT(pair, ':')[OFFSET(1)] AS value)])
) keyword
FROM `project.dataset.table`
如果要应用于您的问题的样本数据-结果为
If to apply to sample data from your question - result is
Row ObjectID keyword.key keyword.count keyword.positions
1 1984931229 Twenty-seven 1 [0]
metastatic 2 [1,45]
breast 1 [2]
adenocarcinoma 1 [3]
patients 1 [4]
关于Op的评论的最新信息-我想知道我是否想将位置设置为数组(重复的字段),我该怎么做?
更改只需一行即可
SPLIT(REGEXP_REPLACE(value, r'\[|]', '')) positions
这篇关于从bigquery中的json字符串中提取键和值,其中json文档中没有指定的键的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!