PySpark: Read nested JSON from a String Type Column and create columns
Problem Description
I have a dataframe in PySpark with 3 columns - json, date and object_id:
-----------------------------------------------------------------------------------------
|json                                                               |date      |object_id|
-----------------------------------------------------------------------------------------
|{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}|2020-08-01|xyz123   |
|{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}|2020-08-02|xyz123   |
|{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}            |2020-08-03|xyz123   |
-----------------------------------------------------------------------------------------
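For reference, the sample dataframe above can be built with a minimal sketch like this (rows and column names copied from the table; the json column holds plain strings):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows taken from the table above.
data = [
    ("{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}", "2020-08-01", "xyz123"),
    ("{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}", "2020-08-02", "xyz123"),
    ("{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}", "2020-08-03", "xyz123"),
]
df = spark.createDataFrame(data, ["json", "date", "object_id"])
```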
Now I have a list of variables: [a.c.60, a.n.60, a.d, g.h]. I need to extract only these variables from the json column of the above dataframe and add them as columns, with their respective values.
So in the end, the dataframe should look like:
-------------------------------------------------------------------------------------------------------
|json                                                      |date      |object_id|a.c.60|a.n.60|a.d |g.h |
-------------------------------------------------------------------------------------------------------
|{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,... |2020-08-01|xyz123   |0     |null  |0.01|null|
|{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,... |2020-08-02|xyz123   |null  |0     |0.01|null|
|{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}   |2020-08-03|xyz123   |null  |null  |0.02|0   |
-------------------------------------------------------------------------------------------------------
Please help me get this result dataframe. The main problem I am facing is that the incoming json data has no fixed structure. The json can be arbitrarily nested, but I need to extract only the given four variables. I achieved this in Pandas by flattening the json string and then extracting the 4 variables, but in Spark it is getting difficult.
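The Pandas code is not shown in the question, but the flattening it describes could look roughly like this (a sketch, assuming pd.json_normalize, with ast.literal_eval handling the single-quoted strings that json.loads would reject):

```python
import ast
import pandas as pd

# Two sample rows from the question. The strings use single quotes,
# so ast.literal_eval is used instead of json.loads.
rows = [
    "{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}",
    "{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}",
]

# json_normalize flattens nested dicts into dot-separated column names
# such as 'a.c.60' (sep='.' is the default).
flat = pd.json_normalize([ast.literal_eval(r) for r in rows])

wanted = ["a.c.60", "a.n.60", "a.d", "g.h"]
result = flat.reindex(columns=wanted)  # paths absent from a row become NaN
print(result)
```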
Recommended Answer
There are 2 ways to do this:
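A minimal sketch of one common approach (not necessarily the answerer's exact code) uses from_json with a partial schema that covers only the four wanted paths; fields missing from a row come back as null, which matches the expected output above:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

# Partial schema: only the four wanted paths. Everything else in the
# incoming json is simply ignored by from_json.
schema = StructType([
    StructField("a", StructType([
        StructField("c", StructType([StructField("60", DoubleType())])),
        StructField("n", StructType([StructField("60", DoubleType())])),
        StructField("d", DoubleType()),
    ])),
    StructField("g", StructType([
        StructField("h", DoubleType()),
    ])),
])

# allowSingleQuotes defaults to true for from_json, so the
# single-quoted sample strings parse without preprocessing.
result = (
    df.withColumn("parsed", F.from_json("json", schema))
      .withColumn("a.c.60", F.col("parsed.a.c.60"))
      .withColumn("a.n.60", F.col("parsed.a.n.60"))
      .withColumn("a.d", F.col("parsed.a.d"))
      .withColumn("g.h", F.col("parsed.g.h"))
      .drop("parsed")
)
result.show(truncate=False)
```

The other route often suggested is F.get_json_object(F.col("json"), "$.a.c.60") per path; note that its JSON parser is stricter and may return null for single-quoted input unless the quotes are normalized first (e.g. with regexp_replace).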