Using Spark to expand JSON string by rows and columns

Problem Description

I'm new to Spark and working with JSON, and I'm having trouble doing something fairly simple (I think). I've tried using parts of solutions to similar questions but can't quite get it right. I currently have a Spark dataframe with several columns representing variables; each row is a unique combination of variable values. A UDF is applied to every row, taking each of the columns as input, doing some analysis, and outputting a summary table as a JSON string, which is saved in a new column of the table. Some small sample data looks like:

+------+-----+------+----------------------------------------------------------------------+
|Var 1 |Var 2|Var 3 |JSON Table                                                            |
+------+-----+------+----------------------------------------------------------------------+
|True  |10%  |200   |[{"Out_1": "Mean", "Out_2": "25"}, {"Out_1": "Median", "Out_2": "21"}]|
|False |15%  |150   |[{"Out_1": "Mean", "Out_2": "19"}, {"Out_1": "Median", "Out_2": "18"}]|
|True  |12%  |100   |[{"Out_1": "Mean", "Out_2": "22"}, {"Out_1": "Median", "Out_2": "20"}]|
+------+-----+------+----------------------------------------------------------------------+

I would like to transform this into the following format:

+------+-----+------+------+-----+
|Var 1 |Var 2|Var 3 |Out_1 |Out_2|
+------+-----+------+------+-----+
|True  |10%  |200   |Mean  |25   |
|True  |10%  |200   |Median|21   |
|False |15%  |150   |Mean  |19   |
|False |15%  |150   |Median|18   |
|True  |12%  |100   |Mean  |22   |
|True  |12%  |100   |Median|20   |
+------+-----+------+------+-----+

In reality there are many more variables, millions of rows and larger JSON strings with more outputs, but the core problem remains the same. I've basically tried getting the JSON schema and using from_json like so:

from pyspark.sql.functions import col, from_json

# Infer the schema by reading the JSON strings back through spark.read.json
schema = spark.read.json(df.rdd.map(lambda row: row["JSON Table"])).schema

df = df\
     .withColumn("JSON Table", from_json("JSON Table", schema))\
     .select(col('*'), col('JSON Table.*'))

df.show()

This appears to get the JSON structure correctly (albeit with every value read as a string, though most are integers), but the resulting dataframe is empty, even though it has the correct column headings. Any advice on how to deal with this?

Recommended Answer

Assuming your JSON Table column is a JSON string, you could explicitly set your schema, explode(from_json(...)), and then select your columns.

df.show() #sample dataframe
+-----+-----+-----+----------------------------------------------------------------------+
|Var 1|Var 2|Var 3|JSON Table                                                            |
+-----+-----+-----+----------------------------------------------------------------------+
|true |10%  |200  |[{"Out_1": "Mean", "Out_2": "25"}, {"Out_1": "Median", "Out_2": "21"}]|
|false|15%  |150  |[{"Out_1": "Mean", "Out_2": "19"}, {"Out_1": "Median", "Out_2": "18"}]|
|true |12%  |100  |[{"Out_1": "Mean", "Out_2": "22"}, {"Out_1": "Median", "Out_2": "20"}]|
+-----+-----+-----+----------------------------------------------------------------------+

#sample schema
#root
# |-- Var 1: boolean (nullable = true)
# |-- Var 2: string (nullable = true)
# |-- Var 3: long (nullable = true)
# |-- JSON Table: string (nullable = true)


from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, MapType, StringType

# Each row holds a JSON array of objects, so parse it as an array of
# string-to-string maps
schema = ArrayType(MapType(StringType(), StringType()))

# from_json parses the string into an array; explode yields one row per element
df.withColumn("JSON Table", F.explode(F.from_json("JSON Table", schema)))\
  .select("Var 1", "Var 2", "Var 3", "JSON Table.Out_1", "JSON Table.Out_2").show()

+-----+-----+-----+------+-----+
|Var 1|Var 2|Var 3| Out_1|Out_2|
+-----+-----+-----+------+-----+
| true|  10%|  200|  Mean|   25|
| true|  10%|  200|Median|   21|
|false|  15%|  150|  Mean|   19|
|false|  15%|  150|Median|   18|
| true|  12%|  100|  Mean|   22|
| true|  12%|  100|Median|   20|
+-----+-----+-----+------+-----+
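
As a side note on the question's point about every value being read as a string: with the MapType schema above, both Out_1 and Out_2 come out as strings. Below is a minimal sketch (not from the original answer; the variable names are illustrative) of one way to get typed output, assuming the Out_2 values are always integer-like. It also shows how the question's inferred schema can be reused by wrapping it in ArrayType, since spark.read.json infers the schema of each array element rather than of the array itself.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

# Sketch: reuse the inferred per-element schema, wrapped in ArrayType --
# spark.read.json sees one row per array element, while from_json needs
# the schema of the whole array column.
element_schema = spark.read.json(df.rdd.map(lambda row: row["JSON Table"])).schema
array_schema = ArrayType(element_schema)

typed = df.withColumn("JSON Table", F.explode(F.from_json("JSON Table", array_schema)))\
          .select("Var 1", "Var 2", "Var 3",
                  F.col("JSON Table.Out_1").alias("Out_1"),
                  # Out_2 is parsed as a string (e.g. "25"); cast it,
                  # assuming the values are always integer-like
                  F.col("JSON Table.Out_2").cast("int").alias("Out_2"))

typed.show()

Casting after parsing is likely safer here than declaring IntegerType in the schema itself, because the JSON stores the numbers as quoted strings, and a mismatched field type would typically come back null rather than being coerced.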
