Using Spark to expand JSON string by rows and columns

Question

I'm new to Spark and working with JSON, and I'm having trouble doing something fairly simple (I think). I've tried using parts of solutions to similar questions but can't quite get it right. I currently have a Spark dataframe with several columns representing variables. Each row is a unique combination of variable values. I then have a UDF that is applied to every row, which takes each of the columns as input, does some analysis, and outputs a summary table as a JSON string for each row, saving this result in a new column of the table. Some small sample data looks like:

+------+-----+------+----------------------------------------------------------------------+
|Var 1 |Var 2|Var 3 |JSON Table                                                            |
+------+-----+------+----------------------------------------------------------------------+
|True  |10%  |200   |[{"Out_1": "Mean", "Out_2": "25"}, {"Out_1": "Median", "Out_2": "21"}]|
|False |15%  |150   |[{"Out_1": "Mean", "Out_2": "19"}, {"Out_1": "Median", "Out_2": "18"}]|
|True  |12%  |100   |[{"Out_1": "Mean", "Out_2": "22"}, {"Out_1": "Median", "Out_2": "20"}]|
+------+-----+------+----------------------------------------------------------------------+
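
For illustration only, a minimal sketch of the kind of UDF described above; the name summarize and the fixed return value are hypothetical stand-ins for whatever the real analysis produces:

import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical stand-in for the analysis UDF: it takes the variable columns
# and returns the summary table serialized as a JSON string
@udf(returnType=StringType())
def summarize(var1, var2, var3):
    # the real analysis would go here; fixed values for illustration only
    return json.dumps([{"Out_1": "Mean", "Out_2": "25"},
                       {"Out_1": "Median", "Out_2": "21"}])

# df = df.withColumn("JSON Table", summarize("Var 1", "Var 2", "Var 3"))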

I would like to transform this into the following format:

+------+-----+------+------+-----+
|Var 1 |Var 2|Var 3 |Out_1 |Out_2|
+------+-----+------+------+-----+
|True  |10%  |200   |Mean  |25   |
|True  |10%  |200   |Median|21   |
|False |15%  |150   |Mean  |19   |
|False |15%  |150   |Median|18   |
|True  |12%  |100   |Mean  |22   |
|True  |12%  |100   |Median|20   |
+------+-----+------+------+-----+

In reality there are many more variables, millions of rows, and larger JSON strings with more outputs, but the core problem remains the same. I've basically tried getting the JSON schema and using from_json, like so:

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Infer the schema from the JSON strings themselves
schema = spark.read.json(df.rdd.map(lambda row: row["JSON Table"])).schema

df = df\
     .withColumn("JSON Table", from_json("JSON Table", schema))\
     .select(col('*'), col('JSON Table.*'))

df.show()

This appears to infer the JSON structure correctly (albeit every value is read as a string, though most are integers), but the resulting dataframe is empty despite having the correct column headings. Any advice on how to deal with this?

Answer

Assuming your JSON Table column is a JSON string, you can explicitly set the schema, explode the result of from_json, and then select your columns.

df.show() #sample dataframe
+-----+-----+-----+----------------------------------------------------------------------+
|Var 1|Var 2|Var 3|JSON Table                                                            |
+-----+-----+-----+----------------------------------------------------------------------+
|true |10%  |200  |[{"Out_1": "Mean", "Out_2": "25"}, {"Out_1": "Median", "Out_2": "21"}]|
|false|15%  |150  |[{"Out_1": "Mean", "Out_2": "19"}, {"Out_1": "Median", "Out_2": "18"}]|
|true |12%  |100  |[{"Out_1": "Mean", "Out_2": "22"}, {"Out_1": "Median", "Out_2": "20"}]|
+-----+-----+-----+----------------------------------------------------------------------+

#sample schema
#root
# |-- Var 1: boolean (nullable = true)
# |-- Var 2: string (nullable = true)
# |-- Var 3: long (nullable = true)
# |-- JSON Table: string (nullable = true)

from pyspark.sql import functions as F
from pyspark.sql.types import *

# Each array element becomes a map of output name -> value (all values read as strings)
schema = ArrayType(MapType(StringType(), StringType()))

# Parse the JSON string, explode the array to one row per map, then pull each
# key out into its own column
df.withColumn("JSON Table", F.explode(F.from_json("JSON Table", schema)))\
  .select("Var 1", "Var 2", "Var 3", "JSON Table.Out_1", "JSON Table.Out_2").show()

+-----+-----+-----+------+-----+
|Var 1|Var 2|Var 3| Out_1|Out_2|
+-----+-----+-----+------+-----+
| true|  10%|  200|  Mean|   25|
| true|  10%|  200|Median|   21|
|false|  15%|  150|  Mean|   19|
|false|  15%|  150|Median|   18|
| true|  12%|  100|  Mean|   22|
| true|  12%|  100|Median|   20|
+-----+-----+-----+------+-----+
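
One caveat carries over from the question: with the MapType schema every extracted value comes back as a string. A minimal follow-up sketch (reusing the schema above; the cast and aliases are an assumption about the desired types, not part of the original answer) converts Out_2 back to an integer:

# Same explode/select as above, but cast Out_2 since the map values are strings
result = df.withColumn("JSON Table", F.explode(F.from_json("JSON Table", schema)))\
           .select("Var 1", "Var 2", "Var 3",
                   F.col("JSON Table.Out_1").alias("Out_1"),
                   F.col("JSON Table.Out_2").cast("int").alias("Out_2"))

result.printSchema()  # Out_2 is now integer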
