How to cast string to ArrayType of dictionary (JSON) in PySpark

Problem description

I am trying to cast a StringType column to an ArrayType of JSON for a dataframe generated from a CSV file.

I am using PySpark on Spark 2.

The CSV file I am dealing with is as follows:

date,attribute2,count,attribute3
2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'

As shown above, it contains one attribute, "attribute3", stored as a literal string, which is technically a list of dictionaries (JSON) with an exact length of 2. (This is the output of the function distinct.)

Snippet from printSchema():

attribute3: string (nullable = true)

I am trying to cast "attribute3" to ArrayType as follows:

temp = dataframe.withColumn(
    "attribute3_modified",
    dataframe["attribute3"].cast(ArrayType())
)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)

Indeed, ArrayType expects a data type as its argument. I tried passing "json", but that did not work.

Desired output: in the end, I need to convert attribute3 to an ArrayType or a plain Python list. (I am trying to avoid using eval.)

How do I convert it to ArrayType, so that I can treat it as a list of JSON objects?

Am I missing anything here?

(The documentation does not address this problem in a straightforward way.)

Solution

Use from_json with a schema that matches the actual data in the attribute3 column to convert the JSON string to an ArrayType:

Original data frame:

df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: string (nullable = true)

from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

Create the schema:

schema = ArrayType(
    StructType([StructField("key", StringType()), 
                StructField("key2", IntegerType())]))

Use from_json:

df = df.withColumn("attribute3", from_json(df.attribute3, schema))

df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- key: string (nullable = true)
# |    |    |-- key2: integer (nullable = true)

df.show(1, False)
#+----------+----------+-----+------------------------------------+
#|date      |attribute2|count|attribute3                          |
#+----------+----------+-----+------------------------------------+
#|2017-09-03|attribute1|2    |[[value, 2], [value, 2], [value, 2]]|
#+----------+----------+-----+------------------------------------+
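
If a plain Python list is all that is needed and the data is small enough to handle outside Spark, the standard-library json module decodes the string without eval. This is only a sketch under assumptions, not part of the original answer: it assumes the file uses a single quote as its CSV quote character (as the sample rows suggest), and the parse_rows helper name is hypothetical.

```python
import csv
import io
import json

# Sample rows copied from the question; single quotes wrap the text fields.
SAMPLE_CSV = """date,attribute2,count,attribute3
2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'
"""


def parse_rows(text):
    """Hypothetical helper: read the CSV and decode attribute3 with json.loads."""
    # quotechar="'" keeps the commas inside attribute3 from splitting the field.
    reader = csv.DictReader(io.StringIO(text), quotechar="'")
    rows = []
    for row in reader:
        row["attribute3"] = json.loads(row["attribute3"])  # list of dicts
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE_CSV)
print(rows[1]["attribute3"][2]["key2"])
```

Unlike eval, json.loads only accepts JSON syntax, so a malformed or malicious string raises an error instead of executing code.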
