How to cast string to ArrayType of dictionary (JSON) in PySpark
Problem Description
Trying to cast StringType to ArrayType of JSON for a dataframe generated from CSV.
Using pyspark on Spark 2.
The CSV file I am dealing with is as follows:
date,attribute2,count,attribute3
2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'
As shown above, it contains one attribute, "attribute3", as a literal string, which is technically a list of dictionaries (JSON) with an exact length of 2. (This is the output of the distinct function.)
printSchema()
attribute3: string (nullable = true)
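For reference, a dataframe with this schema could come from reading the CSV roughly like this; the file name and reader options here are assumptions on my part, not from the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "input.csv" is a placeholder path; header=True keeps the first row as column names.
# Depending on how the single quotes around attribute3 should be treated,
# a quote="'" option may also be needed.
dataframe = spark.read.csv("input.csv", header=True, inferSchema=True)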
I am trying to cast "attribute3" to ArrayType as follows:
temp = dataframe.withColumn(
    "attribute3_modified",
    dataframe["attribute3"].cast(ArrayType())
)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)
Indeed, ArrayType expects a datatype as an argument. I tried with "json", but it did not work.
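As a minimal illustration (not from the original post), ArrayType has to be built with an element DataType, and even a valid ArrayType does not make cast parse JSON text, which is what the answer below addresses with from_json:

from pyspark.sql.types import ArrayType, StringType

# ArrayType needs the element type as its first argument
valid_type = ArrayType(StringType())   # array<string>
# ArrayType()                          # raises the TypeError shown above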
Desired Output -
In the end, I need to convert attribute3 to ArrayType() or a plain Python list. (I am trying to avoid the use of eval.)
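As an aside that is not part of the accepted answer: if a plain Python list is really needed on the driver, json.loads is the usual eval-free route, assuming each stored value is valid JSON once the surrounding single quotes from the CSV are stripped:

import json

# collect the raw strings to the driver and parse them without eval
rows = dataframe.select("attribute3").collect()
parsed = [json.loads(r.attribute3) for r in rows]   # list of lists of dicts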
How do I convert it to ArrayType, so that I can treat it as a list of JSONs?
Am I missing anything here? (The documentation does not address this problem in a straightforward way.)
Answer
Use from_json with a schema that matches the actual data in the attribute3 column to convert the JSON into ArrayType:
The original dataframe:
df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: string (nullable = true)
from pyspark.sql.functions import from_json
from pyspark.sql.types import *
Create the schema:
schema = ArrayType(
    StructType([StructField("key", StringType()),
                StructField("key2", IntegerType())]))
Use from_json:
df = df.withColumn("attribute3", from_json(df.attribute3, schema))
df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- key: string (nullable = true)
# | | |-- key2: integer (nullable = true)
df.show(1, False)
#+----------+----------+-----+------------------------------------+
#|date |attribute2|count|attribute3 |
#+----------+----------+-----+------------------------------------+
#|2017-09-03|attribute1|2 |[[value, 2], [value, 2], [value, 2]]|
#+----------+----------+-----+------------------------------------+
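Once attribute3 is an array of structs, it can be handled like a list of JSON objects; for example (my own illustration, not part of the original answer):

from pyspark.sql.functions import explode

# one output row per element of the attribute3 array
df.select("date", explode("attribute3").alias("item")) \
  .select("date", "item.key", "item.key2") \
  .show(truncate=False)

# or index into the array directly
df.select(df.attribute3[0].key2).show()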