PySpark "explode" dict in column
Problem description
I have a column 'true_recoms' in a Spark DataFrame:
-RECORD 17-----------------------------------------------------------------
item | 20380109
true_recoms | {"5556867":1,"5801144":5,"7397596":21}
I need to 'explode' this column to get something like this:
item | 20380109
recom_item | 5556867
recom_cnt | 1
..............
item | 20380109
recom_item | 5801144
recom_cnt | 5
..............
item | 20380109
recom_item | 7397596
recom_cnt | 21
I tried to use from_json, but it doesn't work:
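from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema_json = StructType(fields=[
    StructField("item", StringType()),
    StructField("recoms", StringType())
])
df.select(col("true_recoms"), from_json(col("true_recoms"), schema_json)).show(5)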
+--------+--------------------+------+
| item| true_recoms|true_r|
+--------+--------------------+------+
|31746548|{"32731749":3,"31...| [,]|
|17359322|{"17359392":1,"17...| [,]|
|31480894|{"31480598":1,"31...| [,]|
| 7265665|{"7265891":1,"503...| [,]|
|31350949|{"32218698":1,"31...| [,]|
+--------+--------------------+------+
only showing top 5 rows
Recommended answer
The schema is incorrectly defined. You declare it to be a struct with two string fields, item and recoms, while neither field is present in the document.
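As a rough plain-Python analogy (not Spark's actual implementation), reading the document through a struct schema amounts to looking up each declared field name in the parsed object. The helper name read_as_struct is hypothetical; the sample string is the one from the question. Neither item nor recoms is a key of the object, so both lookups come back empty, which matches the empty [,] cells in the question's output.

```python
import json

def read_as_struct(document, field_names):
    # Hypothetical helper: mimic a struct schema by looking up each
    # declared field by name; fields missing from the document become None.
    parsed = json.loads(document)
    return {name: parsed.get(name) for name in field_names}

doc = '{"5556867":1,"5801144":5,"7397596":21}'
result = read_as_struct(doc, ["item", "recoms"])
print(result)  # {'item': None, 'recoms': None}
```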
Unfortunately, from_json can return only structs or arrays of structs, so redefining the schema as MapType(StringType(), LongType()) is not an option. (Newer Spark releases do accept a MapType schema in from_json, so this limitation applies to older versions.)
Personally, I would use a udf:
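from pyspark.sql.functions import udf, explode
import json

@udf("map<string, bigint>")
def parse(s):
    # Return the JSON object as a dict, or None (null) if it cannot be parsed.
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers a null input, where s is None.
        return None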
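The body of the udf is ordinary Python, so its behaviour can be checked without a Spark session. The sketch below uses a hypothetical undecorated copy named parse_plain; treating a null (None) input as unparsable is an assumption added here. JSON object keys are always strings, which is why the map type is declared as map<string, bigint>.

```python
import json

def parse_plain(s):
    # Hypothetical plain-Python copy of the udf body, for local checking only.
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or a None input yields None (a null map in Spark).
        return None

good = parse_plain('{"5556867":1,"5801144":5}')
truncated = parse_plain('{"5556867":')
missing = parse_plain(None)

print(good)       # {'5556867': 1, '5801144': 5}
print(truncated)  # None
print(missing)    # None
```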
It can be applied like this:
df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)
df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# | item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548| 5801144| 5|
# |31746548| 7397596| 21|
# |31746548| 5556867| 1|
# +--------+----------+---------+
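For intuition, explode on a map column produces one output row per key/value pair and repeats the other selected columns on each row. A rough plain-Python equivalent might look like the sketch below. The helper explode_recoms is hypothetical, and its row order follows the parsed dict rather than Spark's map order, which is not guaranteed.

```python
import json

def explode_recoms(item, true_recoms):
    # Hypothetical helper: one (item, recom_item, recom_cnt) tuple per
    # key/value pair in the JSON object.
    parsed = json.loads(true_recoms)
    return [(item, recom_item, recom_cnt)
            for recom_item, recom_cnt in parsed.items()]

rows = explode_recoms(31746548, '{"5556867":1,"5801144":5,"7397596":21}')
for row in rows:
    print(row)
```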