PySpark "explode" dict in column

Views: 104 | Posted: 2020/9/4 1:27:02 | Tags: apache-spark, pyspark, explode

Question

I have a column 'true_recoms' in a Spark DataFrame:

-RECORD 17----------------------------------------------------------------- 
item        | 20380109                                                                                                                                                                  
true_recoms | {"5556867":1,"5801144":5,"7397596":21}

I need to 'explode' this column to get something like this:

item        | 20380109                                                                                                                                                                  
recom_item  | 5556867
recom_cnt   | 1
..............
item        | 20380109                                                                                                                                                                  
recom_item  | 5801144
recom_cnt   | 5
..............
item        | 20380109                                                                                                                                                                  
recom_item  | 7397596
recom_cnt   | 21

I tried to use from_json, but it doesn't work:

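    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType

    schema_json = StructType(fields=[
        StructField("item", StringType()),
        StructField("recoms", StringType())
    ])
    df.select(col("true_recoms"),from_json(col("true_recoms"), schema_json)).show(5)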

+--------+--------------------+------+
|    item|         true_recoms|true_r|
+--------+--------------------+------+
|31746548|{"32731749":3,"31...|   [,]|
|17359322|{"17359392":1,"17...|   [,]|
|31480894|{"31480598":1,"31...|   [,]|
| 7265665|{"7265891":1,"503...|   [,]|
|31350949|{"32218698":1,"31...|   [,]|
+--------+--------------------+------+
only showing top 5 rows

Recommended answer

The schema is incorrectly defined. You declare it as a struct with two string fields:

item
recoms

while neither field is present in the document.
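
The mismatch can be illustrated without Spark. The following is a minimal plain-Python sketch that only mimics the field lookup with the standard json module; it is not how from_json is implemented, and the variable names are chosen here for illustration. It shows that the declared field names are not keys of the document, which is consistent with the empty [,] values in the question's output.

```python
import json

# Sample value copied from the question's true_recoms column.
doc = json.loads('{"5556867":1,"5801144":5,"7397596":21}')

# Field names declared in the question's StructType schema.
declared_fields = ["item", "recoms"]

# Neither declared name is a key in the document, so both lookups miss.
lookup = {name: doc.get(name) for name in declared_fields}
print(lookup)

# The keys that are actually present are the recommended item ids.
actual_keys = sorted(doc)
print(actual_keys)
```

Because the keys vary from row to row, a fixed list of struct fields cannot describe this column; a map from string to integer is the natural shape for it.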

Unfortunately, from_json can return only structs or arrays of structs, so redefining the schema as

MapType(StringType(), LongType())

is not an option.

Personally, I would use a udf:

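from pyspark.sql.functions import udf, explode
import json

# Declare the return type as a map from string keys to bigint values.
@udf("map<string, bigint>")
def parse(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        # Malformed JSON yields null instead of failing the job.
        return None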
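
The body of the udf can be tried locally as an ordinary function before wrapping it. The sketch below uses a hypothetical name, parse_plain, and makes one change that is not part of the answer: it also catches TypeError, because json.loads raises TypeError rather than json.JSONDecodeError when it is given None. This assumes that null values in the column reach the function as None.

```python
import json

def parse_plain(s):
    """Hypothetical plain-Python twin of the udf body, for local testing."""
    try:
        return json.loads(s)
    except (TypeError, json.JSONDecodeError):
        # TypeError covers None input; JSONDecodeError covers malformed text.
        return None

ok = parse_plain('{"5556867":1,"5801144":5,"7397596":21}')
bad = parse_plain('{"5556867":1,')   # truncated JSON
missing = parse_plain(None)          # stands in for a null column value

print(ok)
print(bad, missing)
```

If null inputs are possible in the real data, the same except clause could be used inside the decorated function.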

It can be applied like this:

df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)

df.select("item",  explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# |    item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548|   5801144|        5|
# |31746548|   7397596|       21|
# |31746548|   5556867|        1|
# +--------+----------+---------+
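
What explode does with a map column can be sketched in plain Python: each key/value pair becomes its own output row, with the other columns repeated. The helper name explode_rows is hypothetical and the code only imitates the row expansion; it does not use Spark. The rows are sorted here to make the result deterministic, since the table above lists the entries in a different order from the JSON text, so the order should not be relied on.

```python
import json

def explode_rows(item, true_recoms):
    """Hypothetical helper: yield one (item, recom_item, recom_cnt) tuple per map entry."""
    for recom_item, recom_cnt in json.loads(true_recoms).items():
        yield (item, recom_item, recom_cnt)

# Same sample row as in the example above.
rows = sorted(explode_rows(31746548, '{"5556867":1,"5801144":5,"7397596":21}'))
for row in rows:
    print(row)
```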
