pyspark/dataframe - creating a nested structure


Problem Description

I'm using PySpark with DataFrames and would like to create a nested structure as below.

Before:

Column 1 | Column 2 | Column 3
---------+----------+---------
A        | B        | 1
A        | B        | 2
A        | C        | 1

After:

Column 1 | Column 4
---------+------------
A        | [B : [1,2]]
A        | [C : [1]]

Is this possible?

Recommended Answer

I don't think you can get that exact output, but you can come close. The problem is the key names for Column 4: in Spark, structs need to have a fixed set of columns known in advance. But let's leave that for later; first, the aggregation:

import pyspark
from pyspark.sql import functions as F

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# Sample data matching the "before" table above.
data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]
columns = ['Column1', 'Column2', 'Column3']

data = spark.createDataFrame(data, columns)

# Register a temp view so the aggregation below can be written in SQL.
data.createOrReplaceTempView("data")
data.show()

# Result
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
|      A|      B|      1|
|      A|      B|      2|
|      A|      C|      1|
+-------+-------+-------+

# Group by Column1/Column2 and collect the Column3 values into a struct field.
nested = spark.sql("SELECT Column1, Column2, STRUCT(COLLECT_LIST(Column3) AS data) AS Column4 FROM data GROUP BY Column1, Column2")
nested.toJSON().collect()

# Result
['{"Column1":"A","Column2":"C","Column4":{"data":[1]}}',
 '{"Column1":"A","Column2":"B","Column4":{"data":[1,2]}}']

Which is almost what you want, right? The problem is that if you do not know your key names in advance (that is, the values in Column 2), Spark cannot determine the structure of your data. Also, I am not entirely sure how you can use the value of a column as the key for a struct unless you use a UDF (maybe combined with a PIVOT?):

# The struct return type must enumerate every possible key up front.
datatype = 'struct<B:array<bigint>,C:array<bigint>>'  # Add any other potential keys here.

@F.udf(datatype)
def replace_struct_name(column2_value, column4_value):
    # Use the Column2 value as the struct field name; unlisted fields stay null.
    return {column2_value: column4_value['data']}

nested.withColumn('Column4', replace_struct_name(F.col("Column2"), F.col("Column4"))).toJSON().collect()

# Output
['{"Column1":"A","Column2":"C","Column4":{"C":[1]}}',
 '{"Column1":"A","Column2":"B","Column4":{"B":[1,2]}}']

This of course has the drawback that the set of keys must be finite and known in advance; otherwise, values for any other keys will be silently ignored.
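If the keys genuinely cannot be known in advance, a MAP column sidesteps the fixed-schema requirement of structs, since map keys are data rather than schema. A minimal alternative sketch using pyspark.sql.functions.create_map (an assumption on my part, not part of the original answer):

# Build a one-entry map keyed by the Column2 value; unlike struct field
# names, map keys do not have to be declared up front.
nested.withColumn(
    'Column4',
    F.create_map(F.col('Column2'), F.col('Column4.data'))
).toJSON().collect()

The output is keyed the same way as the UDF version, but previously unseen Column2 values are handled automatically instead of being dropped.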

