Pyspark: cast array with nested struct to string


Question

I have a pyspark dataframe with a column named Filters: "array<struct<Op:string,Type:string,Val:string>>"

I want to save my dataframe in a CSV file; for that I need to cast the array to string type.

I tried to cast it with DF.Filters.tostring() and DF.Filters.cast(StringType()), but both solutions generate an error message for each row in the Filters column:

org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19

The code is as follows:

from pyspark.sql.types import StringType

DF.printSchema()

 |-- ClientNum: string (nullable = true)
 |-- Filters: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Op: string (nullable = true)
 |    |    |-- Type: string (nullable = true)
 |    |    |-- Val: string (nullable = true)

DF_cast = DF.select('ClientNum', DF.Filters.cast(StringType()))

DF_cast.printSchema()

|-- ClientNum: string (nullable = true)
|-- Filters: string (nullable = true)

DF_cast.show()

+---------+------------------------------------------------------------------+
|ClientNum|Filters                                                           |
+---------+------------------------------------------------------------------+
|32103    |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d9e517ce|
|218056   |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3c744494|
+---------+------------------------------------------------------------------+

Sample JSON data:

{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}

Thanks!!

Answer

I created a sample JSON dataset to match that schema:

{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}

from pyspark.sql.types import StringType

s.select("ClientNum", s.Filters.cast(StringType())).show(truncate=False)

+---------+------------------------------------------------------------------+
|ClientNum|Filters                                                           |
+---------+------------------------------------------------------------------+
|abc123   |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@60fca57e|
+---------+------------------------------------------------------------------+

Your problem is best solved using the explode() function, which flattens the array, followed by the star-expand notation:

s.selectExpr("explode(Filters) AS structCol").selectExpr("structCol.*").show()
+---+----+---+
| Op|Type|Val|
+---+----+---+
|foo| bar|baz|
+---+----+---+

To make it a single-column string separated by commas:

from pyspark.sql import functions as F

s.selectExpr("explode(Filters) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("single_col")).show()
+-----------+
| single_col|
+-----------+
|foo,bar,baz|
+-----------+

Explode array reference: Flatten rows in Spark

Star expand reference for "struct" type: How to flatten a struct in a spark dataframe?
