Spark DataFrame: explode list column
Problem description
I have the output of a Spark Aggregator, which is a List[Character]:
case class Character(name: String, secondName: String, faculty: String)
val charColumn = HPAggregator.toColumn
val resultDF = someDF.select(charColumn)
So my dataframe looks like:
+-----------------------------------------------+
| value |
+-----------------------------------------------+
|[[harry, potter, gryffindor],[ron, weasley ... |
+-----------------------------------------------+
Now I want to convert it to:
+-------+-------------+------------+
| name  | second_name | faculty    |
+-------+-------------+------------+
| harry | potter      | gryffindor |
| ron   | weasley     | gryffindor |
+-------+-------------+------------+
How can I do that properly?
Solution
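This can be done with the explode function, which turns each element of the array column into its own row, followed by selecting the elements of each exploded value into separate columns.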
Below is an example:
>>> df = spark.createDataFrame([[[['a','b','c'], ['d','e','f'], ['g','h','i']]]],["col1"])
>>> df.show(20, False)
+---------------------------------------------------------------------+
|col1 |
+---------------------------------------------------------------------+
|[WrappedArray(a, b, c), WrappedArray(d, e, f), WrappedArray(g, h, i)]|
+---------------------------------------------------------------------+
>>> from pyspark.sql.functions import explode
>>> out_df = df.withColumn("col2", explode(df.col1)).drop('col1')
>>>
>>> out_df.show()
+---------+
| col2|
+---------+
|[a, b, c]|
|[d, e, f]|
|[g, h, i]|
+---------+
>>> out_df.select(out_df.col2[0].alias('c1'), out_df.col2[1].alias('c2'), out_df.col2[2].alias('c3')).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| a| b| c|
| d| e| f|
| g| h| i|
+---+---+---+
>>>