How to extract an element from an array in PySpark
Question
I have a DataFrame of the following form:
col1|col2|col3|col4
xxxx|yyyy|zzzz|[1111],[2222]
I want my output to look like this:
col1|col2|col3|col4|col5
xxxx|yyyy|zzzz|1111|2222
My col4 is an array, and I want to split its elements into separate columns. What needs to be done?
I have seen many answers that use flatMap, but they add rows. I only want the elements placed in additional columns of the same row.
Here is my actual schema:
root
|-- PRIVATE_IP: string (nullable = true)
|-- PRIVATE_PORT: integer (nullable = true)
|-- DESTINATION_IP: string (nullable = true)
|-- DESTINATION_PORT: integer (nullable = true)
|-- collect_set(TIMESTAMP): array (nullable = true)
| |-- element: string (containsNull = true)
Also, could someone explain how to do this with both DataFrames and RDDs?
Recommended answer
Create sample data:
from pyspark.sql import Row
# sc and spark are the SparkContext and SparkSession that the pyspark shell provides
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123, 234])]
rdd = sc.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()
#+----+----+----+----------+
#|col1|col2|col3| col4|
#+----+----+----+----------+
#| xx| yy| zz|[123, 234]|
#+----+----+----+----------+
Use getItem to extract an element from the array column. In your actual case, replace col4 with collect_set(TIMESTAMP):
df = df.withColumn("col5", df["col4"].getItem(1)).withColumn("col4", df["col4"].getItem(0))
df.show()
#+----+----+----+----+----+
#|col1|col2|col3|col4|col5|
#+----+----+----+----+----+
#| xx| yy| zz| 123| 234|
#+----+----+----+----+----+