PySpark: zip arrays in a DataFrame
Question
I have the following PySpark DataFrame:
+------+----------------+
| id| data |
+------+----------------+
| 1| [10, 11, 12]|
| 2| [20, 21, 22]|
| 3| [30, 31, 32]|
+------+----------------+
In the end, I want to have the following DataFrame:
+--------+----------------------------------+
| id | data |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+
In order to do this, I first extract the data arrays as follows:
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)
This way, samples1 is an RDD with the content:
[[10,20,30],[11,21,31],[12,22,32]]
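For reference, the effect of zip on the collected arrays can be reproduced in plain Python, without Spark. This is only a sketch: the literal lists stand in for the collected tmp_array, and because zip yields tuples (lazily, on Python 3), the elements come out as tuples rather than lists.

```python
# Stand-in for the result of collect(): one list per DataFrame row (assumed values).
tmp_array = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]

# zip(*rows) groups the i-th elements of every row, i.e. it transposes the data.
# list() forces the lazy zip iterator so the result can be inspected.
samples = list(zip(*tmp_array))

print(samples)
```

Unpacking with zip(*tmp_array) avoids naming a0, a1, a2 by hand, so the same line would work for any number of rows.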
Question 1: Is that a good way to do it?
Question 2: How do I include that RDD back into the DataFrame?
Recommended answer
Here is a way to get your desired output without serializing to rdd or using a udf. You will need two constants:
- The number of rows in your DataFrame (df.count())
- The length of data (given)
Use pyspark.sql.functions.collect_list() and pyspark.sql.functions.array() in a double list comprehension to pick out the elements of "data" in the order you want using pyspark.sql.Column.getItem():
import pyspark.sql.functions as f

dataLength = 3
numRows = df.count()

df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            f.array(
                [f.collect_list("data").getItem(j).getItem(i) for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
)\
    .show(truncate=False)
#+---------+------------------------------------------------------------------------------+
#|id       |data                                                                          |
#+---------+------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+------------------------------------------------------------------------------+
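The index order in the double list comprehension can be checked with a plain-Python model, in which a nested list stands in for collect_list("data") and ordinary indexing stands in for getItem(). This is an illustrative sketch rather than Spark code, and the names collected, data_length, num_rows and transposed are hypothetical.

```python
# Hypothetical stand-in for f.collect_list("data"): one inner list per row.
collected = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]

data_length = 3            # length of each "data" array (given)
num_rows = len(collected)  # plays the role of df.count()

# The outer loop walks positions i inside the arrays and the inner loop walks
# rows j, mirroring getItem(j).getItem(i) in the Spark expression.
transposed = [
    [collected[j][i] for j in range(num_rows)]
    for i in range(data_length)
]

print(transposed)
```

Swapping the two loops would rebuild the original rows instead of transposing them, which is why the position index i has to be the outer one.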