pyspark: zip arrays in a dataframe
Question
I have the following PySpark DataFrame:
+------+----------------+
| id| data |
+------+----------------+
| 1| [10, 11, 12]|
| 2| [20, 21, 22]|
| 3| [30, 31, 32]|
+------+----------------+
In the end, I want to have the following DataFrame:
+--------+----------------------------------+
| id | data |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+
In order to do this, I first extract the data arrays as follows:
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)
In this way, I have in samples1 an RDD with the content:
[[10,20,30],[11,21,31],[12,22,32]]
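The zip step above can be checked with plain Python lists (values copied from the example DataFrame; no Spark needed):

```python
# Values copied from the example DataFrame's "data" column
a0 = [10, 11, 12]
a1 = [20, 21, 22]
a2 = [30, 31, 32]

# zip pairs up the i-th element of each array, i.e. a transpose
samples = [list(t) for t in zip(a0, a1, a2)]
print(samples)  # [[10, 20, 30], [11, 21, 31], [12, 22, 32]]
```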
Question 1: Is that a good way to do it?
Question 2: How can I include that RDD back into the DataFrame?
Answer
Here is a way to get your desired output without serializing to rdd or using a udf. You will need two constants:
- the number of rows in the DataFrame (df.count())
- the length of data (given)

Use pyspark.sql.functions.collect_list() and pyspark.sql.Column.getItem():

import pyspark.sql.functions as f

dataLength = 3
numRows = df.count()

df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            f.array(
                [f.collect_list("data").getItem(j).getItem(i) for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
)\
.show(truncate=False)
#+---------+-------------------------------------------------------------------------------+
#|id       |data                                                                           |
#+---------+-------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+-------------------------------------------------------------------------------+
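The nested collect_list/getItem construction is just a transpose of the data column. As a sanity check, the same indexing logic can be run in plain Python, using a list as a stand-in for the collected column (no Spark required):

```python
# Stand-in for collect_list("data"): one inner list per row, in row order
data = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
numRows, dataLength = len(data), len(data[0])

# Mirrors the nested comprehensions above: outer loop over positions i,
# inner loop over rows j, selecting data[j][i]
result = [[data[j][i] for j in range(numRows)] for i in range(dataLength)]
print(result)  # [[10, 20, 30], [11, 21, 31], [12, 22, 32]]
```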