pyspark: zip arrays in a DataFrame

Problem description

I have the following PySpark DataFrame:

+------+----------------+
|    id|            data|
+------+----------------+
|     1|    [10, 11, 12]|
|     2|    [20, 21, 22]|
|     3|    [30, 31, 32]|
+------+----------------+

In the end, I want to have the following DataFrame:

+--------+----------------------------------+
|      id|                              data|
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+

To do this, I first extract the data arrays as follows:

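# df_test is the DataFrame shown above; sc is the active SparkContext.
# Collect the "data" column to the driver as a list of Python lists.
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
# Pair up the i-th elements of the three arrays (zip yields tuples).
samples = list(zip(a0, a1, a2))
# Distribute the zipped result again as an RDD.
samples1 = sc.parallelize(samples)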

This way, samples1 is an RDD with the following content:

[[10,20,30],[11,21,31],[12,22,32]]
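
The zip step above can be checked without Spark. The following is a minimal plain-Python sketch, assuming the three arrays have already been collected to the driver; the names rows and samples_as_lists are illustrative and not part of the original code. It shows that zip transposes the arrays and that the elements come back as tuples, so an extra conversion is needed if nested lists are required.

```python
# Plain-Python sketch of the zip step; no Spark session is needed here.
# The rows mirror the "data" column of the example DataFrame.
rows = [
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
]

# zip(*rows) pairs up the i-th element of every row, i.e. it transposes
# the list of arrays. Unpacking with * avoids hard-coding a0, a1, a2.
samples = list(zip(*rows))

# zip yields tuples, so convert to lists if nested lists are required.
samples_as_lists = [list(t) for t in samples]

print(samples)
print(samples_as_lists)
```

Unpacking with zip(*rows) works for any number of rows, whereas the version with a0, a1 and a2 is tied to exactly three.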
