pyspark: zip arrays in a dataframe
Question
I have the following PySpark DataFrame:
+------+----------------+
| id| data |
+------+----------------+
| 1| [10, 11, 12]|
| 2| [20, 21, 22]|
| 3| [30, 31, 32]|
+------+----------------+
In the end, I want to have the following DataFrame:
+--------+----------------------------------+
| id | data |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+
In order to do this, I first extract the data arrays as follows:
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)
In this way, I have in samples1 an RDD with the content:
[[10,20,30],[11,21,31],[12,22,32]]
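The zip step above can be checked with plain Python lists (values copied from the example DataFrame; no Spark needed):

```python
# Values copied from the example DataFrame's "data" column
a0 = [10, 11, 12]
a1 = [20, 21, 22]
a2 = [30, 31, 32]

# zip pairs up the i-th element of each array, i.e. a transpose
samples = [list(t) for t in zip(a0, a1, a2)]
print(samples)  # [[10, 20, 30], [11, 21, 31], [12, 22, 32]]
```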
Question 1: Is that a good way to do it?
Question 2: How can I include that RDD back into the DataFrame?
Answer
Here is a way to get your desired output without serializing to rdd or using a udf. You will need two constants:
- the number of rows in the DataFrame (df.count())
- the length of data (given)

Use pyspark.sql.functions.collect_list() and pyspark.sql.Column.getItem():

import pyspark.sql.functions as f

dataLength = 3
numRows = df.count()

df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            f.array(
                [f.collect_list("data").getItem(j).getItem(i) for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
)\
.show(truncate=False)
#+---------+-------------------------------------------------------------------------------+
#|id       |data                                                                           |
#+---------+-------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+-------------------------------------------------------------------------------+
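The nested collect_list/getItem construction is just a transpose of the data column. As a sanity check, the same indexing logic can be run in plain Python, using a list as a stand-in for the collected column (no Spark required):

```python
# Stand-in for collect_list("data"): one inner list per row, in row order
data = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
numRows, dataLength = len(data), len(data[0])

# Mirrors the nested comprehensions above: outer loop over positions i,
# inner loop over rows j, selecting data[j][i]
result = [[data[j][i] for j in range(numRows)] for i in range(dataLength)]
print(result)  # [[10, 20, 30], [11, 21, 31], [12, 22, 32]]
```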