pyspark: zip arrays in a dataframe


Question

I have the following PySpark DataFrame:

+------+----------------+
|    id|          data  |
+------+----------------+
|     1|    [10, 11, 12]|
|     2|    [20, 21, 22]|
|     3|    [30, 31, 32]|
+------+----------------+

In the end, I want to have the following DataFrame:

+--------+----------------------------------+
|    id  |          data                    |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+

To do this, I first extract the data arrays as follows:

tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)

This way, samples1 is an RDD with the following content:

[[10,20,30],[11,21,31],[12,22,32]]
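
For reference, the zip step above amounts to transposing the collected rows. The following is a minimal plain-Python sketch of that step only, with no Spark involved; the name `rows` and its literal values are assumptions copied from the sample table, standing in for `tmp_array`.

```python
# Sample values copied from the "data" column in the question
# (a stand-in for tmp_array, which would come from collect()).
rows = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]

# zip(*rows) unpacks every row as a separate argument, so it handles
# any number of rows without naming a0, a1, a2 by hand.
samples = [list(column) for column in zip(*rows)]

print(samples)  # [[10, 20, 30], [11, 21, 31], [12, 22, 32]]
```

Note that `zip` yields tuples, so without the `list(column)` conversion each inner item would be a tuple such as `(10, 20, 30)` rather than a list.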

  • Question 1: Is that a good way to do it?
  • Question 2: How do I include that RDD back into the DataFrame?

Recommended answer

Here is a way to get your desired output without serializing to an RDD or using a UDF. You will need two constants:

  • The number of rows in the DataFrame (df.count())
  • The length of each data array (given)

Use pyspark.sql.functions.collect_list() and pyspark.sql.functions.array() in a double list comprehension to pick out the elements of "data" in the order you want, using pyspark.sql.Column.getItem():

import pyspark.sql.functions as f

dataLength = 3        # length of each "data" array (given)
numRows = df.count()  # number of rows in the DataFrame

df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            # inner array i: element i of every row's "data", in row order
            f.array(
                [f.collect_list("data").getItem(j).getItem(i)
                 for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
).show(truncate=False)
#+---------+------------------------------------------------------------------------------+
#|id       |data                                                                          |
#+---------+------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+------------------------------------------------------------------------------+
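
To see which elements the nested comprehension above picks out, the same indexing pattern can be sketched in plain Python. This is an illustration only, not Spark code: `collected` is a hypothetical stand-in for the result of collect_list("data"), filled with the sample values from the question, and `num_rows` and `data_length` play the roles of numRows and dataLength.

```python
# Hypothetical stand-in for f.collect_list("data"): one inner list per row.
collected = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]

num_rows = len(collected)        # plays the role of numRows = df.count()
data_length = len(collected[0])  # plays the role of dataLength = 3

# The outer loop walks positions inside an array (i); the inner loop walks
# rows (j). This mirrors collect_list("data").getItem(j).getItem(i).
zipped = [
    [collected[j][i] for j in range(num_rows)]
    for i in range(data_length)
]

print(zipped)  # [[10, 20, 30], [11, 21, 31], [12, 22, 32]]
```

The sketch assumes the rows arrive in id order. In Spark, the order of the values returned by collect_list() is not guaranteed in general, so the real query may need an explicit ordering step if row order matters.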
      
