get first N elements from dataframe ArrayType column in pyspark
Question
I have a Spark DataFrame with rows like:
1 | [a, b, c]
2 | [d, e, f]
3 | [g, h, i]
Now I want to keep only the first 2 elements from the array column:
1 | [a, b]
2 | [d, e]
3 | [g, h]
How can this be achieved?
Note - Remember that I am not extracting a single array element here, but a part of the array which may contain multiple elements.
Answer
Here's how to do it with the API functions.
Suppose your DataFrame were the following:
df.show()
#+---+---------+
#| id| letters|
#+---+---------+
#| 1|[a, b, c]|
#| 2|[d, e, f]|
#| 3|[g, h, i]|
#+---+---------+
df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- letters: array (nullable = true)
# | |-- element: string (containsNull = true)
You can use square brackets to access elements in the letters column by index, and wrap that in a call to pyspark.sql.functions.array() to create a new ArrayType column.
import pyspark.sql.functions as f
df.withColumn("first_two", f.array([f.col("letters")[0], f.col("letters")[1]])).show()
#+---+---------+---------+
#| id| letters|first_two|
#+---+---------+---------+
#| 1|[a, b, c]| [a, b]|
#| 2|[d, e, f]| [d, e]|
#| 3|[g, h, i]| [g, h]|
#+---+---------+---------+
Or if you have too many indices to list out, you can use a list comprehension:
df.withColumn("first_two", f.array([f.col("letters")[i] for i in range(2)])).show()
#+---+---------+---------+
#| id| letters|first_two|
#+---+---------+---------+
#| 1|[a, b, c]| [a, b]|
#| 2|[d, e, f]| [d, e]|
#| 3|[g, h, i]| [g, h]|
#+---+---------+---------+
For pyspark 2.4+, you can also use pyspark.sql.functions.slice():
df.withColumn("first_two", f.slice("letters", start=1, length=2)).show()
#+---+---------+---------+
#| id| letters|first_two|
#+---+---------+---------+
#| 1|[a, b, c]| [a, b]|
#| 2|[d, e, f]| [d, e]|
#| 3|[g, h, i]| [g, h]|
#+---+---------+---------+
slice may have better performance for large arrays (note that the start index is 1, not 0).
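As a sanity check on that indexing convention, Spark's 1-based slice maps onto Python's 0-based list slicing. A small sketch of the correspondence, using plain Python lists standing in for the example rows (not Spark code):

```python
# Hypothetical stand-in for the example data: id -> letters.
rows = {1: ["a", "b", "c"], 2: ["d", "e", "f"], 3: ["g", "h", "i"]}

# Spark's slice(letters, start=1, length=2) is 1-based; the plain-Python
# equivalent is letters[start-1 : start-1+length], i.e. letters[0:2] here.
first_two = {k: v[0:2] for k, v in rows.items()}
print(first_two)  # {1: ['a', 'b'], 2: ['d', 'e'], 3: ['g', 'h']}
```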