get first N elements from dataframe ArrayType column in pyspark
Question
I have a Spark DataFrame with rows like:
1 | [a, b, c]
2 | [d, e, f]
3 | [g, h, i]
Now I want to keep only the first 2 elements from the array column:
1 | [a, b]
2 | [d, e]
3 | [g, h]
How can I achieve this?
Note - remember that I am not extracting a single array element here, but a part of the array, which may contain multiple elements.
Answer
Here's how to do it with the API functions.
Suppose your DataFrame is the following:
df.show()
#+---+---------+
#| id| letters|
#+---+---------+
#| 1|[a, b, c]|
#| 2|[d, e, f]|
#| 3|[g, h, i]|
#+---+---------+
df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- letters: array (nullable = true)
# | |-- element: string (containsNull = true)
You can use square brackets to access elements in the letters column by index, and wrap that in a call to pyspark.sql.functions.array() to create a new ArrayType column.
import pyspark.sql.functions as f
df.withColumn("first_two", f.array([f.col("letters")[0], f.col("letters")[1]])).show()
#+---+---------+---------+
#| id| letters|first_two|
#+---+---------+---------+
#| 1|[a, b, c]| [a, b]|
#| 2|[d, e, f]| [d, e]|
#| 3|[g, h, i]| [g, h]|
#+---+---------+---------+
Or, if you have too many indices to list out, you can use a list comprehension:
df.withColumn("first_two", f.array([f.col("letters")[i] for i in range(2)])).show()
#+---+---------+---------+
#| id| letters|first_two|
#+---+---------+---------+
#| 1|[a, b, c]| [a, b]|
#| 2|[d, e, f]| [d, e]|
#| 3|[g, h, i]| [g, h]|
#+---+---------+---------+
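To make the per-row logic of the list-comprehension version concrete, here is a plain-Python sketch (the helper `first_n` is hypothetical, not a Spark function). One subtlety it mirrors: in Spark, indexing past the end of the array with `letters[i]` yields null rather than raising an error, which plain-Python indexing would do.

```python
def first_n(arr, n=2):
    # Mirrors f.array([f.col("letters")[i] for i in range(n)]):
    # Spark's Column[i] yields null (None here) past the end of the
    # array instead of raising an IndexError.
    return [arr[i] if i < len(arr) else None for i in range(n)]

rows = [(1, ["a", "b", "c"]), (2, ["d", "e", "f"]), (3, ["g", "h", "i"])]
result = [(rid, letters, first_n(letters)) for rid, letters in rows]
# result[0] is (1, ["a", "b", "c"], ["a", "b"])
```

So if any row's array were shorter than 2 elements, this approach would pad the result with nulls rather than fail.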
For pyspark versions 2.4+, you can also use pyspark.sql.functions.slice():
df.withColumn("first_two", f.slice("letters", start=1, length=2)).show()
#+---+---------+---------+
#| id| letters|first_two|
#+---+---------+---------+
#| 1|[a, b, c]| [a, b]|
#| 2|[d, e, f]| [d, e]|
#| 3|[g, h, i]| [g, h]|
#+---+---------+---------+
slice may have better performance for large arrays (note that the start index is 1, not 0).
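Spark's slice follows SQL semantics: the start index is 1-based, and a negative start counts backwards from the end of the array (per the Spark SQL docs, so slice(letters, -2, 2) would return the last two elements). A plain-Python sketch of those semantics (the helper `sql_slice` is hypothetical, only meant to illustrate the indexing rules):

```python
def sql_slice(arr, start, length):
    # Emulates Spark SQL's slice(x, start, length): start is 1-based,
    # and a negative start counts from the end (-1 is the last element).
    if start == 0:
        raise ValueError("SQL slice is 1-based; start must not be 0")
    i = start - 1 if start > 0 else len(arr) + start
    return arr[i:i + length]

sql_slice(["a", "b", "c"], 1, 2)   # like f.slice("letters", 1, 2) -> ["a", "b"]
sql_slice(["a", "b", "c"], -2, 2)  # last two elements -> ["b", "c"]
```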