How to dynamically slice an Array column in Spark?


Problem description

Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column. I want to define that range dynamically per row, based on an integer column that holds the number of elements I want to pick from that column.

However, simply passing the column to the slice function fails; the function appears to expect integers for the start and length values. Is there a way of doing this without writing a UDF?

To visualize the problem with an example: I have a dataframe with an array column arr that holds in each row an array that looks like ['a', 'b', 'c']. There is also an end_idx column containing the elements 3, 1 and 2:

+---------+-------+
|arr      |end_idx|
+---------+-------+
|[a, b, c]|3      |
|[a, b, c]|1      |
|[a, b, c]|2      |
+---------+-------+

I try to create a new column arr_trimmed like this:

import pyspark.sql.functions as F

l = [(['a', 'b', 'c'], 3), (['a', 'b', 'c'], 1), (['a', 'b', 'c'], 2)]
df = spark.createDataFrame(l, ["arr", "end_idx"])

# Fails with TypeError: Column is not iterable
df = df.withColumn("arr_trimmed", F.slice(F.col("arr"), 1, F.col("end_idx")))

I expect this code to create the new column with elements ['a', 'b', 'c'], ['a'], ['a', 'b'].

Instead I get the error TypeError: Column is not iterable.
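For reference, SQL slice uses a 1-based start index and a length, so slice(arr, 1, end_idx) keeps the first end_idx elements. The intended semantics can be sketched in plain Python with an illustrative helper (my own, not a Spark API):

```python
def sql_slice(arr, start, length):
    """Mimic SQL slice(): 1-based start; a negative start counts from the end."""
    if start == 0:
        raise ValueError("SQL slice() start index must be non-zero")
    begin = start - 1 if start > 0 else len(arr) + start
    return arr[begin:begin + length]

rows = [(['a', 'b', 'c'], 3), (['a', 'b', 'c'], 1), (['a', 'b', 'c'], 2)]
print([sql_slice(arr, 1, n) for arr, n in rows])
# → [['a', 'b', 'c'], ['a'], ['a', 'b']]
```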

Answer

You can do it by passing a SQL expression as follows:

df.withColumn("arr_trimmed", F.expr("slice(arr, 1, end_idx)"))

Here is the full working example:

import pyspark.sql.functions as F

l = [(['a', 'b', 'c'], 3), (['a', 'b', 'c'], 1), (['a', 'b', 'c'], 2)]

df = spark.createDataFrame(l, ["arr", "end_idx"])

df.withColumn("arr_trimmed", F.expr("slice(arr, 1, end_idx)")).show(truncate=False)

+---------+-------+-----------+
|arr      |end_idx|arr_trimmed|
+---------+-------+-----------+
|[a, b, c]|3      |[a, b, c]  |
|[a, b, c]|1      |[a]        |
|[a, b, c]|2      |[a, b]     |
+---------+-------+-----------+
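As a quick sanity check outside Spark: since the start index is fixed at 1, slice(arr, 1, end_idx) trims each row to its first end_idx elements, which plain Python list slicing reproduces:

```python
rows = [(['a', 'b', 'c'], 3), (['a', 'b', 'c'], 1), (['a', 'b', 'c'], 2)]
arr_trimmed = [arr[:end_idx] for arr, end_idx in rows]
print(arr_trimmed)
# → [['a', 'b', 'c'], ['a'], ['a', 'b']]
```

(If I recall the API correctly, newer PySpark releases also let F.slice take Column arguments directly, e.g. F.slice(F.col("arr"), F.lit(1), F.col("end_idx")); the F.expr form above works from Spark 2.4 onward.)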

