How to dynamically slice an Array column in Spark?


Problem description

Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column. I want to define that range dynamically per row, based on an integer column that holds the number of elements I want to pick from the array.
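
For reference, with literal arguments the call looks like this (a minimal sketch; df and arr are simply the example names used further down):

import pyspark.sql.functions as F

# slice(column, start, length): take 2 elements starting at position 1 (indices are 1-based)
df.withColumn("first_two", F.slice(F.col("arr"), 1, 2))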

However, simply passing the column to the slice function fails; the function appears to expect plain integers for the start and length values. Is there a way of doing this without writing a UDF?

To visualize the problem with an example: I have a dataframe with an array column arr that holds the array ['a', 'b', 'c'] in each row. There is also an end_idx column containing the values 3, 1 and 2:

+---------+-------+
|arr      |end_idx|
+---------+-------+
|[a, b, c]|3      |
|[a, b, c]|1      |
|[a, b, c]|2      |
+---------+-------+

I try to create a new column arr_trimmed like this:

import pyspark.sql.functions as F

l = [(['a', 'b', 'c'], 3), (['a', 'b', 'c'], 1), (['a', 'b', 'c'], 2)]
df = spark.createDataFrame(l, ["arr", "end_idx"])

df = df.withColumn("arr_trimmed", F.slice(F.col("arr"), 1, F.col("end_idx")))
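# Fails in Spark 2.4 with "TypeError: Column is not iterable":
# here F.slice only takes plain Python ints for start and length.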

I expect this code to create the new column with the elements ['a', 'b', 'c'], ['a'], ['a', 'b'].

Instead, I get the error TypeError: Column is not iterable.

Recommended answer

You can do it by passing a SQL expression as follows:

df.withColumn("arr_trimmed", F.expr("slice(arr, 1, end_idx)"))
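
The same SQL fragment also works through selectExpr or a plain SQL query; a sketch of both equivalent variants (the view name df_view is just an illustrative choice):

# Equivalent: selectExpr accepts the same SQL fragment
df.selectExpr("arr", "end_idx", "slice(arr, 1, end_idx) AS arr_trimmed")

# Or register a temporary view and query it with spark.sql
df.createOrReplaceTempView("df_view")
spark.sql("SELECT arr, end_idx, slice(arr, 1, end_idx) AS arr_trimmed FROM df_view")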

Here is the complete working example:

import pyspark.sql.functions as F

l = [(['a', 'b', 'c'], 3), (['a', 'b', 'c'], 1), (['a', 'b', 'c'], 2)]

df = spark.createDataFrame(l, ["arr", "end_idx"])

df.withColumn("arr_trimmed", F.expr("slice(arr, 1, end_idx)")).show(truncate=False)

+---------+-------+-----------+
|arr      |end_idx|arr_trimmed|
+---------+-------+-----------+
|[a, b, c]|3      |[a, b, c]  |
|[a, b, c]|1      |[a]        |
|[a, b, c]|2      |[a, b]     |
+---------+-------+-----------+
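
As a side note, newer Spark releases (3.1 or later, to the best of my knowledge; treat the exact version boundary as an assumption) extended F.slice so that start and length can themselves be Column expressions, which makes the original attempt work almost unchanged:

# Spark 3.1+ (version boundary is an assumption): slice accepts Columns directly
df.withColumn("arr_trimmed", F.slice(F.col("arr"), F.lit(1), F.col("end_idx")))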

