How to pull the slice of an array in Spark SQL (Dataframes)?
Question
I have a column full of arrays containing split http requests. I have them filtered down to one of two possibilities:
|[, courses, 27381...|
|[, courses, 27547...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, courses, 33287...|
|[, courses, 24024...|
In both array-types, from 'courses' onward is the same data and structure.
I want to take the slice of the array using a case statement: if the first element of the array is 'api', then take elements 3 -> end of the array. I've tried using Python slice syntax [3:], and normal PostgreSQL syntax [3, n] where n is the length of the array. If it's not 'api', then just take the given value.
My ideal end-result would be an array where every row shares the same structure, with courses in the first index for easier parsing from that point onwards.
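For reference, the Python-style slice described above does work on ordinary Python lists; it is only DataFrame Column objects that don't support it directly. A quick sanity check of the intended per-row logic, on hypothetical sample rows mirroring the two array shapes shown (element values beyond the prefixes are made up):

```python
# Hypothetical rows: the leading "" comes from splitting a path on "/".
api_row = ["", "api", "v1", "courses", "27381"]
courses_row = ["", "courses", "27381"]

# Desired logic: when 'api' appears near the front, drop the first three
# elements so 'courses' ends up at the same position in every row.
normalized = api_row[3:] if api_row[1] == "api" else api_row
print(normalized)  # ['courses', '27381']
```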
Answer
It's very easy: just define a UDF. You asked a very similar question previously, so I won't post the exact answer, to let you think and learn (for your own good).
from pyspark.sql.functions import udf, col, lit

df = sc.parallelize([(["ab", "bs", "xd"],), (["bc", "cd", ":x"],)]).toDF()

# Drop the first element of the array when the element at position y is "ab".
getUDF = udf(lambda x, y: x[1:] if x[y] == "ab" else x)

df.select(getUDF(col("_1"), lit(0))).show()
+------------------------+
|PythonUDF#<lambda>(_1,0)|
+------------------------+
| [bs, xd]|
| [bc, cd, :x]|
+------------------------+
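Since a Python UDF is just an ordinary function applied row by row, its lambda can be sanity-checked as a plain function before registering it, with no SparkSession needed (the function name here is illustrative):

```python
# Same logic as the UDF's lambda, extracted so it can be tested directly.
def get_slice(x, y):
    """Drop the first element of x when x[y] == "ab", else return x as-is."""
    return x[1:] if x[y] == "ab" else x

print(get_slice(["ab", "bs", "xd"], 0))  # ['bs', 'xd']
print(get_slice(["bc", "cd", ":x"], 0))  # ['bc', 'cd', ':x']
```

Note that udf() without an explicit returnType defaults to StringType; passing ArrayType(StringType()) keeps the result a true array column if you need to keep parsing it afterwards.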