Parse through each element of an array in PySpark and apply substring

Views: 14  Published: 2022/4/8 13:21:20  Tags: pyspark user-defined-functions


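Question

Hi, I have a PySpark dataframe with an array column, as shown below.

I want to iterate over each element, extract only the string before the hyphen, and put the result in a new column.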

+------------------------------+
|array_col                     |
+------------------------------+
|[hello-123, abc-111]          |
|[hello-234, def-22, xyz-33]   |
|[hiiii-111, def2-333, lmn-222]|
+------------------------------+

Desired output:

+------------------------------+--------------------+
|col1                          |new_column          |
+------------------------------+--------------------+
|[hello-123, abc-111]          |[hello, abc]        |
|[hello-234, def-22, xyz-33]   |[hello, def, xyz]   |
|[hiiii-111, def2-333, lmn-222]|[hiiii, def2, lmn]  |
+------------------------------+--------------------+

I am trying something like the following, but I cannot work out how to apply a regex or substring inside the UDF.

cust_udf = udf(lambda arr: [x for x in arr],ArrayType(StringType()))
df1.withColumn('new_column', cust_udf(col("col1")))

Can anyone help? Thanks.

Recommended answer

From Spark 2.4 onward, use the transform higher-order function.

Example:

df.show(10,False)
#+---------------------------+
#|array_col                  |
#+---------------------------+
#|[hello-123, abc-111]       |
#|[hello-234, def-22, xyz-33]|
#+---------------------------+

df.printSchema()
#root
# |-- array_col: array (nullable = true)
# |    |-- element: string (containsNull = true)

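from pyspark.sql.functions import expr

# transform applies the lambda to every array element;
# split(x, "-")[0] keeps the text before the first hyphen
df.withColumn("new_column", expr('transform(array_col, x -> split(x, "-")[0])')).show()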
#+--------------------+-----------------+
#|           array_col|       new_column|
#+--------------------+-----------------+
#|[hello-123, abc-111]|     [hello, abc]|
#|[hello-234, def-2...|[hello, def, xyz]|
#+--------------------+-----------------+
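
If a Python UDF is still preferred, as in the question's attempt, the missing piece is a per-element str.split inside the list comprehension. The following is only a sketch under assumptions: the names prefixes_before_hyphen and with_prefix_column are hypothetical and do not come from the original answer, and the null handling is a guess at what the data may need.

```python
from typing import List, Optional


def prefixes_before_hyphen(
    values: Optional[List[Optional[str]]],
) -> Optional[List[Optional[str]]]:
    """Return the part of each string before its first hyphen.

    A None array and None elements pass through unchanged, because
    Spark array columns may contain nulls. (Hypothetical helper name.)
    """
    if values is None:
        return None
    return [v.split("-", 1)[0] if v is not None else None for v in values]


def with_prefix_column(df, source_col="array_col", target_col="new_column"):
    """Add target_col to df by running the helper above as a Python UDF.

    Hypothetical wrapper; the default column names follow the answer's example.
    """
    # PySpark is imported here so the helper above can be tried without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, StringType

    prefix_udf = udf(prefixes_before_hyphen, ArrayType(StringType()))
    return df.withColumn(target_col, prefix_udf(col(source_col)))
```

Calling with_prefix_column(df).show(truncate=False) should give the same new_column as the transform expression. The built-in transform is usually the better choice where available, since a Python UDF has to move each row between the JVM and a Python worker.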
