Pyspark; UDF that checks if a column contains one of the elements of a list
Problem description
I have a DataFrame, and I want to check whether one of its columns contains at least one keyword from a list:
from pyspark.sql import types as T
import pyspark.sql.functions as fn
key_labels = ["COMMISSION", "COM", "PRET", "LOAN"]
def containsAny(string, array):
    if len(string) == 0:
        return False
    else:
        return (any(word in string for word in array))
contains_udf = fn.udf(containsAny, T.BooleanType())
df = spark.createDataFrame([("COMMISSION", "1"), ("CAMMISSION", "2")], ("original", "id"))
df.withColumn("keyword_match", contains_udf(fn.col("original"),key_labels)).show()
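When I run this code, I get the following error:
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col.
Trace: py4j.Py4JException:
Method col([class java.util.ArrayList]) does not exist
What am I doing wrong?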
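Recommended answer
Every argument passed to a UDF must be a column, and a plain Python list is not one. To make your function work, you should create an array column to compare against: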
df.select(fn.array([fn.lit(i) for i in key_labels])).show(truncate=False)
+----------------------------------+
|array(COMMISSION, COM, PRET, LOAN)|
+----------------------------------+
|[COMMISSION, COM, PRET, LOAN] |
|[COMMISSION, COM, PRET, LOAN] |
+----------------------------------+
So your code will look like this:
def containsAny(string, array):
    if len(string) == 0:
        return False
    else:
        return (any(word in string for word in array))
contains_udf = fn.udf(containsAny, T.BooleanType())
(df.withColumn("keyword_match", contains_udf(fn.col("original"),
fn.array([fn.lit(i) for i in key_labels])))).show()
Output:
+----------+---+-------------+
| original| id|keyword_match|
+----------+---+-------------+
|COMMISSION| 1| true|
|CAMMISSION| 2| false|
+----------+---+-------------+
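As a variation on the approach above, the keyword list can be captured in a closure instead of being passed as an array column on every row. The following is only a sketch: make_contains_any and add_keyword_match are hypothetical names that do not appear in the original answer, and the None check assumes that a null in the column reaches the Python function as None.

```python
from typing import Callable, Iterable, Optional


def make_contains_any(keywords: Iterable[str]) -> Callable[[Optional[str]], bool]:
    """Return a predicate that is True when its input contains any keyword."""
    words = tuple(keywords)  # copy once so the closure holds a fixed snapshot

    def contains_any(value: Optional[str]) -> bool:
        # Treat None (assumed to be how a null arrives) and "" as "no match".
        if not value:
            return False
        return any(word in value for word in words)

    return contains_any


def add_keyword_match(df, column: str, keywords: Iterable[str]):
    """Hypothetical helper: append a boolean keyword_match column to df."""
    # Imported here so the plain-Python predicate above can be exercised
    # without a Spark installation.
    import pyspark.sql.functions as fn
    from pyspark.sql import types as T

    contains_udf = fn.udf(make_contains_any(keywords), T.BooleanType())
    return df.withColumn("keyword_match", contains_udf(fn.col(column)))
```

With the DataFrame from the question, add_keyword_match(df, "original", key_labels).show() would be the equivalent call. The UDF then takes a single column argument, so no array column has to be built.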
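However, you can also use isin, which needs no UDF. Note that isin tests whether the whole value equals one of the list elements, not whether it contains one as a substring; for the sample data the result is the same: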
df.withColumn('keyword_match',df['original'].isin(key_labels)).show()
+----------+---+-------------+
| original| id|keyword_match|
+----------+---+-------------+
|COMMISSION| 1| true|
|CAMMISSION| 2| false|
+----------+---+-------------+
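If a substring match is needed without a UDF, one option is Column.rlike with an alternation pattern built from the keywords. This is a sketch, not part of the original answer: build_keyword_pattern and add_keyword_match_rlike are hypothetical names, and it assumes that the output of Python's re.escape is also valid in the Java regular expressions Spark uses, which should hold for ordinary keywords like the ones here.

```python
import re
from typing import Iterable


def build_keyword_pattern(keywords: Iterable[str]) -> str:
    """Join the keywords into one regex alternation, escaping metacharacters."""
    words = list(keywords)
    if not words:
        # An empty pattern would match every row, so reject it explicitly.
        raise ValueError("keywords must not be empty")
    return "|".join(re.escape(word) for word in words)


def add_keyword_match_rlike(df, column: str, keywords: Iterable[str]):
    """Hypothetical helper: append keyword_match using a regex, without a UDF."""
    # Imported here so the pattern builder can be used without Spark installed.
    import pyspark.sql.functions as fn

    pattern = build_keyword_pattern(keywords)
    return df.withColumn("keyword_match", fn.col(column).rlike(pattern))
```

For the key_labels list from the question the pattern is COMMISSION|COM|PRET|LOAN. Unlike isin, this matches the pattern anywhere inside the value rather than requiring the whole value to be equal.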