Filter pyspark dataframe if it contains a list of strings
Question
Suppose that we have a pyspark dataframe in which one of the columns (column_a) contains some string values, and there is also a list of strings (list_a).
Dataframe:
column_a | count
some_string | 10
another_one | 20
third_string | 30
list_a:
['string', 'third', ...]
I want to filter this dataframe and keep only the rows whose column_a value contains one of list_a's items.
This is the code that works to filter column_a based on a single string:
df['column_a'].like('%string_value%')
But how can we get the same result for a list of strings? (Keep the rows whose column_a value contains 'string', 'third', ...)
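To make the desired semantics concrete, here is a minimal plain-Python sketch (no Spark needed) of the filtering rule: keep a row when its column_a value contains any of the substrings in list_a. The `rows` data mirrors the example dataframe above.

```python
# Plain-Python sketch of the intended filter: keep a row when
# column_a contains ANY of the substrings in list_a.
rows = [
    ("some_string", 10),
    ("another_one", 20),
    ("third_string", 30),
]
list_a = ["string", "third"]

kept = [row for row in rows if any(pat in row[0] for pat in list_a)]
print(kept)  # [('some_string', 10), ('third_string', 30)]
```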
Answer
IIUC, you want to return the rows in which column_a is "like" (in the SQL sense) any of the values in list_a.
One way is to use functools.reduce:
from functools import reduce

list_a = ['string', 'third']

df1 = df.where(
    reduce(lambda a, b: a | b, (df['column_a'].like('%' + pat + '%') for pat in list_a))
)
df1.show()
#+------------+-----+
#| column_a|count|
#+------------+-----+
#| some_string| 10|
#|third_string| 30|
#+------------+-----+
Essentially you loop over all of the possible strings in list_a to compare against in like, and "OR" the results together. Here is the execution plan:
df1.explain()
#== Physical Plan ==
#*(1) Filter (Contains(column_a#0, string) || Contains(column_a#0, third))
#+- Scan ExistingRDD[column_a#0,count#1]
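The reduce pattern itself can be seen with plain booleans, no Spark required. This sketch OR-chains one membership test per pattern, just as the Column expressions are OR-ed together in the answer above (the `value` variable here is an illustrative stand-in for a single column_a cell):

```python
from functools import reduce

list_a = ["string", "third"]
value = "third_string"  # one cell of column_a, for illustration

# Each pattern yields one boolean; reduce folds them together with "|",
# mirroring how the Column expressions are OR-ed in the Spark version.
conditions = (pat in value for pat in list_a)
matches = reduce(lambda a, b: a | b, conditions)
print(matches)  # True
```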
Another option is to use pyspark.sql.Column.rlike instead of like:
df2 = df.where(
    df['column_a'].rlike("|".join("(" + pat + ")" for pat in list_a))
)
df2.show()
#+------------+-----+
#| column_a|count|
#+------------+-----+
#| some_string| 10|
#|third_string| 30|
#+------------+-----+
This has the corresponding execution plan:
df2.explain()
#== Physical Plan ==
#*(1) Filter (isnotnull(column_a#0) && column_a#0 RLIKE (string)|(third))
#+- Scan ExistingRDD[column_a#0,count#1]
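One caveat with the rlike approach, not covered in the answer above: the patterns are interpreted as regular expressions, so any that contain metacharacters ("." , "+", "(", ...) should be escaped first if you want literal substring matching. A small sketch of building the escaped pattern (the resulting string would be passed to df['column_a'].rlike):

```python
import re

# Escaping each pattern keeps rlike matching literal substrings even
# when a pattern contains regex metacharacters like "+".
list_a = ["string", "third+"]
pattern = "|".join("(" + re.escape(pat) + ")" for pat in list_a)
print(pattern)  # (string)|(third\+)
```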