Filter pyspark dataframe if contains a list of strings


Question

Suppose that we have a pyspark dataframe that one of its columns (column_a) contains some string values, and also there is a list of strings (list_a).

数据框:

column_a      | count
some_string   |  10
another_one   |  20
third_string  |  30

list_a:

['string', 'third', ...]

I want to filter this dataframe and only keep the rows if column_a's value contains one of list_a's items.

This is the code that works to filter the column_a based on a single string:

df['column_a'].like('%string_value%')

But how can we get the same result for a list of strings? (Keep the rows where column_a's value contains 'string', 'third', ...)

Answer

IIUC, you want to return the rows in which column_a is "like" (in the SQL sense) any of the values in list_a.

One way is to use functools.reduce:

from functools import reduce

list_a = ['string', 'third']

df1 = df.where(
    reduce(lambda a, b: a|b, (df['column_a'].like('%'+pat+"%") for pat in list_a))
)
df1.show()
#+------------+-----+
#|    column_a|count|
#+------------+-----+
#| some_string|   10|
#|third_string|   30|
#+------------+-----+

Essentially you loop over all of the possible strings in list_a to compare in like and "OR" the results. Here is the execution plan:

df1.explain()
#== Physical Plan ==
#*(1) Filter (Contains(column_a#0, string) || Contains(column_a#0, third))
#+- Scan ExistingRDD[column_a#0,count#1]
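The OR-chaining that reduce performs can be sketched with plain Python values (a toy example independent of Spark, using Python's `in` operator in place of like; the sample values are taken from the dataframe above):

from functools import reduce

list_a = ['string', 'third']
column_a_values = ['some_string', 'another_one', 'third_string']

# For each row value, OR together one containment test per pattern,
# mirroring how the Column expressions are combined with a|b above.
kept = [
    value for value in column_a_values
    if reduce(lambda a, b: a | b, (pat in value for pat in list_a))
]
# kept == ['some_string', 'third_string']

The key point is that reduce folds the per-pattern conditions pairwise into a single expression; with Spark Columns, `|` builds one combined filter predicate rather than evaluating booleans eagerly.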


Another option is to use pyspark.sql.Column.rlike instead of like.

df2 = df.where(
    df['column_a'].rlike("|".join(["(" + pat + ")" for pat in list_a]))
)

df2.show()
#+------------+-----+
#|    column_a|count|
#+------------+-----+
#| some_string|   10|
#|third_string|   30|
#+------------+-----+

Which has the corresponding execution plan:

df2.explain()
#== Physical Plan ==
#*(1) Filter (isnotnull(column_a#0) && column_a#0 RLIKE (string)|(third))
#+- Scan ExistingRDD[column_a#0,count#1]
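One caveat worth noting (my addition, not part of the original answer): rlike treats each pattern as a regular expression, so if list_a might contain regex metacharacters, escaping them with re.escape keeps the match literal. The pattern construction can be checked locally in plain Python (rlike uses Java regex, which agrees with Python's re for simple patterns like these):

import re

list_a = ['string', 'third']

# Escape each pattern so metacharacters match literally,
# then OR them together into one unanchored regex.
pattern = "|".join("(" + re.escape(pat) + ")" for pat in list_a)
# pattern == '(string)|(third)'

# Verify the pattern against the sample values with re.search,
# which, like rlike, matches anywhere in the string.
matches = [s for s in ['some_string', 'another_one', 'third_string']
           if re.search(pattern, s)]
# matches == ['some_string', 'third_string']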

