Pyspark: filter dataframe by regex with string formatting?

Views: 21 · Posted: 2021/11/14 22:16:25 · Tags: regex pyspark apache-spark-sql spark-dataframe pyspark-sql


Question

I've read several posts on using the "like" operator to filter a Spark dataframe on whether a column contains a string or expression, but I was wondering whether the following is a "best practice" for using %s in the desired condition:

input_path = "<s3_location_str>"  # placeholder for the S3 location
my_keyword = "Arizona.*hot"  # a regex expression
dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

# is the following correct?
substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
dk = dx.filter("keyword like %s" % substr)

# dk should contain rows with keyword values such as "Arizona is hot."

Note

I'm trying to get all rows in dx whose keyword contains the expression my_keyword. For exact matches, the surrounding percent signs '%' would not be needed.

Recommended answer

From neeraj's hint, it seems the correct way to do this in pyspark is:

expr = "Arizona.*hot"
dk = dx.filter(dx["keyword"].rlike(expr))
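For context on why the "like" attempt in the question does not return "Arizona is hot.": in SQL LIKE only % and _ act as wildcards, so the ".*" in the pattern is compared literally, whereas rlike treats the pattern as a regular expression that may match anywhere in the value. The following is a rough pure-Python sketch of those semantics, not Spark's implementation; the helpers like_to_regex, sql_like and sql_rlike are hypothetical names introduced here, and LIKE escape characters are ignored.

```python
import re


def like_to_regex(like_pattern: str) -> str:
    """Translate a SQL LIKE pattern into a Python regex (hypothetical helper).

    Assumption: no LIKE escape character is in use.
    """
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")  # % matches any run of characters
        elif ch == "_":
            parts.append(".")  # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def sql_like(value: str, like_pattern: str) -> bool:
    # LIKE must match the whole value
    return re.fullmatch(like_to_regex(like_pattern), value, flags=re.DOTALL) is not None


def sql_rlike(value: str, regex: str) -> bool:
    # rlike succeeds if the regex matches anywhere in the value
    return re.search(regex, value) is not None


value = "Arizona is hot."
print(sql_like(value, "%Arizona.*hot%"))                 # False: ".*" is literal text here
print(sql_like("Arizona.*hot today", "%Arizona.*hot%"))  # True: the literal text is present
print(sql_rlike(value, "Arizona.*hot"))                  # True: ".*" is a regex wildcard
```

Spark evaluates rlike with Java regular expressions, so patterns that rely on Java-only syntax may behave differently from Python's re module; for a simple pattern like the one above the two agree.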

Note that dx.filter($"keyword" ...) did not work, since my version of pyspark did not seem to support the $ nomenclature out of the box.
