PySpark: filter dataframe by regex with string formatting?
Problem description
I've read several posts on using the "like" operator to filter a Spark dataframe on whether a column contains a string/expression, but I was wondering whether the following is a "best practice" for using %s to build the desired condition:
input_path = <s3_location_str>
my_keyword = "Arizona.*hot"  # a regex expression
dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx
# is the following correct?
substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
dk = dx.filter("keyword like %s" % substr)
# dk should contain rows with keyword values such as "Arizona is hot."
Note
I'm trying to get all rows in dx that contain the expression my_keyword. For exact matches, by contrast, we wouldn't need the surrounding percent signs '%'.
Recommended answer
From neeraj's hint, it seems the correct way to do this in pyspark is:
expr = "Arizona.*hot"
dk = dx.filter(dx["keyword"].rlike(expr))
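As background on why rlike is needed here: the %-formatting in the question does build a valid condition string, but "like" only understands the wildcards % and _, so the ".*" in the keyword is compared as literal characters. The sketch below uses only the Python standard library to illustrate the difference. The helper like_to_regex is a hypothetical name written for this illustration (it ignores LIKE escape characters) and is not part of PySpark; re.search stands in for rlike, which also matches anywhere in the value.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into an anchored regex (illustration only)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is taken literally
    return "^" + "".join(parts) + "$"

my_keyword = "Arizona.*hot"

# The string formatting from the question: %% collapses to a single %.
condition = "keyword like '%%%s%%'" % my_keyword

value = "Arizona is hot."

# LIKE semantics: ".*" must appear literally in the value.
like_pattern = "%" + my_keyword + "%"
like_matches = re.search(like_to_regex(like_pattern), value) is not None

# Regex semantics (what rlike applies): ".*" means "any characters".
rlike_matches = re.search(my_keyword, value) is not None
```

Under these assumptions, the formatted "like" condition would not select "Arizona is hot.", while the regex match does, which is consistent with the answer's move to rlike.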
Note that dx.filter($"keyword" ...) did not work, since my version of pyspark didn't seem to support the $ nomenclature out of the box; the $"..." column shorthand is Scala syntax.
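In PySpark, a column can also be referenced by name through pyspark.sql.functions.col, which plays a role similar to Scala's $"keyword". A minimal sketch, assuming PySpark is installed: filter_by_regex is a hypothetical helper name introduced here for illustration, and the import sits inside the function only so that the definition itself does not require PySpark.

```python
def filter_by_regex(df, column, pattern):
    """Return the rows of df whose `column` contains a match for the regex `pattern`."""
    # Imported inside the function so the helper can be defined without PySpark present.
    from pyspark.sql.functions import col
    # col(column) builds a Column by name; rlike applies a regex match to it.
    return df.filter(col(column).rlike(pattern))
```

With the dataframe from the question, this would be called as dk = filter_by_regex(dx, "keyword", "Arizona.*hot"), which should behave the same as dx.filter(dx["keyword"].rlike(expr)) above.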