Extracting several regex matches in PySpark


Question

I'm currently working on a regex that I want to run over a column of a PySpark DataFrame.

This regex is built to capture only one group, but it can return several matches. The problem I encounter is that PySpark's native regex functions (regexp_extract and regexp_replace) seem to allow only the manipulation of capturing groups (through the $ operand), not of the individual matches.

Is there a way to natively (with a PySpark function, not a udf based on Python's re.findall) fetch the list of substrings matched by my regex? I am not talking about the groups contained in the first match.

I would like to do something like this:

from pyspark.sql.functions import regexp_replace

my_regex = r'(\w+)'
# Wished-for behaviour (hypothetical replacement syntax, not valid Spark): address whole matches, not just capturing groups
df = df.withColumn('col_name', regexp_replace('col_name', my_regex, '$1[0] - $2[0]'))

Here $1 would represent the first match as an array, $2 the second match, and so on.

You can try the regex on the following input to see an example of the matches I wish to fetch.

2 AVENUE DES LAPINOUS

It should return 4 different matches, each containing 1 group.
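To make the expected result concrete, here is a minimal sketch in plain Python (not Spark) that runs the same pattern over the sample input. It only illustrates the desired output; the variable names are mine, and Python's re flavour is not guaranteed to behave identically to Spark's Java regex on every pattern.

```python
import re

my_regex = r"(\w+)"
sample = "2 AVENUE DES LAPINOUS"

# finditer yields one match object per match; group(1) is that match's
# single capturing group.
matches = [m.group(1) for m in re.finditer(my_regex, sample)]
print(matches)  # ['2', 'AVENUE', 'DES', 'LAPINOUS']
```

The list of four strings is what I would like to end up with in a DataFrame column.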

Recommended answer

Unfortunately, there is no built-in way to get all the matches in Spark. regexp_extract only looks at the first match; its idx argument selects which capturing group of that match is returned:

from pyspark.sql import functions as func
func.regexp_extract('col', my_regex, idx=1)
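As an analogy only, the sketch below uses Python's re module (not Spark) to show what a group index means: it picks a group inside the first match and never reaches later matches. The two-group pattern is a hypothetical one chosen for illustration; it is not the pattern from the question.

```python
import re

sample = "2 AVENUE DES LAPINOUS"

# re.search stops at the first match, much like regexp_extract does.
first = re.search(r"(\w+) (\w+)", sample)

whole_match = first.group(0)  # comparable to idx=0: the entire first match
group_one = first.group(1)    # comparable to idx=1: first capturing group
group_two = first.group(2)    # comparable to idx=2: second capturing group

print(whole_match, "|", group_one, "|", group_two)  # 2 AVENUE | 2 | AVENUE
```

"DES" and "LAPINOUS" belong to a second match, so no group index can reach them.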

There is an unmerged pull request that would add this capability.

TL;DR: As of now, you will need to write a UDF for this.
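A minimal sketch of such a UDF follows, assuming a string column called col_name as in the question. The helper names (find_all_matches, with_matches_column) and the target column name matches are hypothetical, and the pattern is evaluated by Python's re module rather than by Spark's Java regex engine.

```python
import re
from typing import List, Optional

# Pattern from the question, compiled once so it is not rebuilt for every row.
MY_REGEX = re.compile(r"(\w+)")


def find_all_matches(text: Optional[str]) -> List[str]:
    """Return every substring matched by MY_REGEX (empty list for NULL input)."""
    if text is None:
        return []
    return MY_REGEX.findall(text)


def with_matches_column(df, source_col: str = "col_name", target_col: str = "matches"):
    """Add an array<string> column holding all matches found in `source_col`."""
    # Imported here so the pure-Python helper above can be tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    find_all_udf = F.udf(find_all_matches, ArrayType(StringType()))
    return df.withColumn(target_col, find_all_udf(F.col(source_col)))
```

Calling with_matches_column(df) should give each row an array such as ['2', 'AVENUE', 'DES', 'LAPINOUS'], which can then be indexed or joined with the usual array functions. Python UDFs carry serialization overhead, and later Spark releases added a built-in regexp_extract_all SQL function (Spark 3.1, as far as I know), so it is worth checking the documentation for your version before relying on a UDF.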
