I have an issue with regex extract with multiple matches


Problem description

I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ". This string is part of column X in a Spark DataFrame. My regex extracts both 60 ML and 0.5 ML in a regex validator, but I cannot get both with regexp_extract because it returns only the first match, so I get only 60 ML.

Can you suggest the best way of doing this with a UDF?

Recommended answer

Here is how you can do it with a Python UDF:

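import re

# Explicit imports avoid shadowing Python built-ins such as sum and max
from pyspark.sql.functions import regexp_replace, split, udf
from pyspark.sql.types import ArrayType, StringType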

data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')

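# Define the function that the UDF will wrap
def extract(s):
    # Spark passes None for null values, which re.findall cannot handle
    if s is None:
        return None
    # The dot is escaped so it matches only a literal decimal point
    all_matches = re.findall(r'\d+(?:\.\d+)? ML', s)
    return all_matches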

# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))

# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
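
The regex part of the UDF can be checked in plain Python, without a Spark session. The sketch below is only an illustration: it contrasts a first-match search (the behaviour the question describes for regexp_extract) with re.findall, and shows why the dot in the pattern should be escaped. The input '1x5 ML' is a made-up string chosen to expose the difference.

```python
import re

# The dot is escaped so that it matches only a literal decimal point.
pattern = re.compile(r'\d+(?:\.\d+)? ML')

sample = '60 ML of paracetomol and 0.5 ML of XYZ'

# search() stops at the first match.
first_only = pattern.search(sample).group(0)

# findall() collects every non-overlapping match; because the only group in
# the pattern is non-capturing, each item is the full matched text.
all_matches = pattern.findall(sample)

# With an unescaped dot, "." matches any character, so a made-up input such
# as '1x5 ML' is accepted as if it were a decimal number.
loose = re.findall(r'\d+(?:.\d+)? ML', '1x5 ML')
strict = pattern.findall('1x5 ML')

print(first_only)
print(all_matches)
print(loose, strict)
```

Testing the pattern this way before wrapping it in a UDF keeps the feedback loop short, since no Spark job has to run.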

Python UDFs carry a significant performance penalty compared with native DataFrame operations. After thinking about it a little more, here is another way to do it without a UDF. The general idea is to replace all the text you don't want with commas, then split on the commas to create the array of final values. If you only want the numbers, you can update the regexes to take 'ML' out of the capture group.

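# An integer or decimal amount followed by " ML"
pattern = r'\d+(?:\.\d+)? ML'
# Lazily matches any text up to and including the next amount (group 1)
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
# Greedily matches everything through the last amount (group 1), then the trailing text
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)

# Keep each amount and put a comma after it
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
# Drop whatever follows the last amount, including the extra comma
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
# Split on the commas to get the array of amounts
df4 = df3.withColumn('a', split('a', ','))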
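
To follow what each replacement does, the same three steps can be traced in plain Python with re.sub. This is only a sketch of the idea, not the Spark code itself: Python writes the group reference as \1 where regexp_replace uses $1. The numbers-only variant at the end is one possible reading of "take 'ML' out of the capture group"; its clean-up pattern `,[^,]*$` is an assumption and not part of the original answer.

```python
import re

pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)

s = '60 ML of paracetomol and 0.5 ML of XYZ'

# Step 1: keep each amount and put a comma after it.
step1 = re.sub(split_pattern, r'\1,', s)

# Step 2: drop whatever follows the last amount, including the extra comma.
step2 = re.sub(end_pattern, r'\1', step1)

# Step 3: split on the commas.
amounts = step2.split(',')

# Numbers only: move ' ML' outside the capture group. The end_pattern above
# no longer applies because 'ML' is gone after step 1, so this variant strips
# the last comma and anything after it instead (an assumed clean-up pattern).
num_split_pattern = r'.*?(\d+(?:\.\d+)?) ML'
num_step1 = re.sub(num_split_pattern, r'\1,', s)
numbers = re.sub(r',[^,]*$', '', num_step1).split(',')

print(step1)    # 60 ML,0.5 ML, of XYZ
print(step2)    # 60 ML,0.5 ML
print(amounts)  # ['60 ML', '0.5 ML']
print(numbers)  # ['60', '0.5']
```

Because the comma acts as the separator, this approach assumes the column text contains no commas of its own; if it can, a delimiter that never occurs in the data would be a safer choice.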
