Extracting a numerical value from a paragraph based on preceding words

Problem description

I'm working with some big text fields in columns. After some cleanup I have something like below:

truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]

I want to extract the number 2. I'm trying to match the string "xerb Scale" and then extract 2. I tried capturing the group including 2 as (?:xerb Scale:\s\[\")\d{1} and tried to exclude the matched group through a negative look ahead but had no luck.

This is going to be in a SQL query and I'm trying to extract the numerical value through a REGEXP_EXTRACT() function. This query is part of a pipeline that loads this information into the database.

Any help would be greatly appreciated!

Recommended answer

Match the text you do not need to return so that it sets the context for the match, and wrap only the part you want to extract in a capturing group:

xerb Scale:\s*\["(\d+)"]
                 ^^^^^  

See the regex demo. In Presto, use REGEXP_EXTRACT to get the first match:

SELECT regexp_extract(col, 'xerb Scale:\s*\["(\d+)"]', 1); -- 2
                                                      ^^^

Note the 1 argument:

regexp_extract(string, pattern, group) → varchar
Finds the first occurrence of the regular expression pattern in string and returns the capturing group number group
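
The difference between the whole match and capturing group 1 can be checked outside the database. The following is a minimal sketch that uses Python's re module as a stand-in for Presto's regex engine (an assumption made for illustration; the two engines are not identical, and the variable names are hypothetical). It applies the same pattern to the sample text from the question:

```python
import re

# Sample text copied from the question.
text = '''truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]'''

# Same pattern as in the answer; the parentheses form capturing group 1.
pattern = re.compile(r'xerb Scale:\s*\["(\d+)"]')

# search() returns the first occurrence, similar to what regexp_extract does.
match = pattern.search(text)

whole_match = match.group(0)  # the context plus the digits
digits = match.group(1)       # only the captured digits

print(whole_match)
print(digits)
```

Group 0 still contains the "xerb Scale" prefix, which is why the group argument 1 is needed to return only the number.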
