Extracting a number following specific text in R


Question

I have a data frame which contains a column full of text. I need to capture the number (most likely 1 to 4 digits long) that follows a certain phrase, namely 'Floor Area' or 'floor area'. My data will look something like the following:

"A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"
"Newbuild flat. Floor Area: 30 sq.m" 
"6 bed house with floor area 50 sqm, lot area 25 sqm"

If I try to extract just the number, or if I look back from sqm, I will sometimes get the lot area by mistake. If someone could help me with a lookahead regex or similar in stringr, I'd appreciate it. Regex is a weak point for me. Many thanks in advance.

Recommended answer

A common technique for extracting a number before or after a word is to match the entire string with sub: match everything up to the word and the number (or the number and the word) while capturing the number, then match the rest of the string, and replace the whole match with the captured substring:

# Extract the first number after a word:
as.integer(sub(".*?<WORD_OR_PATTERN_HERE>.*?(\\d+).*", "\\1", x))

# Extract the first number before a word:
as.integer(sub(".*?(\\d+)\\s*<WORD_OR_PATTERN_HERE>.*", "\\1", x))

NOTE: Replace \\d+ with \\d+(?:\\.\\d+)? to match either integer or float numbers (and, to stay consistent with the code above, remember to change as.integer to as.numeric). In the second sub call, \\s* matches zero or more whitespace characters between the number and the word.
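As a small illustration of the note above, the following sketch applies the float-aware pattern to two made-up strings. The input vector x is hypothetical (it is not from the question), and perl = TRUE is a choice made here so that the pattern is interpreted by PCRE.

```r
# Hypothetical input (not from the question): one decimal area, one integer area
x <- c("floor area: 50.5 sqm", "Floor Area 30 sqm")

# \\d+(?:\\.\\d+)? matches an integer with an optional fractional part;
# as.numeric (rather than as.integer) keeps the decimals
res <- as.numeric(
  sub("(?i).*?\\bfloor area:?\\s*(\\d+(?:\\.\\d+)?).*", "\\1", x, perl = TRUE)
)
print(res)
```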

For the current scenario, a possible solution is:

v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
as.integer(sub("(?i).*?\\bfloor area:?\\s*(\\d+).*", "\\1", v))
# [1] 50 30 50
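One caveat with this approach: sub returns its input unchanged when the pattern does not match, so as.integer then produces NA together with an "NAs introduced by coercion" warning. A minimal sketch of one way to guard against that with grepl follows; the helper name extract_floor_area and the second test string are hypothetical.

```r
# Hypothetical helper: returns NA_integer_ where "floor area" is absent,
# without triggering a coercion warning
extract_floor_area <- function(x) {
  pattern <- "(?i).*?\\bfloor area:?\\s*(\\d+).*"
  hit <- grepl(pattern, x, perl = TRUE)          # which elements match at all
  out <- rep(NA_integer_, length(x))
  out[hit] <- as.integer(sub(pattern, "\\1", x[hit], perl = TRUE))
  out
}

v2 <- c("Newbuild flat. Floor Area: 30 sq.m", "Studio with balcony, 2nd floor")
print(extract_floor_area(v2))
```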


You may also use the capturing mechanism of str_match from stringr and take the second column of the result ([,2]), which holds the first capture group:

> library(stringr)
> v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
> as.integer(str_match(v, "(?i)\\bfloor area:?\\s*(\\d+)")[,2])
[1] 50 30 50
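Since the question mentions a data frame, the same call might be applied to a column as sketched below. The names df, description, and floor_area are assumptions made for illustration, as is the third row, which has no floor area at all.

```r
library(stringr)

# Hypothetical data frame; the column names are assumptions
df <- data.frame(
  description = c("Newbuild flat. Floor Area: 30 sq.m",
                  "6 bed house with floor area 50 sqm, lot area 25 sqm",
                  "Studio, no size given"),
  stringsAsFactors = FALSE
)

# str_match() yields NA in the capture column for rows without a match
df$floor_area <- as.integer(
  str_match(df$description, "(?i)\\bfloor area:?\\s*(\\d+)")[, 2]
)
print(df$floor_area)
```

Rows where the phrase is missing come back as NA rather than as the original text, which makes them easy to filter afterwards.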


The regex matches:

  • (?i) - makes the match case-insensitive
  • \\bfloor area:? - a word boundary (\b), then the text floor area followed by an optional : (zero or one occurrence, ?)
  • \\s* - zero or more whitespace characters
  • (\\d+) - Group 1 (returned in column [,2] of the str_match result) capturing one or more digits
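The question asked about lookarounds. A related alternative, which is not part of the answer above, is the PCRE \K operator: it discards everything matched so far, so only the digits remain in the match. The sketch below uses base R with perl = TRUE; note that regmatches drops elements with no match, so the result can be shorter than the input.

```r
v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift",
       "Newbuild flat. Floor Area: 30 sq.m",
       "6 bed house with floor area 50 sqm, lot area 25 sqm")

# \\K resets the start of the reported match, leaving only the digits
m <- regmatches(v, regexpr("(?i)\\bfloor area:?\\s*\\K\\d+", v, perl = TRUE))
areas <- as.integer(m)
print(areas)
```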
