Regular expression not working in R but works on website (text mining)


Question

I have a regex that works on a regular expression testing website but doesn't work when I copy it into R. Below is the code to recreate my data frame:

text <- data.frame(page = c(1,1,2,3), sen = c(1,2,1,1),
                   text = c("Dear Mr case 1",
                            "the value of my property is £500,000.00 and it was built in 1980", 
                            "The protected percentage is 0% for 2 years",
                            "The interest rate is fixed for 2 years at 4.8%"))

The regex working on the website: https://regex101.com/r/OcVN5r/2

Below is the R code I have tried so far; neither call works.

library(stringr)
patt = "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"
str_extract(text, patt)
grepl(pattern = patt, x = text)

I'm getting an error saying the regex is invalid, but it works on the website, and I'm not sure how to get it to work in R. Basically, I am trying to extract pieces of information from the text. From the data frame above, I need to extract the following:

1: The gender of the person. In this case it would be Male (based on "Mr").

2: The number that represents the property value. In this case it would be £500,000.00.

3: The protected percentage value, which in our case would be 0%.

4: The interest rate value, which in our case is 4.8%.

Recommended answer

I think the issue is that your regex doesn't offer alternative ("OR") matches. See below, based on your bullet list:

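library(stringi)
# Three alternatives joined by "|": title after "dear", symbol-prefixed amount, percentage
rgx <- "(?<=dear\\s?)(m(r(s)?|s|iss))|\\p{S}([0-9]\\S+)|([0-9]+)((\\.[0-9]{1,})?)\\%"
# unlist() is called directly so that the %>% pipe (from magrittr) is not required
unlist(stri_extract_all_regex(
   text$text, rgx, opts_regex = stri_opts_regex(case_insensitive = TRUE)
))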

This gives:

[1] "Mr"          "£500,000.00" "0%"          "4.8%"
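The flat character vector above still has to be mapped onto the four items the question asks for. The following is a minimal sketch of one way to do that; `label_matches` and the title-to-gender lookup are hypothetical helpers that are not part of the original answer, and the sketch assumes the matches always arrive in the order title, property value, protected percentage, interest rate.

```r
# Hypothetical helper (not from the original answer): turn the vector of
# matches into a named list. Assumes the order: title, property value,
# protected percentage, interest rate.
label_matches <- function(matches) {
  # Assumed mapping from title to gender; adjust to your own data
  gender_lookup <- c(mr = "Male", mrs = "Female", ms = "Female", miss = "Female")
  list(
    gender            = unname(gender_lookup[tolower(matches[1])]),
    # Drop everything except digits and the decimal point, then convert
    property_value    = as.numeric(gsub("[^0-9.]", "", matches[2])),
    protected_pct     = as.numeric(sub("%", "", matches[3], fixed = TRUE)),
    interest_rate_pct = as.numeric(sub("%", "", matches[4], fixed = TRUE))
  )
}

res <- label_matches(c("Mr", "£500,000.00", "0%", "4.8%"))
str(res)
```

If a document can contain fewer or more than four matches, position-based labelling like this will mislabel values, so a check on `length(matches)` would be a sensible addition.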

The pattern reads:

  • "(?<=dear\\s?)(m(r(s)?|s|iss))" = find a match where the word "dear" appears before mr, mrs, ms or miss, but don't capture "dear" or the space after it
  • | = OR
  • "\\p{S}([0-9]\\S+)" = find a match where a sequence starting with a digit follows a symbol (see ?stringi-search-charclass) and runs until the next white space. The match must begin with a symbol
  • | = OR
  • "([0-9]+)((\\.[0-9]{1,})?)\\%" = find a match where one or more digits occur, optionally followed by a decimal point and more digits, and ending in a percent sign
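As a complement, here is a sketch of how the pattern from the question might be made to run in base R. It rests on assumptions the thread does not confirm: that the error comes from `grepl()` using R's default regex engine, which does not support lookahead assertions such as `(?!...)` unless `perl = TRUE` is set; that the regex101 test used a case-insensitive flag; and that the sentences were tested there as one continuous string. The variable names `full_text` and `fields` are illustrative only.

```r
# Data frame from the question
text <- data.frame(page = c(1, 1, 2, 3), sen = c(1, 2, 1, 1),
                   text = c("Dear Mr case 1",
                            "the value of my property is £500,000.00 and it was built in 1980",
                            "The protected percentage is 0% for 2 years",
                            "The interest rate is fixed for 2 years at 4.8%"),
                   stringsAsFactors = FALSE)

# Pattern from the question, unchanged
patt <- "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"

# The pattern spans several sentences, so join the rows into one string
# (assumption: this mirrors how the text was pasted into regex101)
full_text <- paste(text$text, collapse = " ")

# perl = TRUE enables lookahead; ignore.case = TRUE lets "dear" match "Dear"
m <- regmatches(full_text,
                regexec(patt, full_text, perl = TRUE, ignore.case = TRUE))[[1]]

# The first element is the whole match; the rest are the capture groups
fields <- m[-1]
print(fields)
```

Note that the calls in the question pass the whole data frame (`text`) rather than the character column (`text$text`), which may also contribute to the unexpected behaviour.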
