Regular expression not working in R but works on website. Text mining
Problem description
I have a regex that works on a regex-testing website but doesn't work when I copy it into R. Below is the code to recreate my data frame:
text <- data.frame(page = c(1,1,2,3), sen = c(1,2,1,1),
text = c("Dear Mr case 1",
"the value of my property is £500,000.00 and it was built in 1980",
"The protected percentage is 0% for 2 years",
"The interest rate is fixed for 2 years at 4.8%"))
Regex working on the website: https://regex101.com/r/OcVN5r/2
Below is the R code I have tried so far; neither attempt works.
library(stringr)
patt = "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"
str_extract(text, patt)
grepl(pattern = patt, x = text)
I'm getting an error saying the regex is invalid, even though it works on the website, and I'm not sure how to get it to work in R. Basically, I am trying to extract pieces of information from the text. From the data frame above, I need to extract the following:
1: Gender of the person. In this case it would be Male (based on Mr).
2: The number that represents the property value. In this case it would be £500,000.00.
3: The protected percentage value, which in our case would be 0%.
4: The interest rate value, which in our case is 4.8%.
Recommended answer
I think the issue is that your regex doesn't offer alternative ("OR") matches. See below, based on your bullet list:
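library(stringi)
library(magrittr)  # provides the %>% pipe used below
# Three alternatives joined by "|": the title after "dear", a symbol-prefixed number, a percentage
rgx <- "(?<=dear\\s?)(m(r(s)?|s|iss))|\\p{S}([0-9]\\S+)|([0-9]+)((\\.[0-9]{1,})?)\\%"
stri_extract_all_regex(
  text$text, rgx, opts_regex = stri_opts_regex(case_insensitive = TRUE)
) %>% unlist()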
This gives
[1] "Mr" "£500,000.00" "0%" "4.8%"
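As a minimal sketch, the four extracted strings can then be turned into the fields listed in the question. This assumes the matches come back in exactly this order with one match per field. The title-to-gender lookup table is a hypothetical illustration: the question only states that Mr means Male. The pound sign is written as the escape "\u00a3" so the example does not depend on file encoding.

```r
# Result vector from the extraction above ("\u00a3" is the pound sign).
vals <- c("Mr", "\u00a3500,000.00", "0%", "4.8%")

# Hypothetical lookup table; only Mr -> Male is stated in the question.
title_to_gender <- c(mr = "Male", mrs = "Female", miss = "Female", ms = "Female")
gender <- unname(title_to_gender[tolower(vals[1])])

# Drop everything except digits and the decimal point, then convert to numbers.
property_value <- as.numeric(gsub("[^0-9.]", "", vals[2]))
protected_pct  <- as.numeric(sub("%", "", vals[3], fixed = TRUE))
interest_pct   <- as.numeric(sub("%", "", vals[4], fixed = TRUE))
```

Relying on position is fragile if a sentence is missing or contains an extra percentage, so for real documents it may be safer to extract each field with its own pattern.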
The pattern says:
"(?<=dear\\s?)(m(r(s)?|s|iss))" = find a match where the word dear appears before mr, ms, mrs or miss, but don't capture the dear or the leading space
| = OR
"\\p{S}([0-9]\\S+)" = find a match where a sequence of numbers occurs after a symbol (see ?stringi-search-charclass) and runs until the next white space; the symbol at the beginning is required
| = OR
"([0-9]+)((\\.[0-9]{1,})?)\\%" = find a match where one or more digits occur, optionally followed by a decimal point and more digits, ending in a percent sign
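Separately, the pattern from the question can probably be made to run in base R as well. Two things likely stand in its way: without perl = TRUE, grepl() uses an engine that does not support lookaheads such as (?!...), and the pattern spans several sentences while the data frame stores one sentence per row. The sketch below assumes the rows belong to a single letter and can be joined with spaces. The names sentences, full_text and fields are illustrative, and the pound sign is written as "\u00a3".

```r
# Sentences from the question's data frame ("\u00a3" is the pound sign).
sentences <- c(
  "Dear Mr case 1",
  "the value of my property is \u00a3500,000.00 and it was built in 1980",
  "The protected percentage is 0% for 2 years",
  "The interest rate is fixed for 2 years at 4.8%"
)

# The pattern crosses sentence boundaries, so join the rows into one string.
full_text <- paste(sentences, collapse = " ")

# Same pattern as in the question, with the pound sign escaped.
patt <- "dear\\s+(mr|mrs|miss|ms)\\b[^\u00a3]+(\u00a3[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"

# perl = TRUE enables lookaheads; ignore.case lets "dear" match "Dear".
m <- regmatches(
  full_text,
  regexec(patt, full_text, perl = TRUE, ignore.case = TRUE)
)[[1]]

# The first element is the whole match; the rest are the capture groups.
fields <- m[-1]
```

Here fields is expected to hold the title, the property value and the two percentages, in that order. Unlike the alternation approach above, this requires all four pieces to be present in that order, otherwise nothing matches.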