Stringr pattern to detect capitalized words
Question
I am trying to write a function to detect words that are written entirely in capital letters.
Currently, my code is:
library(tidyverse)  # provides %>%, add_row(), mutate() and the str_* functions

df <- data.frame(title = character(), id = numeric()) %>%
  add_row(title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)
df <- df %>%
  mutate(sec_code_1 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][1])
       , sec_code_2 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][2])
       , sec_code_3 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][3]))
df
where the output is:
title | id | sec_code_1 | sec_code_2 | sec_code_3 |
---|---|---|---|---|
THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | DONT | WAS |
The first 3-5 letter capitalized word is "THIS", the second should skip "EXAMPLE" (more than 5 letters) and be "DONT", and the third should be "WAS", i.e.:
title | id | sec_code_1 | sec_code_2 | sec_code_3 |
---|---|---|---|---|
THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | THIS | DONT | WAS |
Does anyone know where I'm going wrong? Specifically, how can I denote "space or beginning of string" or "space or end of string" logically using stringr?
Recommended answer
If you run the code with your regex, you'll see that 'THIS' is not included in the output at all.
str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "
This is because the pattern only matches words that have both a leading and a trailing space. 'THIS' has no leading space because it is at the start of the sentence, so it does not satisfy the regex pattern. You can use word boundaries (\\b) instead.
str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
Your code will work if you use the above pattern in it.
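To express "space or beginning of string" and "space or end of string" literally, rather than through \\b, lookarounds are one option. The sketch below is an illustration and not part of the original answer: the names pattern_ws and punct are invented for the example, and it assumes stringr's default ICU regex engine, which supports lookbehind and lookahead.

```r
library(stringr)

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<!\\S): not preceded by a non-whitespace character,
#           i.e. preceded by whitespace or by the start of the string
# (?!\\S):  not followed by a non-whitespace character,
#           i.e. followed by whitespace or by the end of the string
pattern_ws <- "(?<!\\S)[A-Z]{3,5}(?!\\S)"

ws_matches <- str_extract_all(title, pattern_ws)[[1]]
print(ws_matches)

# Hypothetical second input to show where the two patterns disagree
punct <- "USD-EUR rate, says CNN."
print(str_extract_all(punct, "\\b[A-Z]{3,5}\\b")[[1]])  # \\b accepts punctuation as a boundary
print(str_extract_all(punct, pattern_ws)[[1]])          # whitespace-only delimiters
```

On the title from the question both patterns should give the same three words. They differ on punctuation: \\b also matches next to hyphens, commas and periods, while the lookaround version only accepts words delimited by whitespace or the ends of the string. Because lookarounds do not consume the surrounding spaces, adjacent matches cannot block each other and the results need no trimming.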
Alternatively, you can use:
library(tidyverse)

df %>%
  mutate(code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  unnest_wider(code) %>%
  rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))

#  title                                     id sec_code_1 sec_code_2 sec_code_3
#  <chr>                                  <dbl> <chr>      <chr>      <chr>
#1 THIS is an EXAMPLE where I DONT get t…     6 THIS       DONT       WAS
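Depending on the installed tidyr version, unnest_wider() may refuse unnamed list elements unless names_sep is supplied. The following is a hedged variant of the same pipeline that names the new columns explicitly. It is a sketch, not the original answer's code: the intermediate code_ prefix simply follows from the column name plus names_sep = "_", and the variable out is introduced only for the example.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  id = 6
)

out <- df %>%
  # one character vector of matches per row, stored as a list column
  mutate(code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  # spread the matches into columns code_1, code_2, ...
  unnest_wider(code, names_sep = "_") %>%
  # code_1 -> sec_code_1, and so on
  rename_with(~ paste0("sec_", .x), starts_with("code_"))

print(names(out))
```

Rows with fewer matches than the widest row are filled with NA, so the number of sec_code_ columns is determined by the title with the most matches.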