Stringr pattern to detect capitalized words


Question

I am trying to write a function that detects words written entirely in capital letters.

Currently, the code is:

library(dplyr)    # mutate() and the %>% pipe
library(tibble)   # add_row()
library(stringr)  # str_extract_all()

df <- data.frame(title = character(), id = numeric()) %>%
        add_row(title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)

# Take the 1st, 2nd and 3rd match found in the title
df <- df %>%
        mutate(sec_code_1 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][1])
               , sec_code_2 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][2])
               , sec_code_3 = unlist(str_extract_all(title, " [A-Z]{3,5} ")[[1]][3]))
df

The output is:

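| title | id | sec_code_1 | sec_code_2 | sec_code_3 |
|---|---|---|---|---|
| THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | DONT | WAS | NA |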

The first capitalized word of 3 to 5 letters is "THIS"; the second should skip "EXAMPLE" (more than 5 letters) and be "DONT"; the third should be "WAS". That is:

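| title | id | sec_code_1 | sec_code_2 | sec_code_3 |
|---|---|---|---|---|
| THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | THIS | DONT | WAS |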

Does anyone know where I'm going wrong? Specifically, how can I express "space or beginning of string" and "space or end of string" using stringr?

Recommended answer

If you run the extraction with your regex on its own, you'll see that 'THIS' is not included in the output at all.

str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS " 

This is because the pattern only matches words that have a space both before and after them. 'THIS' has no leading space because it sits at the start of the string, so it does not satisfy the pattern. You can use word boundaries (\\b) instead, which also match at the start and end of the string.

str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
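To answer the literal question of "space or beginning of string" and "space or end of string", one option is a pair of negative lookarounds on non-whitespace. The sketch below is an alternative to \\b, not part of the original answer; the variable names (`ws_pattern`, `tricky`) are made up for illustration. It also shows where the two patterns differ: \\b treats punctuation such as a hyphen as a boundary, while the lookaround version only accepts whitespace or the string edges.

```r
library(stringr)

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<!\\S) : not preceded by a non-space character (start of string or whitespace)
# (?!\\S)  : not followed by a non-space character (end of string or whitespace)
ws_pattern <- "(?<!\\S)[A-Z]{3,5}(?!\\S)"

res_title <- str_extract_all(title, ws_pattern)[[1]]
res_title
# [1] "THIS" "DONT" "WAS"

# A hyphenated token: \\b splits it into two matches, the lookarounds reject it
tricky <- "ABC-DEF and XYZ"
res_boundary <- str_extract_all(tricky, "\\b[A-Z]{3,5}\\b")[[1]]
res_ws <- str_extract_all(tricky, ws_pattern)[[1]]
res_boundary
# [1] "ABC" "DEF" "XYZ"
res_ws
# [1] "XYZ"
```

Which behaviour is preferable depends on whether tokens like "ABC-DEF" should count as codes in your data.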

Your code will work if you use the above pattern in it.
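One caveat when the data frame has more than one row: `[[1]]` always picks the matches of the first title, so every row would receive the same codes. A minimal sketch of a per-row version follows; `nth_caps_word` and the sample `titles` vector are hypothetical names introduced here, not part of the original post.

```r
library(stringr)

# Hypothetical helper: return the n-th all-caps word (3 to 5 letters) of each string,
# or NA when a string has fewer than n such words
nth_caps_word <- function(x, n) {
  matches <- str_extract_all(x, "\\b[A-Z]{3,5}\\b")  # one character vector per string
  vapply(matches, function(m) m[n], character(1))    # out-of-range index yields NA
}

titles <- c(
  "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  "no codes here",
  "ONE and TWO"
)

first_codes <- nth_caps_word(titles, 1)
second_codes <- nth_caps_word(titles, 2)
first_codes
# [1] "THIS" NA     "ONE"
second_codes
# [1] "DONT" NA     "TWO"
```

Inside `mutate()` this could be used as `sec_code_1 = nth_caps_word(title, 1)`, and so on for the other columns.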

Alternatively, you can use:

library(tidyverse)

df %>%
  mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
  unnest_wider(code) %>%
  rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))

# title                                     id sec_code_1 sec_code_2 sec_code_3
#  <chr>                                  <dbl> <chr>      <chr>      <chr>     
#1 THIS is an EXAMPLE where I DONT get t…     6 THIS       DONT       WAS 
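The renaming step above relies on `unnest_wider()` generating automatic column names that start with `..`. As far as I know, recent tidyr releases instead ask for `names_sep` when the list elements are unnamed, so the following variant is a sketch under that assumption: naming the list column `sec_code` and passing `names_sep = "_"` produces the `sec_code_1`, `sec_code_2`, ... columns directly, with no separate rename.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- tibble(
  title = "THIS is an EXAMPLE where I DONT get the output i WAS hoping for",
  id = 6
)

out <- df %>%
  # list column holding all 3 to 5 letter all-caps words per row
  mutate(sec_code = str_extract_all(title, "\\b[A-Z]{3,5}\\b")) %>%
  # spread the list column into sec_code_1, sec_code_2, ...
  unnest_wider(sec_code, names_sep = "_")

names(out)
# [1] "title"      "id"         "sec_code_1" "sec_code_2" "sec_code_3"
```

Rows with fewer matches get NA in the trailing columns, so this also works when titles contain different numbers of codes.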
