R regex - extract words beginning with @ symbol
Question
I'm trying to extract Twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do that like so:
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "Ahello" "Ame"
Great. Now let's try the same thing using "@" instead of "A":
str_extract_all(c("h@i", "hi @hello @me"), "(?<=\\b)\\@[^\\s]+")
# [[1]]
# [1] "@i"
# [[2]]
# character(0)
Why does this example give the opposite of the result I was expecting, and how can I fix it?
Recommended answer
It looks like you meant:
str_extract_all(c("h@i", "hi @hello @me", "@twitter"), "(?<=^|\\s)@[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "@hello" "@me"
# [[3]]
# [1] "@twitter"
In a regular expression, \b is a word boundary: it occurs "Between two characters in the string, where one is a word character and the other is not a word character." Since a space and "@" are both non-word characters, there is no boundary before the "@" in "hi @hello". In "h@i", by contrast, "h" is a word character and "@" is not, so there is a boundary right before the "@", which is why that string matched.
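To see where the boundaries fall, a quick check with `str_detect()` may help. This is only an illustrative sketch; the test strings and the variable name `has_boundary` are made up for the example.

```r
library(stringr)

# "\\b@" asks for a word boundary immediately before "@".
x <- c("h@i", "hi @hello", "@me")
has_boundary <- str_detect(x, "\\b@")
print(has_boundary)
# "h@i":       "h" is a word character and "@" is not, so a boundary exists
# "hi @hello": " " and "@" are both non-word characters, so no boundary
# "@me":       start of string followed by a non-word character, so no boundary
```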
With this revision, the lookbehind (?<=^|\\s) requires the "@" to sit either at the start of the string or immediately after a whitespace character.
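The same condition can also be written as a negative lookbehind, `(?<!\\S)`, meaning "not preceded by a non-whitespace character". The sketch below wraps that in a hypothetical helper, `extract_handles()`, which is not part of stringr. It also assumes that handles consist only of letters, digits, and underscores, so it uses `\\w+` rather than `[^\\s]+` to leave out trailing punctuation; adjust that if your data differs.

```r
library(stringr)

# Hypothetical helper (not part of stringr): extract @handles that start a
# string or follow whitespace. Assumes handles contain only word characters.
extract_handles <- function(tweets) {
  str_extract_all(tweets, "(?<!\\S)@\\w+")
}

handles <- extract_handles(c("h@i", "hi @hello @me", "thanks @twitter, bye"))
print(handles)
```

With `[^\\s]+` in place of `\\w+`, the third string would yield "@twitter," including the comma.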