R regex - extract words beginning with @ symbol


Question

I'm trying to extract Twitter handles from tweets using R's stringr package. For example, suppose I want to get all the words in a vector that begin with "A". I can do this like so:

library(stringr)

# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")

[[1]]
character(0)

[[2]]
[1] "Ahello" "Ame"   

Great. Now let's try the same thing using "@" instead of "A":

str_extract_all(c("h@i", "hi @hello @me"), "(?<=\\b)\\@[^\\s]+")

[[1]]
[1] "@i"

[[2]]
character(0)

Why does this example give the opposite of the result I was expecting, and how can I fix it?

Recommended answer

It looks like you want something like this:

str_extract_all(c("h@i", "hi @hello @me", "@twitter"), "(?<=^|\\s)@[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "@hello" "@me" 
# [[3]]
# [1] "@twitter"

The \b in a regular expression is a word boundary: it occurs "Between two characters in the string, where one is a word character and the other is not a word character." Since a space and "@" are both non-word characters, there is no boundary before the "@" in "hi @hello". In "h@i", by contrast, "h" is a word character and "@" is not, so a boundary sits right before the "@", which is why "@i" was matched.
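The boundary rule can be checked directly. The following is a minimal sketch, not part of the original answer: it uses stringr's str_detect with the pattern "\\b@" to ask whether a word boundary sits immediately before an "@". The variable names x and has_boundary are made up for illustration.

```r
library(stringr)

# Three test strings: "@" after a word character, after a space,
# and at the very start of the string.
x <- c("h@i", "hi @hello", "@me")

# "\\b@" matches only if there is a word boundary right before the "@".
has_boundary <- str_detect(x, "\\b@")

# "h@i":       "h" is a word character, "@" is not  -> boundary
# "hi @hello": space and "@" are both non-word      -> no boundary
# "@me":       start of string followed by non-word -> no boundary
print(has_boundary)
```

Only the first string reports a boundary, which matches the surprising output in the question.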

With this revision, the lookbehind (?<=^|\s) requires the "@" to be either at the start of the string or immediately after a whitespace character.
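As a possible refinement, the revised pattern can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name extract_handles is hypothetical, and it assumes a handle consists solely of word characters, so the tail of the pattern is changed from [^\\s]+ to \\w+ to stop trailing punctuation from being captured.

```r
library(stringr)

# Hypothetical helper, not from the original answer.
# Keeps the answer's lookbehind (start of string or whitespace before "@"),
# but ASSUMES handles are made of word characters only, hence "\\w+".
extract_handles <- function(tweets) {
  str_extract_all(tweets, "(?<=^|\\s)@\\w+")
}

tweets <- c("h@i", "hi @hello, @me!", "@twitter")
handles <- extract_handles(tweets)
print(handles)
```

With this variant, "@hello," yields "@hello" rather than "@hello,". If handles in your data can contain other characters, the original [^\\s]+ tail may be the better choice.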
