Get Twitter @Username with Regex in R

Question

How can I use regex in R to extract Twitter usernames from a string of text?

I've tried:

library(stringr)

theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')

But I end up with @foobar, @foo and (@bar, which contains an unwanted parenthesis.

How can I get just @foobar, @foo and @bar as output?

Recommended answer

Here's one method that works in R:

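theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
# Split on spaces so each token can be tested separately
theString1 <- unlist(strsplit(theString, " "))
# "@" at the start or after a non-word, non-"@" character, then 1 to 15 word characters
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
# [1] "@foobar" "@foo"    "(@bar)"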

If you want to use @Jerry's answer in R:

# The negative lookahead rejects matches followed by a dot, such as the "bar" in foo@bar.com
regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
# [1] "@foobar" "@foo"    "(@bar)"

However, both of these methods include the parentheses that you don't want, because grep() returns whole elements of the split vector rather than only the matched text.
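To illustrate why the parenthesis survives, here is a small base R sketch comparing grep(), which selects whole elements, with regexpr() plus regmatches(), which return only the matched span. The variable names tokens, whole and matched are made up for this example. Note that even the matched span keeps the opening parenthesis, because the (^|[^@\\w]) prefix consumes the character before the "@".

```r
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
tokens <- unlist(strsplit(theString, " "))
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"

# grep() with value = TRUE returns every element that contains a match, unchanged
whole <- grep(regex, tokens, perl = TRUE, value = TRUE)

# regexpr() + regmatches() return only the matched part of each matching element
matched <- regmatches(tokens, regexpr(regex, tokens, perl = TRUE))

print(whole)    # "@foobar" "@foo" "(@bar)"
print(matched)  # "@foobar" "@foo" "(@bar"
```

The second result is the same shape the question reports from str_extract_all: the leading character is part of the match, so it has to be removed afterwards or kept out of the match in the first place.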

UPDATE: This takes you from start to finish with no parentheses or any other kind of punctuation (except underscores, since they're allowed in usernames):

theString <- '@foobar Foobar! and @fo_o (@bar) but not foo@bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b" # get strings with @
regex2 <- "[^[:alnum:]@_]"             # remove all punctuation except _ and @
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = TRUE)])
users
# [1] "@foobar" "@fo_o"   "@bar"
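As an alternative that is not part of the answers above, the same result can probably be reached in one pass with a PCRE lookbehind, so the character before the "@" is checked but never becomes part of the match. This is a minimal base R sketch under that assumption. The helper name extract_handles is hypothetical, and the 15-character limit is carried over from the regex above.

```r
# Hypothetical helper: extract @handles from a single string without splitting it first
extract_handles <- function(text) {
  # (?<![\\w@]) : the "@" must not follow a word character or another "@"
  # \\w{1,15}\\b : 1 to 15 word characters, ending at a word boundary
  pattern <- "(?<![\\w@])@\\w{1,15}\\b"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

users <- extract_handles('@foobar Foobar! and @fo_o (@bar) but not foo@bar.com')
print(users)  # "@foobar" "@fo_o" "@bar"

# Punctuation inside a token is left out of the match instead of being deleted
possessive <- extract_handles("thanks to @foo's team")
print(possessive)  # "@foo"
```

One design difference worth noting: the gsub() cleanup deletes punctuation inside a token, so a token such as @foo's would become @foos, whereas the lookbehind version stops the match at the apostrophe.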
