Hashtag Extract function in R Programming


Question

I am trying to create a hashtag extraction function in R. This function should extract the hashtags from a post if there are any, and otherwise return a blank. My function looks like this:

hashtag_extract= function(text){
              match = str_extract_all(text,"#\\S+")
              if (match) { 
                 return match
                 }else{
               return ''}}
String="#letsdoit #Tonewbeginnign world is on a new#route"

But my function is not working and shows me tons of errors. For example, the first error is:

Error: unexpected symbol in:
      "  if (match) { 
     return match"

So I want to apply it as

hashtag_extract(String)

and the answer should look like this:

#letsdoit #Tonewbeginnign #route

Eventually I will use sapply to apply this function to a whole column, which is why the if part is important. Please ignore my indentation, since it does not matter to R, but every suggestion will be helpful.

Recommended answer

  1. Hashtag regexes aren't that simple
  2. I'm not sure you understand the commonly accepted "rules" for hashtags
  3. I do not believe str_extract_all() is returning what you think it is
  4. Just use stringi, which the stringr functions are built on top of
  5. Folks rly need to stop analyzing tweets
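
Point 3 can be illustrated with a small, corrected version of the function from the question. This is only a sketch, not the approach recommended in this answer: it keeps the question's simple "#\\S+" pattern, which does not follow the usual hashtag rules from points 1 and 2, and collapsing the matches into one space-separated string is an assumption based on the expected output shown in the question. It also shows that return must be called with parentheses, which is what triggers the "unexpected symbol" error.

```r
library(stringr)

hashtag_extract <- function(text) {
  # str_extract_all() returns a list with one character vector per input
  # string, so take the first element for a single post
  matches <- str_extract_all(text, "#\\S+")[[1]]
  # A post without hashtags yields character(0); test the length,
  # because if () cannot be applied to a list or an empty vector
  if (length(matches) > 0) {
    # return() is a function call in R and needs parentheses
    return(paste(matches, collapse = " "))
  } else {
    return("")
  }
}

hashtag_extract("#letsdoit #Tonewbeginnign world is on a new#route")
## [1] "#letsdoit #Tonewbeginnign #route"
```

Because the function handles one string at a time, it could be applied to a whole column with sapply, as the question intends.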

This should handle most, if not all, cases:

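library(magrittr)  # provides the %>% pipe used below

get_tags <- function(x) {
  # via http://stackoverflow.com/a/5768660/1457051
  # Backslashes are doubled so that the regex engine, not R, interprets \p{..} and \u....
  twitter_hashtag_regex <- "(^|[^&\\p{L}\\p{M}\\p{Nd}_\\u200c\\u200d\\ua67e\\u05be\\u05f3\\u05f4\\u309b\\u309c\\u30a0\\u30fb\\u3003\\u0f0b\\u0f0c\\u00b7])(#|\\uFF03)(?!\\uFE0F|\\u20E3)([\\p{L}\\p{M}\\p{Nd}_\\u200c\\u200d\\ua67e\\u05be\\u05f3\\u05f4\\u309b\\u309c\\u30a0\\u30fb\\u3003\\u0f0b\\u0f0c\\u00b7]*[\\p{L}\\p{M}][\\p{L}\\p{M}\\p{Nd}_\\u200c\\u200d\\ua67e\\u05be\\u05f3\\u05f4\\u309b\\u309c\\u30a0\\u30fb\\u3003\\u0f0b\\u0f0c\\u00b7]*)"
  # Column 4 of each match matrix is the third capture group: the tag text without "#"
  stringi::stri_match_all_regex(x, twitter_hashtag_regex) %>%
    purrr::map(~.[,4]) %>%
    purrr::flatten_chr()
}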

tests <- c("#teste_teste      //underscore accepted",
           "#teste-teste      //Hyphen not accepted",
           "#leof_gfg.sdfsd   //dot not accepted",
           "#f34234@45#6fgh6  // @ not accepted",
           "#leo#leo2#asd     //followed hastag without space ",
           "#6663             // only number accepted",
           "_#asd_            // hashtag can't start or finish with underscore",
           "-#sdfsdf-         // hashtag can't start or finish with hyphen",
           ".#sdfsdf.         // hashtag can't start or finish with dot",
           "#leo_leo__leo__leo____leo // decline followed underline")


get_tags(tests)
##  [1] "teste_teste"              "teste"                   
##  [3] "leof_gfg"                 "f34234"                  
##  [5] "leo"                      NA                        
##  [7] NA                         "sdfsdf"                  
##  [9] "sdfsdf"                   "leo_leo__leo__leo____leo"

your_string <- "#letsdoit #Tonewbeginnign world is on a new#route"

get_tags(your_string)
## [1] "letsdoit"       "Tonewbeginnign"

You'll need to tweak the function if you need each set of hashtags to be grouped with its input element, but you really didn't provide much detail on what you're trying to accomplish.
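
One possible tweak, sketched under assumptions: the function below returns a list with one character vector of tags per input element, so the tags stay grouped with their post. The names simple_tag_regex and get_tags_grouped are hypothetical, and the pattern is a shortened stand-in for the full Twitter regex above (it uses a lookbehind and leaves out the special Unicode characters), so it may not cover every edge case.

```r
library(stringi)

# Simplified pattern (an assumption for this sketch, not the full Twitter regex):
# "#" not preceded by a letter, mark, digit, "_" or "&", followed by a tag
# that contains at least one letter or mark.
simple_tag_regex <- "(?<![\\p{L}\\p{M}\\p{Nd}_&])#([\\p{L}\\p{M}\\p{Nd}_]*[\\p{L}\\p{M}][\\p{L}\\p{M}\\p{Nd}_]*)"

get_tags_grouped <- function(x) {
  # One match matrix per input element; column 2 holds the tag text
  matches <- stri_match_all_regex(x, simple_tag_regex)
  lapply(matches, function(m) {
    tags <- m[, 2]
    # An element without matches yields a single NA row; drop it
    tags[!is.na(tags)]
  })
}

posts <- c("#letsdoit #Tonewbeginnign world is on a new#route",
           "no tags here",
           "#6663 and #r_stats")
get_tags_grouped(posts)
```

For a data frame with a text column (say a hypothetical df$text), the result could be stored as a list column, for example df$tags <- get_tags_grouped(df$text), with empty character vectors marking posts that have no hashtags.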
