Hashtag Extract function in R Programming


Problem description

    I am trying to create a hashtag extraction function in R. The function should extract the hashtags from a post if there are any, and return a blank otherwise. My function looks like this:

    hashtag_extract= function(text){
                  match = str_extract_all(text,"#\\S+")
                  if (match) { 
                     return match
                     }else{
                   return ''}}
    String="#letsdoit #Tonewbeginnign world is on a new#route
    

    But my function is not working and shows me tons of errors. The first error is:

    Error: unexpected symbol in:
          "  if (match) { 
         return match"
    

    I want to call it as

    hashtag_extract(String)
    

    and the answer should look like

    #letsdoit  ##Tonewbeginnign   #route
    

    Eventually I will use sapply to apply this function to a whole column, which is why the if part is important. Please ignore my indentation, since it's not significant in R, but every suggestion will be helpful.

    Solution

    1. Hashtag regexes aren't that simple
    2. I'm not sure you understand the commonly accepted "rules" for hashtags
    3. I do not believe str_extract_all() is returning what you think it is
    4. Just use stringi, which the stringr functions are built on top of
    5. Folks rly need to stop analyzing tweets
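
To make the third point concrete: `return match` is a syntax error because `return` is a function in R and needs parentheses, and `if (match)` cannot work because `str_extract_all()` returns a list (one character vector per input element), not a logical value. A minimal sketch of a repaired version of the question's function follows. It is an illustration rather than part of the original answer, and it keeps the question's simple `#\\S+` pattern, which is looser than the regex used below.

```r
library(stringr)

# Hypothetical repair of the question's function (illustration only).
hashtag_extract <- function(text) {
  # str_extract_all() returns a list; take the vector for the single input
  matches <- str_extract_all(text, "#\\S+")[[1]]
  if (length(matches) > 0) {
    return(paste(matches, collapse = " "))
  }
  ""  # no hashtags found: return a blank string
}

String <- "#letsdoit #Tonewbeginnign world is on a new#route"
hashtag_extract(String)
# "#letsdoit #Tonewbeginnign #route"

# Applied over a vector (e.g. a data frame column) with sapply
sapply(c(String, "no tags here"), hashtag_extract, USE.NAMES = FALSE)
```

Note that this simple pattern picks up `#route` from `new#route`, which stricter hashtag rules would reject.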

    This should handle most, if not all, cases:

    library(magrittr)  # provides the %>% pipe used below

    get_tags <- function(x) {
      # via http://stackoverflow.com/a/5768660/1457051
      twitter_hashtag_regex <- "(^|[^&\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7])(#|\uFF03)(?!\uFE0F|\u20E3)([\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*[\\p{L}\\p{M}][\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*)"
      stringi::stri_match_all_regex(x, twitter_hashtag_regex) %>% 
        purrr::map(~.[,4]) %>% 
        purrr::flatten_chr()
    
    }
    
    tests <- c("#teste_teste      //underscore accepted",
               "#teste-teste      //Hyphen not accepted",
               "#leof_gfg.sdfsd   //dot not accepted",
               "#f34234@45#6fgh6  // @ not accepted",
               "#leo#leo2#asd     //followed hastag without space ",
               "#6663             // only number accepted",
               "_#asd_            // hashtag can't start or finish with underscore",
               "-#sdfsdf-         // hashtag can't start or finish with hyphen",
               ".#sdfsdf.         // hashtag can't start or finish with dot",
               "#leo_leo__leo__leo____leo // decline followed underline")
    
    
    get_tags(tests)
    ##  [1] "teste_teste"              "teste"                   
    ##  [3] "leof_gfg"                 "f34234"                  
    ##  [5] "leo"                      NA                        
    ##  [7] NA                         "sdfsdf"                  
    ##  [9] "sdfsdf"                   "leo_leo__leo__leo____leo"
    
    your_string <- "#letsdoit #Tonewbeginnign world is on a new#route"
    
    get_tags(your_string)
    ## [1] "letsdoit"       "Tonewbeginnign"
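
The `[,4]` in `get_tags()` may look arbitrary. `stri_match_all_regex()` returns one matrix per input string, with the full match in column 1 and one further column per capture group. The regex above has three capture groups (the character before the `#`, the `#` itself, and the tag body), so the tag body sits in column 4. The sketch below shows the same layout with a much simpler toy pattern, which is an assumption made for illustration and not a substitute for the full regex.

```r
library(stringi)

# Toy pattern with three capture groups, mirroring the layout of the regex above:
# (1) what precedes the '#', (2) the '#' itself, (3) the tag body.
toy_regex <- "(^|\\s)(#)(\\w+)"

m <- stri_match_all_regex("say #hello", toy_regex)[[1]]
ncol(m)   # 4: the full match plus three capture groups
m[, 1]    # " #hello" (full match, including the leading space)
m[, 4]    # "hello"   (third capture group: the tag body)
```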
    

    You'll need to tweak the function if you want the hashtags grouped by input element (one set per post), since it currently flattens everything into a single character vector. You didn't provide much detail on what you're really trying to accomplish.
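
One possible way to keep the tags grouped per post, sketched under assumptions: `get_tags_grouped()`, `simple_tag_regex`, and the sample `posts` are hypothetical names introduced here, and the pattern is a deliberately simplified stand-in for the full Twitter regex (a `#` not preceded by a letter, digit, or underscore, followed by a body containing at least one letter).

```r
library(stringi)

# Simplified, hypothetical pattern (NOT the full Twitter regex from the answer).
simple_tag_regex <- "(?<![\\p{L}\\p{Nd}_])#([\\p{L}\\p{Nd}_]*\\p{L}[\\p{L}\\p{Nd}_]*)"

# Returns a list with one character vector of tags per input element.
get_tags_grouped <- function(x, pattern = simple_tag_regex) {
  matches <- stri_match_all_regex(x, pattern)
  lapply(matches, function(m) {
    tags <- m[, 2]        # column 2 holds the single capture group (tag body)
    tags[!is.na(tags)]    # inputs without a match yield an NA row; drop it
  })
}

posts <- c("#letsdoit #Tonewbeginnign world is on a new#route",
           "no tags here",
           "#6663 and #a1")
tags_by_post <- get_tags_grouped(posts)

# Collapse to one string per post, blank when there are no hashtags
tag_column <- vapply(tags_by_post, function(tags) {
  if (length(tags) == 0) "" else paste0("#", tags, collapse = " ")
}, character(1))
tag_column
# "#letsdoit #Tonewbeginnign"  ""  "#a1"
```

Because the result has exactly one entry per input, `tag_column` can be assigned straight back to a data frame column.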
