How to include select 2-word phrases as tokens in tidytext?


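Problem description

I am preprocessing some text data for further analysis. I use unnest_tokens() to tokenize the text into single words, but I want to keep certain commonly occurring 2-word phrases, such as "United States" or "Social Security", as single tokens. How can I do this with tidytext?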

tidy_data <- data %>%
                unnest_tokens(word, text) %>%
                anti_join(stop_words, by = "word")

dput(data[1:6, 1:6])

structure(list(race = c("US House", "US House", "US House", "US House", 
"", "US House"), district = c(8L, 3L, 6L, 17L, 2L, 1L), party = c("Republican", 
"Republican", "Republican", "Republican", "", "Republican"), 
    state = c("AZ", "AZ", "KY", "TX", "IL", "NH"), sponsor = c(4, 
    4, 4, 1, NA, 4), approve = structure(c(1L, 1L, 1L, 4L, NA, 
    1L), .Label = c("no oral statement of approval, authorization", 
    "beginning of the spot", "middle of the spot", "end of the spot"
    ), class = "factor")), row.names = c(NA, 6L), class = "data.frame")

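Recommended answer

If I were in this situation and only needed to keep a short list of 2-word phrases in my analysis, I would do some careful replacement before and after tokenizing.

First, I would replace each 2-word phrase with a form that sticks together and will not be split by the tokenization process I am using, for example "united states" to "united_states".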

library(tidyverse)
library(tidytext)


df <- tibble(text = c("I live in the United States",
                      "United we stand, divided we fall",
                      "Information security is important!",
                      "I work at the Social Security Administration"))


df_parsed <- df %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "united states", "united_states"),
         text = str_replace_all(text, "social security", "social_security"))

df_parsed
#> # A tibble: 4 x 1
#>   text                                        
#>   <chr>                                       
#> 1 i live in the united_states                 
#> 2 united we stand, divided we fall            
#> 3 information security is important!          
#> 4 i work at the social_security administration
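For more than a couple of phrases, the chained str_replace_all() calls can be collapsed into one call with a named vector. The following is a minimal sketch, not part of the original answer: phrase_map and df_joined are hypothetical names, and the \\b word boundaries are an added assumption so that a phrase is not matched inside a longer word.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical lookup table: names are regex patterns, values are the joined tokens
phrase_map <- c("\\bunited states\\b"   = "united_states",
                "\\bsocial security\\b" = "social_security")

df <- tibble(text = c("I live in the United States",
                      "I work at the Social Security Administration"))

# str_replace_all() applies every pattern/replacement pair in a named vector
df_joined <- df %>%
  mutate(text = str_replace_all(str_to_lower(text), phrase_map))

df_joined
```

Adding a phrase then only means adding one entry to phrase_map.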

Then you can tokenize as usual, and afterwards replace the joined tokens you just made with the original 2-word phrases, so that "united_states" goes back to "united states".

df_parsed %>%
  unnest_tokens(word, text) %>%
  mutate(word = case_when(word == "united_states" ~ "united states",
                          word == "social_security" ~ "social security",
                          TRUE ~ word))
#> # A tibble: 21 x 1
#>    word         
#>    <chr>        
#>  1 i            
#>  2 live         
#>  3 in           
#>  4 the          
#>  5 united states
#>  6 united       
#>  7 we           
#>  8 stand        
#>  9 divided      
#> 10 we           
#> # … with 11 more rows

Created on 2019-08-03 by the reprex package (v0.3.0)
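The restore step can also be driven by a list instead of one case_when() branch per phrase. This is a hedged sketch rather than the answer's own code: joined_tokens and restored are hypothetical names, and the small tokens tibble stands in for the output of unnest_tokens(). Only tokens on the list are changed, so any other token that happens to contain an underscore is left alone.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical list of the joined tokens created before tokenizing
joined_tokens <- c("united_states", "social_security")

# Stand-in for the result of unnest_tokens()
tokens <- tibble(word = c("the", "united_states", "snake_case", "social_security"))

# Swap the underscore back to a space, but only for tokens we joined ourselves
restored <- tokens %>%
  mutate(word = if_else(word %in% joined_tokens,
                        str_replace_all(word, "_", " "),
                        word))

restored
```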

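If you have a long list of phrases, this approach becomes difficult and tedious, and it may then make sense to look into approaches that combine bigram and unigram tokenization. You can see one example here.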
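One way to combine the two, sketched here under stated assumptions and not taken from the linked example, is to tokenize into single words and then merge adjacent words whose pairing appears in a list of phrases. The names keep_phrases, id, is_start and is_second are hypothetical, and the sketch assumes the listed phrases never overlap in the text (for example "a b" and "b c" occurring in "a b c").

```r
library(dplyr)
library(tidytext)
library(tibble)

# Hypothetical list of phrases to keep as single tokens (lower case)
keep_phrases <- c("united states", "social security")

df <- tibble(id = 1:2,
             text = c("I live in the United States",
                      "Information security is important!"))

tokens <- df %>%
  unnest_tokens(word, text) %>%
  group_by(id) %>%                                     # never pair words across documents
  mutate(bigram    = paste(word, lead(word)),          # each word plus the one after it
         is_start  = bigram %in% keep_phrases,         # first word of a listed phrase
         is_second = lag(is_start, default = FALSE)) %>%  # second word of a listed phrase
  ungroup() %>%
  filter(!is_second) %>%                               # drop the word already absorbed
  transmute(id, word = if_else(is_start, bigram, word))

tokens
```

Because the phrases live in one vector, a long list needs no extra replacement code, and the original text is never modified.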
