Word substitution within tidy text format


Question

Hi, I'm working with a tidy text format and I am trying to replace the strings "emails" and "emailing" with "email".

library(dplyr)
library(tidytext)
library(ggplot2)

set.seed(123)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
df
str(df)
df$sentence <- as.character(df$sentence)

# One token (word) per row
tidy_df <- df %>%
  unnest_tokens(word, sentence)

# Bar chart of the words that occur more than 20 times
tidy_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

This works fine, but when I use:

 tidy_df <- gsub("emailing", "email", tidy_df)

to substitute words and then run the bar chart again, I get the following error message:

Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"

Does anyone know how to easily substitute words within a tidy text format without changing the structure/class of the tidy data frame?

Recommended answer

Removing word endings like that is called stemming, and there are a couple of R packages that will do it for you, if you'd like. One is the hunspell package from rOpenSci; another is the SnowballC package, which implements Porter algorithm stemming. You would implement that like so:

library(dplyr)
library(tidytext)
library(SnowballC)

terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = wordStem(word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2       i
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7       i
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows

Notice that this stems all of your text, and that some of the words no longer look like real words (for example, "is" becomes "i" in the output above); you may or may not care about that.
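To judge whether stemming everything is acceptable, it can help to look at the original tokens next to their stems. The following is only a minimal sketch: the `tokens` tibble and the `stem` and `changed` columns are names made up for this illustration, not part of the original answer.

```r
library(dplyr)
library(SnowballC)

# A few tokens taken from the example sentences (hypothetical sample)
tokens <- tibble(word = c("emails", "emailing", "is", "nice", "computer", "modem"))

# Keep each original token next to its stem and flag the ones that changed
stem_check <- tokens %>%
  mutate(stem = wordStem(word),
         changed = word != stem)

# Show only the tokens that the stemmer altered
stem_check %>%
  filter(changed)
```

If the list of altered tokens contains words you would rather keep intact, replacing only specific words is probably the better fit.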

If you don't want to stem all of your text with a stemmer like SnowballC or hunspell, you can use dplyr's if_else() within mutate() to replace just specific words.

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
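If the list of words to replace grows, one possible variation (an assumption on my part, not something from the original answer) is to keep the replacements in a named character vector and look each token up in it. The names `replacements` and `words` below are hypothetical.

```r
library(dplyr)

# Hypothetical lookup table: names are the tokens to replace,
# values are what they should become
replacements <- c(emails = "email", emailing = "email")

words <- tibble(word = c("emails", "is", "emailing", "modem"))

# Indexing by name returns NA for tokens without an entry;
# coalesce() then falls back to the original token
words <- words %>%
  mutate(word = coalesce(unname(replacements[word]), word))

words
```

The result is still a tibble with a single character column, so it can go straight back into count() and ggplot().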

Or it might make more sense for you to use str_replace() from the stringr package.

library(stringr)
set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
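As for the error in the question: gsub() works on character vectors, so passing it a whole data frame coerces the data frame with as.character() and returns a plain character vector, which is why the later dplyr call fails. A minimal base R sketch, using a small stand-in `tidy_df` rather than the full sampled data, applies the replacement to the column instead:

```r
# Small stand-in for the tokenised data frame from the question
tidy_df <- data.frame(word = c("emails", "are", "nice", "emailing", "is", "fun"),
                      stringsAsFactors = FALSE)

# gsub() on the whole data frame returns a character vector,
# so the data frame structure is lost
class(gsub("emailing", "email", tidy_df))

# Replacing inside the column keeps the data frame intact; ^ and $ anchor
# the pattern so that only the exact tokens are rewritten
tidy_df$word <- gsub("^email(s|ing)$", "email", tidy_df$word)

class(tidy_df)
tidy_df$word
```

The anchors are an addition of mine; without them the pattern would also rewrite any longer token that merely contains "emails" or "emailing".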
