Extract words from text using dplyr and stringr

Question

I'm trying to find an efficient way to extract words from a text column in a dataset. The approach I'm using is:

library(dplyr)
library(stringr)

Text = c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
data = as.data.frame(Text)
keywords <- paste0(c("bird", "dog", "pig","wolf","cat", "bee", "turtle"), collapse = "|")
data %>% mutate(Word = str_extract(Text, keywords))

This is just an example; in practice I have more than 2000 possible words to extract from each row. I don't know of another approach yet. Will such a large regex slow things down, or does the size of the regex not matter? I don't expect more than one of these words to appear in any row, but is there a way to create multiple columns automatically if more than one word does appear?

Recommended answer

We can use str_extract_all to return a list, convert each list element to a named list (or tibble), and then use unnest_wider to spread the matches into one column per word:

library(purrr)
library(stringr)
library(tidyr)
library(dplyr)
data %>% 
  mutate(Words = str_extract_all(Text, keywords),
        Words = map(Words, ~ as.list(unique(.x)) %>% 
                              set_names(str_c('col', seq_along(.))))) %>%
  unnest_wider(Words)
# A tibble: 3 x 3
#  Text                                col1  col2 
#  <fct>                               <chr> <chr>
#1 A little bird told me about the dog bird  dog  
#2 A pig in a poke                     pig   <NA> 
#3 As busy as a bee                    bee   <NA> 
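
One thing to watch with a long keyword list is partial matching: a plain alternation such as bird|dog|cat also matches inside longer words. The sketch below is not part of the original answer; it assumes whole-word matching is what is wanted and wraps the alternation in \b word boundaries. The names words, pattern_plain, pattern_whole and text are made up for illustration.

```r
library(stringr)

# Same keyword list as in the question
words <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")

# Plain alternation, as in the question: also matches inside longer words
pattern_plain <- str_c(words, collapse = "|")

# Assumption: only whole words should match, so wrap the alternation
# in a non-capturing group between \b word boundaries
pattern_whole <- str_c("\\b(?:", pattern_plain, ")\\b")

text <- c("A little bird told me about the dog", "A category of beer")

# "cat" is found inside "category" and "bee" inside "beer"
plain_matches <- str_extract_all(text, pattern_plain)

# Only the standalone words "bird" and "dog" are kept
whole_matches <- str_extract_all(text, pattern_whole)
```

The whole-word pattern can be passed to str_extract_all in the pipeline above in place of keywords; the rest of the code stays the same.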
