Extract words from text using dplyr and stringr
Question
I'm trying to find an efficient way to extract words from a text column in a dataset. The approach I'm using is:
library(dplyr)
library(stringr)
Text = c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
data = as.data.frame(Text)
keywords <- paste0(c("bird", "dog", "pig","wolf","cat", "bee", "turtle"), collapse = "|")
data %>% mutate(Word = str_extract(Text, keywords))
This is just an example; in practice I have more than 2000 possible words to extract from each row. I don't know of another approach yet. Will such a large regex slow things down, or does the size of the regex not matter? I don't expect more than one of these words to appear in a row, but is there a way to create multiple columns automatically if more than one word does appear?
Recommended answer
We can use str_extract_all to return a list of all matches per row, convert each list element to a named list or tibble, and use unnest_wider to spread the matches into one column per word:
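Before reshaping, it may help to see how str_extract_all differs from the str_extract call in the question. The following is a minimal sketch that reuses the question's sample data; the variable names first_match, all_matches and match_counts are illustrative and not part of the original post.

```r
library(stringr)

# Sample data and pattern copied from the question
Text <- c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
keywords <- paste0(c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle"), collapse = "|")

# str_extract() keeps only the first match in each string
first_match <- str_extract(Text, keywords)

# str_extract_all() returns a list with one character vector per string
all_matches <- str_extract_all(Text, keywords)

# lengths() counts how many matches were found in each row
match_counts <- lengths(all_matches)
```

Here first_match holds "bird" for the first row, while all_matches[[1]] holds both "bird" and "dog". Because the number of matches can differ per row, the list has to be reshaped before it fits into columns, which is what the pipeline below does.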
library(purrr)
library(stringr)
library(tidyr)
library(dplyr)
data %>%
  mutate(Words = str_extract_all(Text, keywords),
         Words = map(Words, ~ as.list(unique(.x)) %>%
                       set_names(str_c('col', seq_along(.))))) %>%
  unnest_wider(Words)
# A tibble: 3 x 3
#  Text                                col1  col2
#  <fct>                               <chr> <chr>
#1 A little bird told me about the dog bird  dog
#2 A pig in a poke                     pig   <NA>
#3 As busy as a bee                    bee   <NA>
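One caveat with a plain alternation such as bird|dog|pig is that it also matches inside longer words. With a keyword list of around 2000 entries this may produce unwanted hits. A possible refinement, sketched below, wraps the alternation in word boundaries. The helper whole_word_pattern and the sample string are hypothetical, and the sketch assumes the keywords contain no regex metacharacters.

```r
library(stringr)

# Hypothetical helper (not part of stringr): builds a pattern that
# only matches the keywords as whole words
whole_word_pattern <- function(words) {
  str_c("\\b(?:", str_c(words, collapse = "|"), ")\\b")
}

words <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")
sample_text <- "The category of dogs"  # made-up string for illustration

# Plain alternation: matches "cat" inside "category" and "dog" inside "dogs"
loose <- str_extract_all(sample_text, str_c(words, collapse = "|"))[[1]]

# Word-bounded alternation: neither substring counts as a whole word
strict <- str_extract_all(sample_text, whole_word_pattern(words))[[1]]
```

The bounded pattern can be passed to str_extract_all in the pipeline above in place of keywords. Whether plural forms such as "dogs" should count as a match depends on the data, so this is a design choice rather than a required fix.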