Replace words in corpus according to dictionary data frame
Question
I am interested in replacing all words in a tm Corpus object according to a dictionary stored as a two-column data frame, where the first column is the word to be matched and the second column is the replacement word.
I am stuck on the translate function. I saw this answer, but I can't turn it into a function that can be passed to tm_map.
Please consider the following MWE:
library(tm)
docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))
dictionary <- data.frame(word = c('first', 'second', 'text'),
                         translation = c('primo', 'secondo', 'testo'))
translate <- function(text, dictionary) {
  # Would like to replace each word of text with corresponding word in dictionary
}
corp_translated <- tm_map(corp, translate)
inspect(corp_translated)
# Expected result
# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
# create_date creator
# Available variables in the data frame are:
# MetaID
# [[1]]
# primo testo
# [[2]]
# secondo testo
Recommended answer
I would suggest not using a data.frame for the dictionary, since the basic object in R, a vector, already works as a dictionary once its elements are named.
dict <- c('primo', 'secondo', 'testo')
names(dict) <- c('first', 'second', 'text')
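If the dictionary already exists as the two-column data frame from the question, it does not have to be retyped. A minimal sketch of the conversion, assuming the column names word and translation from the MWE:

```r
# Data frame from the question's MWE.
dictionary <- data.frame(word = c("first", "second", "text"),
                         translation = c("primo", "secondo", "testo"))

# Build the named vector: values are the translations, names are the
# words to match. as.character() guards against factor columns, which
# data.frame() creates by default in R versions before 4.0.
dict <- setNames(as.character(dictionary$translation),
                 as.character(dictionary$word))

dict[["second"]]
```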
Then to "translate" x, where x might be "second", you simply use:
dict[[x]]
You don't even need a wrapper function.
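That holds for a single word. Note that dict[[x]] stops with an error when x is not one of the names. To translate every word of a whole document, as the question asks, one possible sketch is the helper below. The name translate is taken from the question; splitting on whitespace and leaving unknown words untouched are assumptions of this sketch, not something the answer specifies.

```r
# Named character vector used as the lookup table.
dict <- c(first = "primo", second = "secondo", text = "testo")

# Hypothetical helper: translate each whitespace-separated word of every
# string in `text`; words without a dictionary entry are kept as they are.
translate <- function(text, dict) {
  vapply(text, function(doc) {
    words <- strsplit(doc, "\\s+")[[1]]
    hits <- words %in% names(dict)
    words[hits] <- dict[words[hits]]
    paste(words, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

translate(c("first text", "second text"), dict)
# [1] "primo testo"   "secondo testo"
```

With a recent tm version, a call along the lines of tm_map(corp, content_transformer(translate), dict = dict) should apply the helper to the corpus. That call is an assumption here and has not been checked against the tm version used in the question.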
If you want to translate in the opposite direction, use
names(dict)[dict %in% x]
Or you can flip the dictionary:
dict.flip <- names(dict)
names(dict.flip) <- dict
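As a quick check of the flipped dictionary, the sketch below builds it in one step with setNames and looks up a few Italian words. It assumes every translation is unique; if two words shared a translation, a lookup would return only the first match.

```r
dict <- c(first = "primo", second = "secondo", text = "testo")

# Same result as the two assignments above: the values become the names
# and the names become the values.
dict.flip <- setNames(names(dict), dict)

dict.flip[["secondo"]]
# [1] "second"

# Several words at once; unname() drops the Italian names from the result.
unname(dict.flip[c("primo", "testo")])
# [1] "first" "text"
```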