Replace words in corpus according to dictionary data frame

Problem description

I would like to replace all words in a tm Corpus object according to a dictionary stored as a two-column data frame, where the first column is the word to be matched and the second column is the replacement word.

I am stuck on the translate function. I saw this answer, but I can't turn it into a function that can be passed to tm_map.

Please consider the following MWE:

library(tm)

docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))

dictionary <- data.frame(word = c('first', 'second', 'text'),
                      translation = c('primo', 'secondo', 'testo'))

translate <- function(text, dictionary) {
  # Would like to replace each word of text with corresponding word in dictionary
}

corp_translated <- tm_map (corp, translate)

inspect(corp_translated)

# Expected result

# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator 
# Available variables in the data frame are:
#   MetaID 

# [[1]]
# primo testo

# [[2]]
# secondo testo

Recommended answer

I would suggest not using a data.frame for the dictionary, since the basic object in R, a vector, already works as a dictionary once its elements are named.

      dict  <- c('primo', 'secondo', 'testo')
names(dict) <- c('first', 'second', 'text')
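If the dictionary already exists as a two-column data frame, as in the question, it can be converted into such a named vector. The following is a minimal sketch, assuming the column names `word` and `translation` from the question's MWE. The `as.character()` calls are there in case the columns were read in as factors.

```r
# Dictionary as given in the question (two-column data frame)
dictionary <- data.frame(word = c("first", "second", "text"),
                         translation = c("primo", "secondo", "testo"),
                         stringsAsFactors = FALSE)

# Build a named character vector: values are translations, names are source words
dict <- setNames(as.character(dictionary$translation),
                 as.character(dictionary$word))

dict[["second"]]  # "secondo"
```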

Then to "translate" x, where x might be "second", you simply use:

   dict[[x]]

You don't even need a wrapper function.
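A single-word lookup needs no wrapper, but translating whole documents, as the question asks, means looking up every word in a string. The sketch below is one possible way to do that in base R. `translate_text` is a hypothetical helper name, and it assumes words are separated by whitespace and that words missing from the dictionary should be left unchanged.

```r
# Named vector acting as the dictionary
dict <- c(first = "primo", second = "secondo", text = "testo")

# Hypothetical helper: translate each whitespace-separated word of every string,
# keeping words that are not in the dictionary as they are
translate_text <- function(text, dict) {
  vapply(strsplit(text, "\\s+"), function(words) {
    hits <- words %in% names(dict)       # which words have a translation
    words[hits] <- dict[words[hits]]     # vectorized lookup by name
    paste(words, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

translate_text(c("first text", "second text"), dict)
# "primo testo" "secondo testo"
```

With a recent version of tm, a function like this can probably be applied to the corpus via `tm_map(corp, content_transformer(translate_text), dict = dict)`. That call is untested here and is an assumption about how the helper fits into tm.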

If you want to translate in the opposite direction, use

   names(dict)[dict %in% x]

Or you can flip the dictionary:

         dict.flip  <- names(dict)
   names(dict.flip) <- dict
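As a short illustration of the flipped dictionary in use, the sketch below builds the same reversed lookup with `setNames()` and queries it. The name `dict_flip` is only illustrative. Flipping works cleanly only if each translation is unique, because duplicated values would become duplicated names.

```r
dict <- c(first = "primo", second = "secondo", text = "testo")

# Swap names and values: the translations become the names
dict_flip <- setNames(names(dict), unname(dict))

dict_flip[["secondo"]]                   # "second"
unname(dict_flip[c("primo", "testo")])   # "first" "text"
```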
