How to create a document term matrix using native R
Question
I want to create a document term matrix using base R only (without add-on packages such as tm). The data is structured as follows:
Doc1: the test was to test the test
Doc2: we did prepare the exam to test the exam
Doc3: was the test the exam
Doc4: the exam we did prepare was to test the test
Doc5: we were successful so we all passed the exam
The result I want to achieve is as follows:
  Term   Doc1 Doc2 Doc3 Doc4 Doc5 DF
1 all       0    0    0    0    1  1
2 did       0    1    0    1    0  2
3 exam      0    2    1    1    1  4
4 passed    0    0    0    0    1  1
Recommended answer
Here's one approach, though again, why not use the tm package?
## Your data
dat <- data.frame(
  doc  = factor(c("Doc1", "Doc2", "Doc3", "Doc4", "Doc5")),
  text = c("the test was to test the test",
           "we did prepare the exam to test the exam",
           "was the test the exam",
           "the exam we did prepare was to test the test",
           "we were successful so we all passed the exam"),
  stringsAsFactors = FALSE
)
## Function to turn a list of character vectors into a data frame of counts
## (one row per list element, one column per unique term)
mtabulate <- function(vects) {
  lev <- sort(unique(unlist(vects)))
  dat <- do.call(rbind, lapply(vects, function(x, lev) {
    tabulate(factor(x, levels = lev), nbins = length(lev))
  }, lev = lev))
  colnames(dat) <- lev
  data.frame(dat, check.names = FALSE)
}
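## Split each document's text into lowercase words
out <- lapply(split(dat$text, dat$doc), function(x) {
  unlist(strsplit(tolower(x), " "))
})
## Transpose so that terms are rows and documents are columns
t(mtabulate(out))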
##            Doc1 Doc2 Doc3 Doc4 Doc5
## all           0    0    0    0    1
## did           0    1    0    1    0
## exam          0    2    1    1    1
## passed        0    0    0    0    1
## prepare       0    1    0    1    0
## so            0    0    0    0    1
## successful    0    0    0    0    1
## test          3    1    1    2    0
## the           2    2    2    2    1
## to            1    1    0    1    0
## was           1    0    1    1    0
## we            0    1    0    1    2
## were          0    0    0    0    1
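The matrix printed by the answer does not include the DF column that the question asks for. Here is a minimal base R sketch of one way to add it. It assumes DF means document frequency (the number of documents in which a term occurs at least once), which agrees with the sample rows in the question. The names docs, tokens, terms, dtm and result are illustrative and not part of the original answer, and the counts are built with tabulate() and match() instead of the mtabulate helper so the block stands on its own.

```r
## Same five documents as in the question, as a named character vector
docs <- c(
  Doc1 = "the test was to test the test",
  Doc2 = "we did prepare the exam to test the exam",
  Doc3 = "was the test the exam",
  Doc4 = "the exam we did prepare was to test the test",
  Doc5 = "we were successful so we all passed the exam"
)

## Lowercase and split on single spaces (assumes no punctuation in the text)
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

## One column of term counts per document: terms in rows, documents in columns
dtm <- vapply(tokens, function(x) {
  tabulate(match(x, terms), nbins = length(terms))
}, integer(length(terms)))
rownames(dtm) <- terms
colnames(dtm) <- names(docs)

## Assumption: DF = number of documents containing the term at least once
result <- data.frame(Term = rownames(dtm), dtm, DF = rowSums(dtm > 0),
                     row.names = NULL, stringsAsFactors = FALSE)

head(result, 4)
##     Term Doc1 Doc2 Doc3 Doc4 Doc5 DF
## 1    all    0    0    0    0    1  1
## 2    did    0    1    0    1    0  2
## 3   exam    0    2    1    1    1  4
## 4 passed    0    0    0    0    1  1
```

If DF is instead meant to be the total number of occurrences across all documents, rowSums(dtm) would replace rowSums(dtm > 0).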