How to recreate the same DocumentTermMatrix with new (test) data


Problem description

Suppose I have text-based training data and testing data. To be more specific, I have two data sets, training and testing, and each has one column that contains text and is of interest for the job at hand.

I used the tm package in R to process the text column in the training data set. After removing white space, punctuation, and stop words, I stemmed the corpus and finally created a document-term matrix (DTM) of 1-grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms with a count greater than 50.

Following this, I trained a model (say, GLMNET) using the DTM and the dependent variable (which was present in the training data). Everything has run smoothly up to this point.

However, how do I proceed when I want to score/predict the model on the testing data or any new data that might come in the future?

Specifically, what I am trying to find out is how to create the exact same DTM on new data.

If the new data set does not have any of the similar words as the original training data then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.

Any ideas/thoughts?

Solution

If I understand correctly, you have made a dtm, and you want to make a new dtm from new documents that has the same columns (i.e. terms) as the first dtm. If that's the case, then it should be a matter of subsetting the second dtm by the terms in the first, perhaps something like this:

First set up some reproducible data...

This is your training data...

library(tm)
# make corpus for text mining (data comes from package, for reproducibility)
data("crude")
corpus1 <- crude[1:10]  # crude is already a corpus, so subset it directly
# process text (your methods may differ)
# note: recent versions of tm need base functions such as tolower
# wrapped in content_transformer()
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))

And this is your testing data...

corpus2 <- crude[15:20]  # crude is already a corpus, so subset it directly
# process text (same steps as for the training data)
# note: recent versions of tm need base functions such as tolower
# wrapped in content_transformer()
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))

Here is the bit that does what you want:

Now we keep only the terms in the testing data that are present in the training data...

# convert to matrices for subsetting
crude1.dtm.mat <- as.matrix(crude1.dtm) # training
crude2.dtm.mat <- as.matrix(crude2.dtm) # testing

# subset testing data by the colnames (i.e. terms) of the training data
xx <- data.frame(crude2.dtm.mat[,intersect(colnames(crude2.dtm.mat),
                                           colnames(crude1.dtm.mat))])

Finally add to the testing data all the empty columns for terms in the training data that are not in the testing data...

# make an empty data frame with the colnames of the training data
yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
                 colClasses = "integer")

# add in columns of NAs for terms absent in the
# testing data but present in the training data,
# following SchaunW's suggestion in the comments above
library(plyr)
zz <- rbind.fill(xx, yy)

So zz is a data frame of the testing documents with the same structure as the training documents (i.e. the same columns, though many of them contain NA, as SchaunW notes).
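Since the question asks for zero counts rather than NA, and a model such as glmnet expects a numeric matrix with columns in the training order, a variant of the same idea may be useful. The following is only a minimal base-R sketch: `align_to_training` is a hypothetical helper (not part of tm or plyr), and the toy matrices stand in for `as.matrix(crude1.dtm)` and `as.matrix(crude2.dtm)`.

```r
# Hypothetical helper (not part of tm): align a test count matrix to the
# training vocabulary, filling terms unseen in the test data with 0.
align_to_training <- function(test_mat, train_terms) {
  # start from an all-zero matrix, one row per test document,
  # one column per training term (in training order)
  out <- matrix(0L, nrow = nrow(test_mat), ncol = length(train_terms),
                dimnames = list(rownames(test_mat), train_terms))
  # copy over the counts for terms that occur in both vocabularies;
  # test-only terms are dropped
  shared <- intersect(colnames(test_mat), train_terms)
  out[, shared] <- test_mat[, shared, drop = FALSE]
  out
}

# toy matrices standing in for the training and testing DTMs
train_mat <- matrix(c(1L, 0L, 2L, 1L, 0L, 3L), nrow = 2,
                    dimnames = list(c("d1", "d2"), c("oil", "price", "opec")))
test_mat  <- matrix(c(4L, 1L, 0L, 2L), nrow = 2,
                    dimnames = list(c("t1", "t2"), c("price", "barrel")))

aligned <- align_to_training(test_mat, colnames(train_mat))
print(aligned)
```

With the real data this would presumably be called as `align_to_training(crude2.dtm.mat, colnames(crude1.dtm.mat))`. Working on matrices also avoids `data.frame()` altering term names that are not syntactically valid R names.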

Is that along the lines of what you want?
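Another route that may be simpler is to let tm do the alignment: the `control` list of `DocumentTermMatrix` accepts a `dictionary` entry, a character vector of terms to tabulate against, so the training vocabulary can be passed in when building the test DTM. The sketch below assumes a reasonably recent tm; `prep` is a hypothetical helper name, and the preprocessing only mirrors the steps used above (your methods may differ).

```r
library(tm)
data("crude")

# hypothetical helper: apply the same preprocessing to any corpus
prep <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)
}

train <- prep(crude[1:10])   # training documents
test  <- prep(crude[15:20])  # testing documents

train.dtm <- DocumentTermMatrix(train, control = list(wordLengths = c(3, 10)))

# build the test DTM against the training vocabulary only
test.dtm <- DocumentTermMatrix(test,
  control = list(dictionary = Terms(train.dtm), wordLengths = c(3, 10)))

# compare the column structure of the two matrices
print(identical(Terms(test.dtm), Terms(train.dtm)))
```

Terms that never occur in the test documents should then appear as all-zero columns, and `as.matrix(test.dtm)` can be handed to the model's predict step. If the training DTM was reduced further (for example by a frequency cut-off), pass the reduced term list as the dictionary instead.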
