How to recreate the same DocumentTermMatrix with new (test) data

Question

Suppose I have text-based training data and testing data. More specifically, I have two data sets, training and testing, and each has one column that contains text and is the column of interest for the job at hand.

I used the tm package in R to process the text column in the training data set. After removing white space, punctuation, and stop words, I stemmed the corpus and finally created a document-term matrix of 1-grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count greater than 50.
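For concreteness, the cut-off step described above might look like the following minimal sketch. It uses a small base R matrix with made-up counts in place of a real DTM; the object names (`counts`, `cutoff`, `counts.kept`) are illustrative only. On a real tm DocumentTermMatrix, `findFreqTerms(dtm, lowfreq = 51)` should give the corresponding term names.

```r
# toy document-term counts (rows = documents, columns = terms); values are made up
counts <- matrix(c(30, 2, 10,
                   25, 1, 45),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("oil", "opec", "price")))

cutoff <- 50

# keep only the terms whose total count across all documents exceeds the cut-off
keep <- colSums(counts) > cutoff
counts.kept <- counts[, keep, drop = FALSE]
```

The surviving column names are the vocabulary that any later test matrix has to reproduce.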

Following this, I train, say, a GLMNET model using the DTM and the dependent variable (which is present in the training data). Everything runs smoothly up to this point.

However, how do I proceed when I want to score/predict with the model on the testing data, or on any new data that might come in the future?

Specifically, what I am trying to find out is: how do I create the exact same DTM on new data?

If the new data set shares no words with the original training data, then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.

Any ideas/thoughts?

Recommended answer

If I understand correctly, you have made a DTM, and you want to make a new DTM from new documents that has the same columns (i.e. terms) as the first DTM. If that's the case, then it should be a matter of subsetting the second DTM by the terms in the first, perhaps something like this:

First set up some reproducible data...

This is your training data...

library(tm)
# make corpus for text mining (data comes from package, for reproducibility)
data("crude")
corpus1 <- crude[1:10]  # crude is already a corpus; for your own text use VCorpus(VectorSource(...))
# process text (your methods may differ)
# tm_reduce applies the list from right to left, so lower-casing runs first
# and stop-word removal runs last; tolower needs content_transformer()
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(skipWords, stripWhitespace, removeNumbers,
              removePunctuation, content_transformer(tolower))
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3, 10)))

And this is your testing data...

corpus2 <- crude[15:20]  # crude is already a corpus, so it can be subset directly
# process text with exactly the same steps as the training data
# (tm_reduce applies the list from right to left)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(skipWords, stripWhitespace, removeNumbers,
              removePunctuation, content_transformer(tolower))
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3, 10)))

Here is the part that does what you want:

Now we keep only the terms in the testing data that are present in the training data...

# convert to matrices for subsetting
crude1.dtm.mat <- as.matrix(crude1.dtm) # training
crude2.dtm.mat <- as.matrix(crude2.dtm) # testing

# subset testing data by the colnames (i.e. terms) of the training data;
# drop = FALSE keeps a matrix even if only one term is shared
xx <- data.frame(crude2.dtm.mat[, intersect(colnames(crude2.dtm.mat),
                                            colnames(crude1.dtm.mat)),
                                drop = FALSE])

Finally, add to the testing data an empty column for every term that is in the training data but not in the testing data...

# make an empty data frame with the colnames of the training data
yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
                 colClasses = "integer")

# add columns of NAs for terms absent in the
# testing data but present in the training data,
# following SchaunW's suggestion in the comments above
library(plyr)
zz <- rbind.fill(xx, yy)

So zz is a data frame of the testing documents, but it has the same structure as the training documents (i.e. the same columns, though many of them contain NA, as SchaunW notes).
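Since the question asks for zero counts rather than NA, and a fitted model usually expects the columns in the training order, one possible variation is to build the aligned matrix directly in base R. The sketch below is not part of the original answer; `align_to_training` is a hypothetical helper name and the counts are made up.

```r
# hypothetical helper: reshape a test count matrix to the training vocabulary
align_to_training <- function(test.mat, train.terms) {
  # start from an all-zero matrix with one column per training term,
  # in the same order as the training data
  out <- matrix(0, nrow = nrow(test.mat), ncol = length(train.terms),
                dimnames = list(rownames(test.mat), train.terms))
  # copy over the counts for terms that both data sets share
  shared <- intersect(train.terms, colnames(test.mat))
  if (length(shared) > 0) {
    out[, shared] <- test.mat[, shared, drop = FALSE]
  }
  out
}

# toy example with made-up counts
train.terms <- c("barrel", "oil", "price")
test.mat <- matrix(c(1, 4,
                     0, 2),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("doc1", "doc2"), c("oil", "tanker")))
aligned <- align_to_training(test.mat, train.terms)
```

With the objects built earlier, the equivalent call would be `align_to_training(crude2.dtm.mat, colnames(crude1.dtm.mat))`. Terms that appear only in the test data (here "tanker") are dropped, and terms missing from it get a count of zero.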

Is that what you want?
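As an alternative to subsetting after the fact, tm can restrict the vocabulary while the matrix is being built, through the `dictionary` entry of the `control` list. The sketch below is a different technique from the one in the answer, shown under the assumption that both corpora go through identical cleaning; `clean`, `train.dtm` and `test.dtm` are illustrative names, not taken from the original post.

```r
library(tm)
data("crude")

# hypothetical helper: the same cleaning steps must be applied to both corpora
clean <- function(corp) {
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  tm_map(corp, stripWhitespace)
}

# training DTM, built as usual
train.dtm <- DocumentTermMatrix(clean(crude[1:10]),
                                control = list(wordLengths = c(3, 10)))

# test DTM restricted to the training vocabulary: terms outside the
# dictionary are ignored, dictionary terms that never occur get zero counts
test.dtm <- DocumentTermMatrix(clean(crude[15:20]),
                               control = list(dictionary = Terms(train.dtm)))

# sanity check: both matrices should now share the same set of terms
stopifnot(setequal(Terms(test.dtm), Terms(train.dtm)))
```

If a frequency cut-off was applied to the training matrix, the reduced term list would be passed as the dictionary instead, so that the test matrix matches what the model was actually trained on.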
