Quanteda package, Naive Bayes: How can I predict on different-featured test data?


Question

I used quanteda::textmodel_NB to create a model that classifies text into one of two categories. I fit the model on a training data set collected last summer.

Now I am trying to use it this summer to classify new text we receive at work. When I tried, I got the following error:

Error in predict.textmodel_NB_fitted(model, test_dfm) : 
feature set in newdata different from that in training set

The code in the function that generates the error is found on lines 157 to 165.

I assume this occurs because the words in the training data set do not exactly match the words in the test data set. But why does this error occur? To be useful in real-world applications, I feel the model should be able to handle data sets that contain different features, since that will almost always be the case in applied use.

So my first question is:

1. Is this error a property of the Naive Bayes algorithm, or was it a choice made by the author of the function?

Which leads me to my second question:

2. How can I get around this issue?

To get at this second question, here is reproducible code (the last line generates the error above):

library(quanteda)
library(magrittr)
library(data.table)

# Inner double quotes are escaped so that each string parses as valid R
train_text <- c("Can random effects apply only to categorical variables?",
                "ANOVA expectation identity",
                "Statistical test for significance in ranking positions",
                "Is Fisher Sharp Null Hypothesis testable?",
                "List major reasons for different results from survival analysis among different studies",
                "How do the tenses and aspects in English correspond temporally to one another?",
                "Is there a correct gender-neutral singular pronoun (\"his\" vs. \"her\" vs. \"their\")?",
                "Are collective nouns always plural, or are certain ones singular?",
                "What's the rule for using \"who\" and \"whom\" correctly?",
                "When is a gerund supposed to be preceded by a possessive adjective/determiner?")

train_class <- factor(c(rep(0,5), rep(1,5)))

train_dfm <- train_text %>% 
  dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))

model <- textmodel_NB(train_dfm, train_class)

test_text <- c("Weighted Linear Regression with Proportional Standard Deviations in R",
               "What do significance tests for adjusted means tell us?",
               "How should I punctuate around quotes?",
               "Should I put a comma before the last item in a list?")

test_dfm <- test_text %>% 
  dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))

predict(model, test_dfm)

The only thing I could think of was to manually make the features the same (I assumed this would fill in 0 for features not present in an object), but this generated a new error. The code for the example above is:

model_features <- model$data$x@Dimnames$features # gets the features of the training data

test_features <- test_dfm@Dimnames$features # gets the features of the test data

all_features <- c(model_features, test_features) %>% # combining the two sets of features...
  subset(!duplicated(.)) # ...and getting rid of duplicate features

model$data$x@Dimnames$features <- test_dfm@Dimnames$features <- all_features # replacing features of model and test_dfm with all_features

predict(model, dfm) # new error?

However, this code generates a new error:

Error in if (ncol(object$PcGw) != ncol(newdata)) stop("feature set in newdata different from that in training set") : 
  argument is of length zero

How can I apply this Naive Bayes model to a new data set with different features?

Recommended answer

Fortunately, there is an easy way to do this: use dfm_select() on your test data to give it features identical to those of the training set (including their ordering). It's this simple:

test_dfm <- dfm_select(test_dfm, train_dfm)
predict(model, test_dfm)
## Predicted textmodel of type: Naive Bayes
## 
##             lp(0)       lp(1)     Pr(0)  Pr(1) Predicted
## text1  -0.6931472  -0.6931472    0.5000 0.5000         0
## text2 -11.8698712 -13.1879095    0.7889 0.2111         0
## text3  -4.1484118  -3.6635616    0.3811 0.6189         1
## text4  -8.0091415  -8.4257356    0.6027 0.3973         0
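
To see why the features have to line up, it may help to look at what a multinomial Naive Bayes prediction does with the counts. The sketch below is plain base R, not quanteda's actual implementation: align_features() is a hypothetical helper, and the likelihoods and counts are made-up toy values. It pads training features that are missing from the test data with zeros, drops test features the model never saw (the model has no likelihood for them), and puts the columns in the training order, which is in effect what the dfm_select() call above does. The class scores are then a matrix product, and that only makes sense when the columns match.

```r
# Hypothetical helper (not part of quanteda): align a test count matrix
# to the feature set and feature order of the training data.
align_features <- function(test_counts, train_features) {
  # Start from all zeros so training features absent from the test data get count 0
  aligned <- matrix(0,
                    nrow = nrow(test_counts),
                    ncol = length(train_features),
                    dimnames = list(rownames(test_counts), train_features))
  # Copy over only the features both sets share; unseen test features are dropped
  shared <- intersect(train_features, colnames(test_counts))
  aligned[, shared] <- test_counts[, shared, drop = FALSE]
  aligned
}

# Toy model: P(word | class) for 2 classes and 3 training features (made-up values)
train_features <- c("test", "signific", "comma")
log_lik <- log(rbind("0" = c(0.5, 0.4, 0.1),
                     "1" = c(0.2, 0.1, 0.7)))
colnames(log_lik) <- train_features
log_prior <- log(c("0" = 0.5, "1" = 0.5))

# Toy test counts with a different feature set and a different column order
test_counts <- rbind(text1 = c(comma = 1, quot = 2, test = 0),
                     text2 = c(comma = 0, quot = 0, test = 3))

aligned <- align_features(test_counts, train_features)

# Log-posterior scores up to a constant: counts %*% log-likelihoods + log prior
scores <- sweep(aligned %*% t(log_lik), 2, log_prior, "+")
predicted <- colnames(scores)[max.col(scores)]
```

As for the first question: the model only holds parameters for the training features, so some alignment is required by the arithmetic itself, while stopping with an error instead of aligning automatically looks like an implementation choice. In more recent quanteda releases, as far as I know, dfm_match() is the dedicated function for this alignment, and the Naive Bayes model lives in the quanteda.textmodels package as textmodel_nb(); check the documentation for your installed version.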
