Naive Bayes in quanteda vs caret: wildly different results
Question
I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in naive Bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right.
Here is some code for reproduction. First on the quanteda side:
# Written against an older quanteda release; since quanteda 2.0,
# textmodel_nb() lives in the quanteda.textmodels package.
library(quanteda)
library(quanteda.corpora)
library(caret)

corp <- data_corpus_movies
set.seed(300)
id_train <- sample(docnames(corp), size = 1500, replace = FALSE)

# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE)

# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE) %>%
    dfm_select(pattern = training_dfm,
               selection = "keep")

# train model on sentiment
nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))

# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(nb_quanteda, newdata = test_dfm)
class_table_quanteda <- table(actual_class, predicted_class)
class_table_quanteda
#>             predicted_class
#> actual_class neg pos
#>          neg 202  47
#>          pos  49 202
Not bad. The accuracy is 80.8% without tuning. Now the same (as far as I know) in caret:
training_m <- convert(training_dfm, to = "matrix")
test_m <- convert(test_dfm, to = "matrix")

nb_caret <- train(x = training_m,
                  y = as.factor(docvars(training_dfm, "Sentiment")),
                  method = "naive_bayes",
                  trControl = trainControl(method = "none"),
                  tuneGrid = data.frame(laplace = 1,
                                        usekernel = FALSE,
                                        adjust = FALSE),
                  verbose = TRUE)

predicted_class_caret <- predict(nb_caret, newdata = test_m)
class_table_caret <- table(actual_class, predicted_class_caret)
class_table_caret
#>             predicted_class_caret
#> actual_class neg pos
#>          neg 246   3
#>          pos 249   2
Not only is the accuracy abysmal here (49.6%, roughly chance), the pos class is hardly ever predicted at all! So I'm pretty sure I'm missing something crucial, as I would assume the implementations should be fairly similar, but I'm not sure what.
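For reference, both accuracy figures follow directly from the two confusion tables above. The snippet below is only a small check in base R. The tables are retyped by hand from the printed output, and the helper name accuracy() is made up for illustration.

```r
# Confusion tables retyped from the printed output (rows = actual, cols = predicted).
labels <- c("neg", "pos")
quanteda_tab <- matrix(c(202, 49, 47, 202), nrow = 2,
                       dimnames = list(actual = labels, predicted = labels))
caret_tab <- matrix(c(246, 249, 3, 2), nrow = 2,
                    dimnames = list(actual = labels, predicted = labels))

# Hypothetical helper: share of documents on the diagonal (correctly classified).
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

acc_quanteda <- accuracy(quanteda_tab)  # 404 / 500
acc_caret <- accuracy(caret_tab)        # 248 / 500

# Share of actual "pos" documents that were predicted as "pos" by caret.
pos_recall_caret <- caret_tab["pos", "pos"] / sum(caret_tab["pos", ])

print(c(quanteda = acc_quanteda, caret = acc_caret, caret_pos_recall = pos_recall_caret))
```

The last figure makes the second complaint concrete: only 2 of the 251 positive test documents are labelled pos.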
I already looked at the source code for the quanteda function (hoping that it might be built on caret or the underlying package anyway) and saw that there is some weighting and smoothing going on. If I apply the same to my dfm before training (setting laplace = 0 later on), accuracy is a bit better, but still only 53%.
Recommended answer
The answer is that caret (which uses naive_bayes() from the naivebayes package) assumes a Gaussian distribution for the features, whereas quanteda::textmodel_nb() is based on a more text-appropriate multinomial distribution (with the option of a Bernoulli distribution as well).
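To make the difference concrete, here is a minimal base R sketch of the two likelihood models on a made-up document-feature matrix. It is not how either package is implemented internally. The toy counts, the term names, and the helper names multinomial_loglik() and gaussian_loglik() are all assumptions for illustration.

```r
# Toy document-feature matrix: rows are documents, columns are term counts.
x <- matrix(c(2, 0, 1,
              3, 1, 0,
              1, 2, 2,
              0, 3, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("d", 1:4), c("good", "bad", "film")))
y <- factor(c("pos", "pos", "neg", "neg"))
new_doc <- c(good = 1, bad = 0, film = 1)

# Multinomial view: one Laplace-smoothed term distribution per class;
# a document is scored by the terms it contains, weighted by their counts.
multinomial_loglik <- function(cls) {
  term_totals <- colSums(x[y == cls, , drop = FALSE])
  theta <- (term_totals + 1) / (sum(term_totals) + ncol(x))
  sum(new_doc * log(theta))
}

# Gaussian view: each count is treated as a continuous variable with a
# per-class mean and standard deviation, including the zero counts.
gaussian_loglik <- function(cls) {
  rows <- x[y == cls, , drop = FALSE]
  mu <- colMeans(rows)
  sigma <- apply(rows, 2, sd)
  sum(dnorm(new_doc, mean = mu, sd = sigma, log = TRUE))
}

loglik <- sapply(levels(y), function(cls) {
  c(multinomial = multinomial_loglik(cls), gaussian = gaussian_loglik(cls))
})
print(loglik)
```

Under the multinomial model, absent terms contribute nothing to the score. Under the Gaussian model, every feature contributes a density value, so on a real dfm the thousands of mostly-zero columns dominate the result.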
The documentation for textmodel_nb() replicates the example from the IIR book (Manning, Raghavan, and Schütze 2008), and a further example from Jurafsky and Martin (2018) is also referenced. See:
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press (Chapter 13). https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf
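The worked example from Chapter 13 of the IIR book is small enough to reproduce by hand. The following is a sketch in base R of multinomial naive Bayes with add-one smoothing on that example. It does not call textmodel_nb(), and the class labels "yes"/"no" and the helper names tokenize() and score_class() are my own choices.

```r
# Training documents from the IIR example; "yes" means the document is about China.
train_docs <- c("Chinese Beijing Chinese",
                "Chinese Chinese Shanghai",
                "Chinese Macao",
                "Tokyo Japan Chinese")
train_class <- c("yes", "yes", "yes", "no")
test_doc <- "Chinese Chinese Chinese Tokyo Japan"

tokenize <- function(txt) strsplit(txt, " ", fixed = TRUE)[[1]]
vocab <- unique(unlist(lapply(train_docs, tokenize)))

# Unnormalised posterior: class prior times the product of smoothed term probabilities.
score_class <- function(cls) {
  in_class <- train_class == cls
  class_tokens <- unlist(lapply(train_docs[in_class], tokenize))
  counts <- setNames(as.numeric(table(factor(class_tokens, levels = vocab))), vocab)
  theta <- (counts + 1) / (length(class_tokens) + length(vocab))
  mean(in_class) * prod(theta[tokenize(test_doc)])
}

scores <- sapply(c(yes = "yes", no = "no"), score_class)
print(scores)
print(names(which.max(scores)))
```

The score for "yes" is 3/4 * (3/7)^3 * (1/14)^2 and the score for "no" is 1/4 * (2/9)^5, so the test document is assigned to "yes", as in the book.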
Another package, e1071, produces the same results you found, as it is also based on a Gaussian distribution.
library("e1071")
nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
table(actual_class, nb_e1071_pred)
##             nb_e1071_pred
## actual_class neg pos
##          neg 246   3
##          pos 249   2
However, both caret and e1071 work on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach, which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) classifier performance, it should be pretty clear which one is preferred!
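As a rough illustration of the dense versus sparse point, the sketch below compares the memory footprint of a sparse count matrix with its dense equivalent. It uses the Matrix package as a stand-in for a dfm, and the dimensions and fill rate are made-up numbers, not those of the movie review corpus.

```r
library(Matrix)

set.seed(1)
n_docs <- 500       # hypothetical number of documents
n_feats <- 2000     # hypothetical vocabulary size
n_nonzero <- 10000  # hypothetical number of term occurrences

# Sparse storage keeps only the non-zero counts; repeated (i, j) pairs are summed.
sparse_m <- sparseMatrix(i = sample(n_docs, n_nonzero, replace = TRUE),
                         j = sample(n_feats, n_nonzero, replace = TRUE),
                         x = rep(1, n_nonzero),
                         dims = c(n_docs, n_feats))

# Dense storage materialises every cell, zeros included.
dense_m <- as.matrix(sparse_m)

sizes <- c(sparse = as.numeric(object.size(sparse_m)),
           dense = as.numeric(object.size(dense_m)))
print(sizes)
print(sizes[["dense"]] / sizes[["sparse"]])
```

Any routine that loops over every cell of the dense matrix pays for all the zeros as well, which is consistent with the timings that follow.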
library("rbenchmark")
benchmark(
    quanteda = {
        nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
        predicted_class <- predict(nb_quanteda, newdata = test_dfm)
    },
    caret = {
        nb_caret <- train(x = training_m,
                          y = as.factor(docvars(training_dfm, "Sentiment")),
                          method = "naive_bayes",
                          trControl = trainControl(method = "none"),
                          tuneGrid = data.frame(laplace = 1,
                                                usekernel = FALSE,
                                                adjust = FALSE),
                          verbose = FALSE)
        predicted_class_caret <- predict(nb_caret, newdata = test_m)
    },
    e1071 = {
        nb_e1071 <- naiveBayes(x = training_m,
                               y = as.factor(docvars(training_dfm, "Sentiment")))
        nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
    },
    replications = 1
)
##       test replications elapsed relative user.self sys.self user.child sys.child
## 2    caret            1  29.042  123.583    25.896    3.095          0         0
## 3    e1071            1 217.177  924.157   215.587    1.169          0         0
## 1 quanteda            1   0.235    1.000     0.213    0.023          0         0