Quanteda 与插入符号中的朴素贝叶斯:结果截然不同 [英] Naive Bayes in Quanteda vs caret: wildly different results

查看:40
本文介绍了Quanteda 与插入符号中的朴素贝叶斯:结果截然不同的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将 quantedacaret 包结合使用,以根据训练样本对文本进行分类.作为测试运行,我想将 quanteda 的内置朴素贝叶斯分类器与 caret 中的分类器进行比较.但是,我似乎无法让 caret 正常工作.

I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the build-in naive bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right.

这是一些复制代码.首先在 quanteda 端:

Here is some code for reproduction. First on the quanteda side:

library(quanteda)
library(quanteda.corpora)
library(caret)
corp <- data_corpus_movies
set.seed(300)
id_train <- sample(docnames(corp), size = 1500, replace = FALSE)

# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE)

# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE) %>% 
  dfm_select(pattern = training_dfm, 
             selection = "keep")

# train model on sentiment
nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))

# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(nb_quanteda, newdata = test_dfm)
class_table_quanteda <- table(actual_class, predicted_class)
class_table_quanteda
#>             predicted_class
#> actual_class neg pos
#>          neg 202  47
#>          pos  49 202

还不错.无需调谐,准确度为 80.8%.现在在 caret

Not bad. The accuracy is 80.8% percent without tuning. Now the same (as far as I know) in caret

training_m <- convert(training_dfm, to = "matrix")
test_m <- convert(test_dfm, to = "matrix")
nb_caret <- train(x = training_m,
                  y = as.factor(docvars(training_dfm, "Sentiment")),
                  method = "naive_bayes",
                  trControl = trainControl(method = "none"),
                  tuneGrid = data.frame(laplace = 1,
                                        usekernel = FALSE,
                                        adjust = FALSE),
                  verbose = TRUE)

predicted_class_caret <- predict(nb_caret, newdata = test_m)
class_table_caret <- table(actual_class, predicted_class_caret)
class_table_caret
#>             predicted_class_caret
#> actual_class neg pos
#>          neg 246   3
#>          pos 249   2

不仅这里的准确率很糟糕(49.6% - 大概是概率),pos 类几乎没有被预测到!所以我很确定我在这里遗漏了一些重要的东西,因为我认为实现应该非常相似,但不确定是什么.

Not only is the accuracy abysmal here (49.6% - roughly chance), the pos class is hardly ever predicted at all! So I'm pretty sure I'm missing something crucial here, as I would assume the implementations should be fairly similar, but not sure what.

我已经查看了 quanteda 函数的源代码(希望它可以构建在 caret 或底层包之上),并看到有一些权重并进行平滑处理.如果我在训练前将同样的方法应用于我的 dfm(稍后设置 laplace = 0),准确性会好一些.但也只有 53%.

I already looked at the source code for the quanteda function (hoping that it might be built on caret or the underlying package anyway) and saw that there is some weighting and smoothing going on. If I apply the same to my dfm before training (setting laplace = 0 later on), accuracy is a bit better. Yet also only 53%.

推荐答案

答案是caret(使用naivebayes中的naive_bayespackage) 假设高斯分布,而 quanteda::textmodel_nb() 基于更适合文本的多项分布(也可以选择伯努利分布).

The answer is that caret (which uses naive_bayes from the naivebayes package) assumes a Gaussian distribution, whereas quanteda::textmodel_nb() is based on a more text-appropriate multinomial distribution (with the option of a Bernoulli distribution as well).

textmodel_nb() 的文档复制了 IIR 一书中的示例(Manning、Raghavan 和 Schütze 2008)以及 Jurafsky 和 ​​Martin(2018)的另一个示例也有参考.见:

The documentation for textmodel_nb() replicates the example from the IIR book (Manning, Raghavan, and Schütze 2008) and a further example from Jurafsky and Martin (2018) is also referenced. See:

Jurafsky、Daniel 和 James H. Martin.2018. 语音和语言处理.自然语言处理、计算语言学和语音识别简介.第三版草案,2018 年 9 月 23 日(第 4 章).https://web.stanford.edu/~jurafsky/slp3/4.pdf

Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf

另一个包 e1071 产生与您发现的相同的结果,因为它也是基于高斯分布.

Another package, e1071, produces the same results you found as it is also based on a Gaussian distribution.

library("e1071")
nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
table(actual_class, nb_e1071_pred)
##             nb_e1071_pred
## actual_class neg pos
##          neg 246   3
##          pos 249   2

然而,carete1071 都适用于密集矩阵,这也是它们与 quanteda 方法相比速度如此缓慢的原因之一它在稀疏 dfm 上运行.因此,从分类器的适当性、效率和(根据您的结果)性能的角度来看,应该很清楚哪个是首选!

However both caret and e1071 work on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) the performance of the classifier, it should be pretty clear which one is preferred!

library("rbenchmark")
benchmark(
    quanteda = { 
        nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
        predicted_class <- predict(nb_quanteda, newdata = test_dfm)
    },
    caret = {
        nb_caret <- train(x = training_m,
                          y = as.factor(docvars(training_dfm, "Sentiment")),
                          method = "naive_bayes",
                          trControl = trainControl(method = "none"),
                          tuneGrid = data.frame(laplace = 1,
                                                usekernel = FALSE,
                                                adjust = FALSE),
                          verbose = FALSE)
        predicted_class_caret <- predict(nb_caret, newdata = test_m)
    },
    e1071 = {
        nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
        nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
    },
    replications = 1
)
##       test replications elapsed relative user.self sys.self user.child sys.child
## 2    caret            1  29.042  123.583    25.896    3.095          0         0
## 3    e1071            1 217.177  924.157   215.587    1.169          0         0
## 1 quanteda            1   0.235    1.000     0.213    0.023          0         0

这篇关于Quanteda 与插入符号中的朴素贝叶斯:结果截然不同的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆