Naive Bayes classifier bases decision only on a-priori probabilities


Question

I'm trying to classify tweets according to their sentiment into three categories (Buy, Hold, Sell). I'm using R and the package e1071.

I have two data frames: one training set and one set of new tweets whose sentiment needs to be predicted.

trainingset data frame:

   text                                 | sentiment
   -------------------------------------+----------
   this stock is a good buy             | Buy
   markets crash in tokyo               | Sell
   everybody excited about new products | Hold

Now I want to train the model using the tweet text trainingset[,2] and the sentiment category trainingset[,4].

classifier<-naiveBayes(trainingset[,2],as.factor(trainingset[,4]), laplace=1)

Using

classifier$tables$x

I find that the conditional probabilities are calculated. There are different probabilities for every tweet for Buy, Hold, and Sell. So far so good.

However, when I predict the training set with:

predict(classifier, trainingset[,2], type="raw")

I get a classification that is based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet gets the same probabilities for Buy, Hold, and Sell:

   Id | Buy  | Hold | Sell
   ---+------+------+-----
   1  | 0.25 | 0.5  | 0.25
   2  | 0.25 | 0.5  | 0.25
   3  | 0.25 | 0.5  | 0.25
   .. | .... | .... | ....
   N  | 0.25 | 0.5  | 0.25

Any ideas what I'm doing wrong? I appreciate your help!

Thanks

Recommended answer

It looks like you trained the model using whole sentences as inputs, whereas you probably want to use individual words as your input features.

Usage:

## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)


## S3 method for class 'naiveBayes'
predict(object, newdata,
  type = c("class", "raw"), threshold = 0.001, ...)

Arguments:

  x: A numeric matrix, or a data frame of categorical and/or
     numeric variables.

  y: Class vector.

In particular, if you train naiveBayes this way:

x <- c("john likes cake", "marry likes cats and john")
y <- as.factor(c("good", "bad")) 
bayes<-naiveBayes( x,y )

you get a classifier able to recognize just these two sentences:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.5  0.5 

Conditional probabilities:
      x
y      john likes cake marry likes cats and john
  bad                0                         1
  good               1                         0
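
To see why this happens, it can help to look at how R treats the sentence vector as a categorical variable. The following is a minimal base R sketch (it does not call e1071, and the test sentence "john likes cats" is a made-up example): each whole sentence becomes a single level, so a sentence that was not in the training data matches no level at all.

```r
# Same toy sentences as in the answer above
x <- c("john likes cake", "marry likes cats and john")

# Treated as a categorical variable, each whole sentence is one level
f <- factor(x)
print(levels(f))

# A new sentence built from known words is still an unknown level
is_known <- "john likes cats" %in% levels(f)
print(is_known)
```

With no matching level, there is no conditional probability for the classifier to use for that input.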

To get a word-level classifier, you need to run it with words as inputs:

x <-             c("john","likes","cake","marry","likes","cats","and","john")
y <- as.factor( c("good","good", "good","bad",  "bad",  "bad", "bad","bad") )
bayes<-naiveBayes( x,y )

You get:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.625 0.375 

Conditional probabilities:
      x
y            and      cake      cats      john     likes     marry
  bad  0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
  good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
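
The numbers in this output can be reproduced by hand, which makes it clearer what the model stores. This is a sketch in base R only, assuming no Laplace smoothing (laplace = 0); it illustrates the arithmetic and is not e1071's actual implementation.

```r
# Word-level training data from the answer above
x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
y <- factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))

# A-priori probabilities: share of each class among the training words
priors <- prop.table(table(y))

# Conditional probabilities: word counts per class, normalised within each class (row)
cond <- prop.table(table(y, x), margin = 1)

print(priors)
print(cond)
```

Each row of cond sums to 1, and a word that never occurs in a class gets probability 0, which is the situation Laplace smoothing is meant to soften.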

In general, R is not well suited to processing NLP data; Python (or at least Java) would be a much better choice.

To convert a sentence into words, you can use the strsplit function:

unlist(strsplit("john likes cake"," "))
[1] "john"  "likes" "cake" 
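
Putting the pieces together, here is one possible sketch of going from labelled sentences to a word-level model and back to a sentence-level decision, using base R only. The helper score_sentence is hypothetical (not part of e1071), the add-one smoothing is an assumption modelled on the laplace = 1 setting in the question, and the toy data is the same as above.

```r
# Toy labelled sentences (same as the example above)
sentences <- c("john likes cake", "marry likes cats and john")
labels    <- c("good", "bad")

# Split on single spaces: one character vector of words per sentence
tokens <- strsplit(sentences, " ", fixed = TRUE)

# Flatten the words and repeat each label once per word of its sentence
x <- unlist(tokens)
y <- factor(rep(labels, times = lengths(tokens)))

# Per-class word probabilities with add-one smoothing (an assumption),
# so an unseen word/class pair does not force the product to zero
counts <- table(y, x)
cond   <- (counts + 1) / (rowSums(counts) + ncol(counts))
priors <- prop.table(table(y))

# Hypothetical helper: unnormalised log-posterior per class for one sentence
score_sentence <- function(sentence) {
  words <- unlist(strsplit(sentence, " ", fixed = TRUE))
  words <- words[words %in% colnames(cond)]  # skip out-of-vocabulary words
  log(as.numeric(priors)) + rowSums(log(cond[, words, drop = FALSE]))
}

scores    <- score_sentence("john likes cake")
predicted <- names(scores)[which.max(scores)]
print(predicted)
```

On this toy data the sentence "john likes cake" scores highest for "good". Real tweets would also need lowercasing, punctuation removal, and similar cleanup before splitting, which this sketch leaves out.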
