Naive-Bayes multinomial text classifier using DataFrame in Scala Spark


Problem description

I am trying to build a NaiveBayes classifier, loading the data from database as DataFrame which contains (label, text). Here's the sample of data (multinomial label):

+-----+--------------------+
|label|             feature|
+-----+--------------------+
|    1|combusting prepar...|
|    1|adhesives for ind...|
|    1|                    |
|    1| salt for preserving|
|    1|auxiliary fluids ...|
+-----+--------------------+

I have used the following transformations for tokenization, stop-word removal, n-grams, and HashingTF:

import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer, StopWordsRemover, NGram, HashingTF}

val selectedData = df.select("label", "feature")

// Tokenize the text column
val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
val regexTokenizer = new RegexTokenizer().setInputCol("feature").setOutputCol("words").setPattern("\\W")
val tokenized = tokenizer.transform(selectedData)
tokenized.select("words", "label").take(3).foreach(println)

// Remove stop words
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val parsedData = remover.transform(tokenized)

// N-grams
val ngram = new NGram().setInputCol("filtered").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(parsedData)
ngramDataFrame.take(3).map(_.getAs[Seq[String]]("ngrams").toList).foreach(println)

// Hashing term frequency
val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("hash").setNumFeatures(1000)
val featurizedData = hashingTF.transform(ngramDataFrame)

Output of the transformations:

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|label|             feature|               words|            filtered|              ngrams|                hash|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|combusting prepar...|[combusting, prep...|[combusting, prep...|[combusting prepa...|(1000,[124,161,69...|
|    1|adhesives for ind...|[adhesives, for, ...|[adhesives, indus...|[adhesives indust...|(1000,[451,604],[...|
|    1|                    |                  []|                  []|                  []|        (1000,[],[])|
|    1| salt for preserving|[salt, for, prese...|  [salt, preserving]|   [salt preserving]|  (1000,[675],[1.0])|
|    1|auxiliary fluids ...|[auxiliary, fluid...|[auxiliary, fluid...|[auxiliary fluids...|(1000,[661,696,89...|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+

To build a Naive Bayes model, I need to convert the label and feature columns into LabeledPoints. I have tried the following approaches to convert the DataFrame into an RDD and create LabeledPoints:

// Attempt 1
val rddData = featurizedData.select("label", "hash").rdd

val trainData = rddData.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0), parts(1))
}

// Attempt 2
val rddData = featurizedData.select("label", "hash").rdd.map(r =>
  (Try(r(0).asInstanceOf[Integer]).get.toDouble,
   Try(r(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))

val trainData = rddData.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
}

I am getting the following error:

scala> val trainData = rddData.map { line =>
     |   val parts = line.split(',')
     |   LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
     | }
<console>:67: error: value split is not a member of (Double, org.apache.spark.mllib.linalg.SparseVector)
         val parts = line.split(',')
                          ^
<console>:68: error: not found: value Vectors
         LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))

Edit 1:

As per the suggestion below, I have created the LabeledPoints and trained the model:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import scala.util.Try

val trainData = featurizedData.select("label", "features")

// Convert each Row into a LabeledPoint(label, sparse feature vector)
val trainLabel = trainData.map(line =>
  LabeledPoint(Try(line(0).asInstanceOf[Integer]).get.toDouble,
    Try(line(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))

val splits = trainLabel.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
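
The accuracy figure mentioned below can be computed from predictionAndLabels, for example with MulticlassMetrics (an illustrative sketch, assuming the mllib evaluation API):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// predictionAndLabels is an RDD[(Double, Double)] of (prediction, label) pairs
val metrics = new MulticlassMetrics(predictionAndLabels)
println(s"Accuracy = ${metrics.accuracy}")   // use metrics.precision on Spark 1.x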

I am getting low accuracy, around 40%, both with and without N-grams and with different numbers of hash features. My dataset contains 5000 rows and 45 multinomial labels. Is there any way to improve the model performance? Thanks in advance.

Recommended answer

You don't need to transform your featurizedData into an RDD, because Apache Spark has two libraries, ML and MLlib: the first works with DataFrames, whereas MLlib works with RDDs. Therefore, you can use ML, because you already have a DataFrame.

To achieve this, you just need to rename your columns to (label, features) and fit your model, as shown in the NaiveBayes example below.

from pyspark.sql import Row
from pyspark.ml.linalg import Vectors          # pyspark.mllib.linalg on Spark 1.x
from pyspark.ml.classification import NaiveBayes

df = sqlContext.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(df)
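
A rough Scala equivalent of the same idea on the featurizedData DataFrame from the question (a sketch only, assuming the Spark 2.x ml API; renaming the hash column and casting the label are just one way to produce the expected label/features columns):

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.sql.functions.col

// Rename the hashed column to the default "features" name and make sure
// the label is a Double, as the ML estimators expect.
val mlData = featurizedData
  .withColumnRenamed("hash", "features")
  .withColumn("label", col("label").cast("double"))
  .select("label", "features")

val Array(training, test) = mlData.randomSplit(Array(0.8, 0.2), seed = 11L)

val nb = new NaiveBayes().setSmoothing(1.0).setModelType("multinomial")
val model = nb.fit(training)

// transform() appends "prediction" and "probability" columns to the test set
val predictions = model.transform(test)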

About the error you get: it happens because you already have a SparseVector, and that class doesn't have a split method. Thinking about this some more, your RDD almost has the structure you actually require; you only have to convert the Tuple into a LabeledPoint.
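
For completeness, a minimal sketch of that conversion on the (Double, SparseVector) tuple RDD from the question, assuming the mllib LabeledPoint/NaiveBayes APIs:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint

// Each (label, vector) tuple maps directly onto a LabeledPoint;
// no string splitting is needed.
val labeled = rddData.map { case (label, vector) => LabeledPoint(label, vector) }

val model = NaiveBayes.train(labeled, lambda = 1.0, modelType = "multinomial")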

There are some techniques to improve the performance. The first one that comes to my mind is to remove stopwords (e.g. the, a, an, to, although, etc.); the second is to count the number of different words in your texts and then construct the vectors manually, because if the hashing space is small, different words can end up with the same hash, hence the poor performance.
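
One concrete way to apply the second suggestion is to swap HashingTF for CountVectorizer, which learns an exact vocabulary instead of hashing; a sketch under that assumption (the vocabSize and minDF values are only illustrative):

import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// CountVectorizer builds the real vocabulary from the n-grams, so distinct
// terms can never collide the way they can in a small hash space.
val countVectorizer = new CountVectorizer()
  .setInputCol("ngrams")
  .setOutputCol("rawFeatures")
  .setVocabSize(20000)  // illustrative cap on vocabulary size
  .setMinDF(2)          // drop n-grams seen in fewer than 2 documents

val cvModel = countVectorizer.fit(ngramDataFrame)
val counted = cvModel.transform(ngramDataFrame)

// Optionally re-weight the raw counts with IDF before fitting the classifier.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val rescaled = idf.fit(counted).transform(counted)

The trade-off versus HashingTF is memory: CountVectorizer has to materialize the vocabulary, which is usually not a problem for a few thousand documents.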
