How to prepare training data in mllib


Question

TL;DR; How do I use mllib to train my wiki data (text & category) for prediction against tweets?

I have trouble figuring out how to convert my tokenized wiki data so that it can be trained through either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried using pipelines with LR and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried:

*Note that I would like to use the many categories in the wiki data for my labels... I've only seen binary classification (it's one category or another)... is it possible to do what I want?

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer}
import org.apache.spark.mllib.linalg.Vector

case class WikiData(category: String, text: String)
case class LabeledData(category: String, text: String, label: Double)

val wikiData = sc.parallelize(List(WikiData("Spark", "this is about spark"), WikiData("Hadoop","then there is hadoop")))

val categoryMap = wikiData.map(x=>x.category).distinct.zipWithIndex.mapValues(x=>x.toDouble/1000).collectAsMap

val labeledData = wikiData.map(x=>LabeledData(x.category, x.text, categoryMap.get(x.category).getOrElse(0.0))).toDF

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("/W+")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(labeledData)

model.transform(labeledData).show

Naive Bayes

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documentsAsWordSequenceAlready)

tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

// alternative: ignore terms that appear in fewer than 2 documents
tf.cache()
val idf2 = new IDF(minDocFreq = 2).fit(tf)
val tfidf2: RDD[Vector] = idf2.transform(tf)

// to create tfidfLabeled (below) I ran a map to set the labels...but again it seems to have to be 1.0 or 0.0?

NaiveBayes.train(tfidfLabeled)
  .predict(hashingTF.transform(tweet))
  .collect

Answer

ML LogisticRegression doesn't support multinomial classification yet, but it is supported by both MLlib NaiveBayes and LogisticRegressionWithLBFGS. In the first case it should work by default:

import org.apache.spark.mllib.classification.NaiveBayes

// `train` is assumed to be an RDD[LabeledPoint]
val nbModel = new NaiveBayes()
  .setModelType("multinomial") // this is the default value
  .run(train)

but for logistic regression you should provide the number of classes:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(n) // Set number of classes
  .run(trainingData)
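
If the number of classes isn't known up front, one option (an addition, not from the original answer) is to derive it from the distinct labels:

// Not in the original answer: derive the class count from the labels themselves.
// Assumes trainingData: RDD[LabeledPoint] with labels 0.0, 1.0, ..., (n-1).0
val n = trainingData.map(_.label).distinct.count.toInt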

Regarding the preprocessing steps, it is quite a broad topic and it is hard to give meaningful advice without access to your data, so everything you find below is just a wild guess:

  • as far as I understand, you use wiki data for training and tweets for testing. If that's true it is, generally speaking, a bad idea: you can expect the two sets to use significantly different vocabulary, grammar and spelling
  • a simple regex tokenizer can perform pretty well on standardized text, but from my experience it won't work well on informal text like tweets
  • HashingTF can be a good way to obtain a baseline model, but it is an extremely simplified approach, especially if you don't apply any filtering steps. If you decide to use it you should at least increase the number of features or use the default value (2^20); see the sketch after this list
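
For instance, a minimal sketch of that last bullet, assuming the ML pipeline setup from the question (StopWordsRemover, available since Spark 1.5, is just one possible filtering step, and the column names are illustrative):

import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer, StopWordsRemover}

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("rawWords")
  .setPattern("\\W+")

// One possible filtering step: drop common English stop words before hashing
val remover = new StopWordsRemover()
  .setInputCol("rawWords")
  .setOutputCol("words")

// Keep the default-sized feature space (2^20) instead of 1000 to reduce hash collisions
val hashingTF = new HashingTF()
  .setNumFeatures(1 << 20)
  .setInputCol(remover.getOutputCol)
  .setOutputCol("features")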

EDIT (Preparing data for Naive Bayes with IDF)

  • Using an ML Pipeline:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.IDF
import org.apache.spark.sql.Row

val tokenizer = ???

val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("rawFeatures")

val idf = new IDF()
  .setInputCol(hashingTF.getOutputCol)
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf))
val model = pipeline.fit(labeledData)

model
 .transform(labeledData)
 .select($"label", $"features")
 .map{case Row(label: Double, features: Vector) => LabeledPoint(label, features)}

  • Using MLlib transformers:
import org.apache.spark.mllib.feature.{HashingTF, IDF, IDFModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

val labeledData = wikiData.map(x =>
  LabeledData(x.category, x.text, categoryMap.get(x.category).getOrElse(0.0)))

val p = "\\W+".r
val raw = labeledData.map{
  case LabeledData(_, text, label) => (label, p.split(text))}

val hashingTF: org.apache.spark.mllib.feature.HashingTF = new HashingTF(1000)
val tf = raw.map{case (label, text) => (label, hashingTF.transform(text))}

val idf: org.apache.spark.mllib.feature.IDFModel = new IDF().fit(tf.map(_._2))
tf.map{
  case (label, rawFeatures) => LabeledPoint(label, idf.transform(rawFeatures))}
    

Note: Since transformers require JVM access, the MLlib version won't work in PySpark. If you prefer Python you have to split data transform and zip.
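
Either way, the resulting RDD[LabeledPoint] can then be fed to MLlib NaiveBayes. A minimal sketch (the three-argument train overload exists in Spark 1.4+; `trainingPoints` is a placeholder name):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// `trainingPoints` stands for the RDD[LabeledPoint] produced by either variant above
val trainingPoints: RDD[LabeledPoint] = ???

// lambda = 1.0, modelType = "multinomial"
val nbModel = NaiveBayes.train(trainingPoints, 1.0, "multinomial")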

EDIT (Preparing data for ML algorithms):

While the following piece of code looks valid at first glance

val categoryMap = wikiData
  .map(x=>x.category)
  .distinct
  .zipWithIndex
  .mapValues(x=>x.toDouble/1000)
  .collectAsMap

val labeledData = wikiData.map(x=>LabeledData(
    x.category, x.text, categoryMap.get(x.category).getOrElse(0.0))).toDF
    

it won't generate valid labels for ML algorithms.

First of all, ML expects labels to be in (0.0, 1.0, ..., n.0), where n is the number of classes. If, as in your example pipeline, one of the classes gets the label 0.001, you'll get an error like this:

ERROR LogisticRegression: Classification labels should be in {0 to 0 Found 1 invalid labels.

The obvious solution is to avoid the division when you generate the mapping:

.mapValues(x=>x.toDouble)
    

While it will work for LogisticRegression, other ML algorithms will still fail. For example, with RandomForestClassifier you'll get

RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

What is interesting, the ML version of RandomForestClassifier, unlike its MLlib counterpart, doesn't provide a method to set the number of classes. It turns out that it expects special attributes to be set on the DataFrame column. The simplest approach is to use the StringIndexer mentioned in the error message:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val pipeline = new Pipeline()
  .setStages(Array(indexer, tokenizer, hashingTF, idf, lr))

val model = pipeline.fit(wikiData.toDF)
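
As a follow-up (not part of the original answer), the numeric predictions can be mapped back to the original category strings with IndexToString, assuming Spark 1.5+:

import org.apache.spark.ml.feature.{IndexToString, StringIndexerModel}

// The fitted StringIndexerModel is the first stage of the fitted pipeline
val labels = model.stages(0).asInstanceOf[StringIndexerModel].labels

val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedCategory")
  .setLabels(labels)

converter.transform(model.transform(wikiData.toDF)).show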
    

