How to prepare training data in mllib

Problem Description

TL;DR; How do I use mllib to train my wiki data (text & category) for prediction against tweets?

I am VERY new when it comes to machine learning, so I am having trouble figuring out how to convert my tokenized wiki data so that it can be trained through either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried using pipelines with LR and HashingTF with IDF for NaiveBayes...but I keep getting wrong predictions. Here's what I've tried:

*Note that I would like to use the many categories in the wiki data for my labels...I've only seen binary classification (it's one category or another)....is it possible to do what I want?

Pipeline w LR

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.RegexTokenizer

case class WikiData(category: String, text: String)
case class LabeledData(category: String, text: String, label: Double)

val wikiData = sc.parallelize(List(WikiData("Spark", "this is about spark"), WikiData("Hadoop","then there is hadoop")))

val categoryMap = wikiData.map(x=>x.category).distinct.zipWithIndex.mapValues(x=>x.toDouble/1000).collectAsMap

val labeledData = wikiData.map(x=>LabeledData(x.category, x.text, categoryMap.get(x.category).getOrElse(0.0))).toDF

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("/W+")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(labeledData)

model.transform(labeledData).show

Naive Bayes

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.NaiveBayes

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documentsAsWordSequenceAlready)

import org.apache.spark.mllib.feature.IDF

tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

tf.cache()
val idf = new IDF(minDocFreq = 2).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

//to create tfidfLabeled (below) I ran a map set the labels...but again it seems to have to be 1.0 or 0.0?

NaiveBayes.train(tfidfLabeled)
  .predict(hashingTF.transform(tweet))
  .collect

Solution

ML LogisticRegression doesn't support multinomial classification yet, but it is supported by both MLlib NaiveBayes and LogisticRegressionWithLBFGS. In the first case it should work by default:

import org.apache.spark.mllib.classification.NaiveBayes

val nbModel = new NaiveBayes()
  .setModelType("multinomial") // This is default value
  .run(train)

but for logistic regression you should provide the number of classes:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(n) // Set number of classes
  .run(trainingData)
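
For context, train and trainingData above are placeholders for an RDD[LabeledPoint]. A minimal sketch of what such an RDD could look like, and one way to derive n from the labels (both are assumptions, since the question doesn't show this data):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// hypothetical toy training set; in practice this comes from the
// TF-IDF preparation shown further down in this answer
val trainingData = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0))))

// number of distinct classes, assuming labels are 0.0, 1.0, ..., (n - 1).0
val n = trainingData.map(_.label).distinct.count.toInt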

Regarding preprocessing steps, it is quite a broad topic and it is hard to give meaningful advice without access to your data, so everything below is just a wild guess:

  • as far as I understand, you use wiki data for training and tweets for testing. If that's true it is, generally speaking, a bad idea. You can expect the two sets to use significantly different vocabulary, grammar and spelling
  • a simple regex tokenizer can perform pretty well on standardized text, but from my experience it won't work well on informal text like tweets
  • HashingTF can be a good way to obtain a baseline model, but it is an extremely simplified approach, especially if you don't apply any filtering steps. If you decide to use it you should at least increase the number of features or use the default value (2^20); see the sketch after this list
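
A minimal sketch of that last point, assuming the "words" and "features" column names used in the question's pipeline:

import org.apache.spark.ml.feature.HashingTF

// raising numFeatures (e.g. to 2^20, as suggested above) reduces hashing collisions
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 20)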

EDIT (Preparing data for Naive Bayes with IDF)

  • using ML Pipelines:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.IDF
import org.apache.spark.sql.Row

val tokenizer = ???

val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("rawFeatures")

val idf = new IDF()
  .setInputCol(hashingTF.getOutputCol)
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf))
val model = pipeline.fit(labeledData)

model
 .transform(labeledData)
 .select($"label", $"features")
 .map{case Row(label: Double, features: Vector) => LabeledPoint(label, features)}
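
As a usage sketch (an assumption, since the answer stops at the mapping): if the resulting RDD[LabeledPoint] is assigned to a value, it can be fed straight into the MLlib trainer:

import org.apache.spark.mllib.classification.NaiveBayes

// `training` is assumed to be the RDD[LabeledPoint] produced by the map above
val nbModel = NaiveBayes.train(training)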

  • using MLlib transformers:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.{IDF, IDFModel}

val labeledData = wikiData.map(x => 
  LabeledData(x.category, x.text, categoryMap.get(x.category).getOrElse(0.0)))

val p = "\\W+".r
val raw = labeledData.map{
    case LabeledData(_, text, label) => (label, p.split(text))}

val hashingTF: org.apache.spark.mllib.feature.HashingTF = new HashingTF(1000)
val tf = raw.map{case (label, text) => (label, hashingTF.transform(text))}

val idf: org.apache.spark.mllib.feature.IDFModel = new IDF().fit(tf.map(_._2))
tf.map{
  case (label, rawFeatures) => LabeledPoint(label, idf.transform(rawFeatures))}

Note: Since transformers require JVM access, the MLlib version won't work in PySpark. If you prefer Python, you have to split the data, transform, and zip.

EDIT (Preparing data for ML algorithms):

While the following piece of code looks valid at first glance,

val categoryMap = wikiData
  .map(x=>x.category)
  .distinct
  .zipWithIndex
  .mapValues(x=>x.toDouble/1000)
  .collectAsMap

val labeledData = wikiData.map(x=>LabeledData(
    x.category, x.text, categoryMap.get(x.category).getOrElse(0.0))).toDF

it won't generate valid labels for ML algorithms.

First of all, ML expects labels to be in (0.0, 1.0, ..., n.0) where n is the number of classes. If, as in your example pipeline, one of the classes gets the label 0.001, you'll get an error like this:

ERROR LogisticRegression: Classification labels should be in {0 to 0 Found 1 invalid labels.

The obvious solution is to avoid division when you generate the mapping:

.mapValues(x=>x.toDouble)

While this will work for LogisticRegression, other ML algorithms will still fail. For example, with RandomForestClassifier you'll get:

RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

What is interesting, the ML version of RandomForestClassifier, unlike its MLlib counterpart, doesn't provide a method to set the number of classes. It turns out it expects special attributes to be set on the DataFrame column. The simplest approach is to use the StringIndexer mentioned in the error message:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val pipeline = new Pipeline()
  .setStages(Array(indexer, tokenizer, hashingTF, idf, lr))

val model = pipeline.fit(wikiData.toDF)
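
As a quick sanity check (a sketch, assuming the tokenizer, hashingTF, idf and lr stages defined earlier in this answer), you can look at the indexed labels next to the predictions:

model
  .transform(wikiData.toDF)
  .select("category", "label", "prediction")
  .show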
