How to prepare training data in MLlib
Problem description
TL;DR: How do I use MLlib to train on my wiki data (text & category) for prediction against tweets?
I am VERY new to machine learning, so I am having trouble figuring out how to convert my tokenized wiki data so that it can be used to train either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried using pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried:
*Note that I would like to use the many categories in the wiki data as my labels. I've only seen binary classification (it's one category or another). Is it possible to do what I want?
Pipeline with LR
    import org.apache.spark.rdd.RDD
    import org.apache.spark.SparkContext
    import org.apache.spark.ml.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.ml.feature.RegexTokenizer

    case class WikiData(category: String, text: String)
    case class LabeledData(category: String, text: String, label: Double)

    val wikiData = sc.parallelize(List(
      WikiData("Spark", "this is about spark"),
      WikiData("Hadoop", "then there is hadoop")))

    val categoryMap = wikiData.map(x => x.category).distinct.zipWithIndex.mapValues(x => x.toDouble / 1000).collectAsMap

    val labeledData = wikiData.map(x => LabeledData(x.category, x.text, categoryMap.get(x.category).getOrElse(0.0))).toDF

    val tokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setPattern("/W+")

    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(labeledData)

    model.transform(labeledData).show
Naive Bayes
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documentsAsWordSequenceAlready)

    import org.apache.spark.mllib.feature.IDF

    tf.cache()
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)

    tf.cache()
    val idf = new IDF(minDocFreq = 2).fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)

    // to create tfidfLabeled (below) I ran a map to set the labels... but again it seems to have to be 1.0 or 0.0?
    NaiveBayes.train(tfidfLabeled)
      .predict(hashingTF.transform(tweet))
      .collect
Solution
ML LogisticRegression doesn't support multinomial classification yet, but it is supported by both MLlib NaiveBayes and LogisticRegressionWithLBFGS. In the first case it should work by default:

    import org.apache.spark.mllib.classification.NaiveBayes

    val nbModel = new NaiveBayes()
      .setModelType("multinomial") // This is the default value
      .run(train)
but for logistic regression you should provide the number of classes:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(n) // Set the number of classes
      .run(trainingData)
Preprocessing is quite a broad topic, and it is hard to give meaningful advice without access to your data, so everything below is just a wild guess:
- As far as I understand, you use wiki data for training and tweets for testing. If that's true, it is generally speaking a bad idea. You can expect the two sets to use significantly different vocabulary, grammar and spelling.
- A simple regex tokenizer can perform pretty well on standardized text, but in my experience it won't work well on informal text like tweets.
- HashingTF can be a good way to obtain a baseline model, but it is an extremely simplified approach, especially if you don't apply any filtering steps. If you decide to use it, you should at least increase the number of features or use the default value (2^20).
EDIT (Preparing data for Naive Bayes with IDF)
- using ML Pipelines:
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, IDF}
    import org.apache.spark.sql.Row

    val tokenizer = ???

    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("rawFeatures")

    val idf = new IDF()
      .setInputCol(hashingTF.getOutputCol)
      .setOutputCol("features")

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf))
    val model = pipeline.fit(labeledData)

    // Convert the transformed rows into LabeledPoints for MLlib
    model
      .transform(labeledData)
      .select($"label", $"features")
      .map { case Row(label: Double, features: Vector) => LabeledPoint(label, features) }
- using MLlib transformers:
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.feature.{IDF, IDFModel}
    import org.apache.spark.mllib.regression.LabeledPoint

    val labeledData = wikiData.map(x =>
      LabeledData(x.category, x.text, categoryMap.get(x.category).getOrElse(0.0)))

    // Split each text on runs of non-word characters
    val p = "\\W+".r
    val raw = labeledData.map {
      case LabeledData(_, text, label) => (label, p.split(text)) }

    val hashingTF: org.apache.spark.mllib.feature.HashingTF = new HashingTF(1000)
    val tf = raw.map { case (label, text) => (label, hashingTF.transform(text)) }

    // Fit IDF on the feature vectors only, then reattach the labels
    val idf: org.apache.spark.mllib.feature.IDFModel = new IDF().fit(tf.map(_._2))
    tf.map {
      case (label, rawFeatures) => LabeledPoint(label, idf.transform(rawFeatures)) }
Note: Since the transformers require JVM access, the MLlib version won't work in PySpark. If you prefer Python, you have to split the data into labels and features, transform the features, and zip the two back together.
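To make the two variants above a bit more concrete, here is a minimal plain-Scala sketch of what hashed term frequencies and IDF weighting compute. It does not use Spark: the object TfIdfSketch and all of its members are hypothetical names, the bucket function is only a stand-in for whatever hash HashingTF uses internally, and the smoothed formula log((m + 1) / (df + 1)) is the one I understand the MLlib docs to describe.

```scala
// Plain-Scala illustration only; not Spark's implementation.
object TfIdfSketch {
  // Map a term to a bucket in [0, numFeatures): a stand-in for the hashing trick.
  def bucket(term: String, numFeatures: Int): Int = {
    val h = term.hashCode % numFeatures
    if (h < 0) h + numFeatures else h
  }

  // Hashed term frequencies as a sparse map: bucket index -> count.
  // Terms that land in the same bucket are counted together (a collision).
  def hashingTF(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens.groupBy(bucket(_, numFeatures)).map { case (i, ts) => i -> ts.size.toDouble }

  // Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1)),
  // where m is the number of documents and df the number of documents using the bucket.
  def idf(docs: Seq[Map[Int, Double]]): Map[Int, Double] = {
    val m = docs.size
    val df = docs.flatMap(_.keys).groupBy(identity).map { case (i, hits) => i -> hits.size }
    df.map { case (i, d) => i -> math.log((m + 1.0) / (d + 1.0)) }
  }

  // Scale each term frequency by the weight of its bucket.
  def tfidf(tf: Map[Int, Double], weights: Map[Int, Double]): Map[Int, Double] =
    tf.map { case (i, f) => i -> f * weights.getOrElse(i, 0.0) }

  // The two example documents from the question, split on non-word characters.
  val docs: Seq[Seq[String]] = Seq(
    "this is about spark".split("\\W+").toSeq,
    "then there is hadoop".split("\\W+").toSeq
  )
  val tf: Seq[Map[Int, Double]] = docs.map(hashingTF(_, 1 << 20))
  val weights: Map[Int, Double] = idf(tf)
  val features: Seq[Map[Int, Double]] = tf.map(tfidf(_, weights))
}
```

In this sketch the word "is" occurs in both documents, so its bucket gets weight log(3 / 3) = 0 and contributes nothing to either feature vector. A small number of buckets (such as 1000) makes it more likely that unrelated terms share a bucket, which is one reason to prefer a larger feature space.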
EDIT (Preparing data for ML algorithms):
While the following piece of code looks valid at first glance

    val categoryMap = wikiData
      .map(x => x.category)
      .distinct
      .zipWithIndex
      .mapValues(x => x.toDouble / 1000)
      .collectAsMap

    val labeledData = wikiData.map(x => LabeledData(
      x.category, x.text, categoryMap.get(x.category).getOrElse(0.0))).toDF
it won't generate valid labels for ML algorithms.
First of all, ML expects labels to be in (0.0, 1.0, ..., n - 1), where n is the number of classes. If you run your example pipeline, where one of the classes gets the label 0.001, you'll get an error like this:

    ERROR LogisticRegression: Classification labels should be in {0 to 0 Found 1 invalid labels.
The obvious solution is to avoid the division when you generate the mapping:

    .mapValues(x => x.toDouble)
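The difference between the two mappings can be checked without Spark. The following is a rough sketch on ordinary Scala collections, assuming a small made-up category list; LabelMappingSketch, buildMap and isValid are hypothetical names, and unlike RDD.distinct the collection version keeps first-occurrence order.

```scala
// Plain-Scala illustration only; the category list is made up.
object LabelMappingSketch {
  val categories: Seq[String] = Seq("Spark", "Hadoop", "Spark", "Hive")

  // Index the distinct categories, then divide the index by `scale`.
  def buildMap(scale: Double): Map[String, Double] =
    categories.distinct.zipWithIndex.map { case (c, i) => c -> i.toDouble / scale }.toMap

  // A label set is usable for classification when it is exactly 0.0, 1.0, ..., n - 1.
  def isValid(labels: Iterable[Double]): Boolean = {
    val sorted = labels.toSeq.distinct.sorted
    sorted == sorted.indices.map(_.toDouble)
  }

  val scaled: Map[String, Double] = buildMap(1000.0) // fractional labels, as in the question
  val plain: Map[String, Double]  = buildMap(1.0)    // whole-number labels
}
```

Here isValid(plain.values) holds while isValid(scaled.values) does not, which mirrors the error above: dividing by 1000 turns the class indices into fractions that no classifier will accept as class labels.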
While it will work for LogisticRegression, other ML algorithms will still fail. For example, with RandomForestClassifier you'll get:

    RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
What is interesting is that the ML version of RandomForestClassifier, unlike its MLlib counterpart, doesn't provide a method to set the number of classes. It turns out that it expects special attributes to be set on a DataFrame column. The simplest approach is to use the StringIndexer mentioned in the error message:

    import org.apache.spark.ml.feature.StringIndexer

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("label")

    val pipeline = new Pipeline()
      .setStages(Array(indexer, tokenizer, hashingTF, idf, lr))

    val model = pipeline.fit(wikiData.toDF)
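For intuition about the label values StringIndexer produces, here is a small plain-Scala sketch of frequency-ordered indexing, where the most frequent category gets 0.0. StringIndexerSketch and fitIndex are hypothetical names, the alphabetical tie-break is an assumption of this sketch, and the column metadata (number of classes) that the real transformer attaches is not modelled.

```scala
// Plain-Scala illustration only; not Spark's StringIndexer.
object StringIndexerSketch {
  // Assign 0.0 to the most frequent category, 1.0 to the next, and so on.
  // Ties are broken alphabetically here (an assumption of this sketch).
  def fitIndex(categories: Seq[String]): Map[String, Double] =
    categories.groupBy(identity).toSeq
      .sortBy { case (c, hits) => (-hits.size, c) }
      .zipWithIndex
      .map { case ((c, _), i) => c -> i.toDouble }
      .toMap

  // Made-up input: "Spark" occurs three times, "Hadoop" twice, "Hive" once.
  val index: Map[String, Double] =
    fitIndex(Seq("Spark", "Hadoop", "Spark", "Hive", "Spark", "Hadoop"))
}
```

With this input the sketch maps Spark to 0.0, Hadoop to 1.0 and Hive to 2.0, so the labels are consecutive whole numbers starting at zero.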