Issue with Spark MLLib that causes probability and prediction to be the same for everything


Problem description

I'm learning how to use machine learning with Spark MLLib, with the goal of doing sentiment analysis of tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

That dataset contains 1 million tweets classified as positive or negative. The second column of the dataset contains the sentiment and the fourth column contains the tweet.

This is my current PySpark code:

import csv
from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Load the CSV and drop the header row
data = sc.textFile("/home/omar/sentiment-train.csv")
header = data.first()
rdd = data.filter(lambda row: row != header)

# Parse each partition with the csv module and keep (tweet, sentiment)
r = rdd.mapPartitions(lambda x : csv.reader(x))
r2 = r.map(lambda x: (x[3], int(x[1])))

# Build a DataFrame and take a random sample of 10,000 rows
parts = r2.map(lambda x: Row(sentence=x[0], label=int(x[1])))
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)

# Tokenize and remove stop words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)

remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)

train_data_raw = base_words.select("base_words", "label")

# Turn each tweet into a 100-dimensional Word2Vec embedding
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")

model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")

# Train a logistic regression classifier and inspect its predictions
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)

lrModel.transform(final_train_data).show()

I'm executing this in the PySpark interactive shell, launched with this command:

pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'

(FYI: I have an HDFS cluster of 10 VMs running YARN, Spark, etc.)

As a result of the last line of code, this is what happens:

>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
|    0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...|       1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows

If I do the same with a smaller dataset that I created manually, it works. I don't know what is happening; I have been working on this all day.

Any suggestions?

Thanks for your time!

Solution

TL;DR Ten iterations is way too low for any real-life application. On large and non-trivial datasets it can take a thousand or more iterations (as well as tuning the remaining parameters) to converge.
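
For example, here is a minimal sketch of re-fitting on the final_train_data from the question with a much larger maxIter and lighter regularization. The concrete numbers are illustrative assumptions, not tuned settings:

from pyspark.ml.classification import LogisticRegression

# Illustrative values only: many more iterations, weaker regularization
lr = LogisticRegression(maxIter=1000, regParam=0.01, elasticNetParam=0.0)
lrModel = lr.fit(final_train_data)

# With a better-fitted model the probabilities should now differ per row
lrModel.transform(final_train_data).select("label", "probability", "prediction").show(5)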

A binomial LogisticRegressionModel has a summary attribute, which gives you access to a LogisticRegressionSummary object. Among other useful metrics it contains objectiveHistory, which can be used to debug the training process:

import matplotlib.pyplot as plt

lrm = LogisticRegression(..., family="binomial").fit(df)
plt.plot(lrm.summary.objectiveHistory)

plt.show()
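
As a quick numeric check (again just a sketch, assuming lrm was fitted as above), the training summary also exposes totalIterations, which tells you whether the solver stopped because it hit maxIter rather than because it actually converged:

print(lrm.summary.totalIterations)       # iterations actually performed
print(lrm.summary.objectiveHistory[-1])  # final value of the training objective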
