How to keep record information when working in MLlib

Views: 53 | Published: 2020/9/4 7:40:50 | Tags: apache-spark, apache-spark-mllib

Problem description

I'm working on a classification problem in which I have to use the MLlib library. The classification algorithms in MLlib (Logistic Regression, say) require an RDD[LabeledPoint]. A LabeledPoint has only two fields: a label and a feature vector. When scoring (applying my trained model to the test set), my test instances have a few other fields that I'd like to keep. For example, a test instance looks like <id, field1, field2, label, features>. When I create an RDD of LabeledPoint, all the other fields (id, field1 and field2) are gone, and I can't relate a scored instance back to the original one. How can I solve this? After scoring, I need to know the ids and the score/predicted_label.

This problem doesn't exist in ML (spark.ml), because it uses DataFrames and I can simply add another column with the score to my original DataFrame.

Recommended answer

The map method of an RDD preserves the order of its elements, so you can use the RDD.zip method to pair the ids with the predictions.

The following answer shows the process:

Spark MLlib KMeans from a DataFrame, and back again

It is easy to obtain pairs of ids and clusters as an RDD:

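import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Keep each id next to its feature vector; cache because the RDD is reused below.
val idPointRDD = data.rdd.map(s => (s.getInt(0),
     Vectors.dense(s.getDouble(1), s.getDouble(2)))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)  // k = 3, at most 20 iterations
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
// Both sides are derived from idPointRDD by map, so zip pairs them element by element.
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)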

Then create a DataFrame from that:

// toDF needs the session's implicits in scope, e.g. import spark.implicits._
val idCluster = idClusterRDD.toDF("id", "cluster")

This works because map doesn't change the order of the data in an RDD, which is why you can simply zip the ids with the prediction results.
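
Applied to the logistic-regression case from the question, the same idea might look like the sketch below. It is only an illustration under stated assumptions: the `Instance` case class, the `scoreWithIds` helper, the toy data and the local SparkContext are all invented here and are not part of the original question or answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical record layout mirroring <id, field1, field2, label, features>.
case class Instance(id: Int, field1: String, field2: String,
                    label: Double, features: Vector)

object KeepIdsExample {
  // Hypothetical helper: scores the test set and pairs each prediction with its id.
  // Both RDDs passed to zip are derived from `test` by per-element transformations,
  // so they have the same partitions and the same element order.
  def scoreWithIds(model: LogisticRegressionModel,
                   test: RDD[Instance]): RDD[(Int, Double)] = {
    val predictions = model.predict(test.map(_.features))
    test.map(_.id).zip(predictions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("keep-ids").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Toy data, invented for illustration only.
    val train = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 0.1)),
      LabeledPoint(0.0, Vectors.dense(0.2, 0.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.9)),
      LabeledPoint(1.0, Vectors.dense(0.9, 1.1))))
    val test = sc.parallelize(Seq(
      Instance(101, "a", "x", 0.0, Vectors.dense(0.1, 0.1)),
      Instance(102, "b", "y", 1.0, Vectors.dense(1.0, 1.0)))).cache()

    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)

    // Each output pair is (id, predicted label).
    scoreWithIds(model, test).collect().foreach {
      case (id, prediction) => println(s"$id -> $prediction")
    }
    sc.stop()
  }
}
```

RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition. That holds here because both sides come from the same parent RDD. If the other fields (field1, field2) are needed as well, the resulting (id, prediction) pairs can be joined back to the original records on id.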
