How do I run the Spark decision tree with a categorical feature set using Scala?
Question
I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data. However, LabeledPoint requires (double, vector), where the vector requires doubles.
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol)
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification,categoricalFeaturesInfo)
The error I get:
scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[String])
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
My current resources: tree configuration, decision tree, labeledpoint
Answer
You can first transform categories to numbers, then load the data as if all features were numerical.
When you build a decision tree model in Spark, you just need to tell Spark which features are categorical, along with each feature's arity (the number of distinct categories of that feature), by specifying a Map[Int, Int] from feature indices to arities.
For example, if you have data such as:
1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me
you can first transform the data into numerical format:
1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
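That transformation can be sketched in plain Scala (no Spark required). The row layout matches the example above; assigning indices in order of first appearance is just one convenient scheme, since any consistent category-to-index mapping works:

```scala
// Example rows from above: label, then two string (categorical) columns.
val rows = Seq(
  Array("1", "a", "add"),
  Array("2", "b", "more"),
  Array("1", "c", "thinking"),
  Array("3", "a", "to"),
  Array("1", "c", "me")
)

// Build a category -> index map for a given column,
// assigning indices in order of first appearance.
def indexOf(col: Int): Map[String, Int] =
  rows.map(_(col)).distinct.zipWithIndex.toMap

val col1 = indexOf(1) // a -> 0, b -> 1, c -> 2
val col2 = indexOf(2) // add -> 0, more -> 1, thinking -> 2, to -> 3, me -> 4

// Replace each category with its index; everything is now a Double.
val numeric: Seq[Array[Double]] =
  rows.map(r => Array(r(0).toDouble, col1(r(1)).toDouble, col2(r(2)).toDouble))
```

Note that the arities of the two columns (3 and 5 here) are exactly the sizes of `col1` and `col2`, which is what goes into `categoricalFeaturesInfo` below.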
In that format you can load the data into Spark. Then, to tell Spark that the second and third columns are categorical, you create a map:
categoricalFeaturesInfo = Map[Int, Int]((1,3),(2,5))
The map says that the feature with index 1 has arity 3, and the feature with index 2 has arity 5. They will be treated as categorical when we build a decision tree model, passing that map as a parameter of the training function:
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
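Putting it together, a minimal sketch of the Spark side might look like the following. The names `sc`, `numericRows` (an `RDD[Array[Double]]` holding the transformed data), and the concrete parameter values are assumptions for illustration; note that `maxBins` must be at least as large as the biggest arity (5 here), and that this fixes the original error because `Vectors.dense` now receives an `Array[Double]` rather than an `Array[String]`:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.rdd.RDD

// Assumed: numericRows is the transformed data, e.g.
// val numericRows: RDD[Array[Double]] = sc.parallelize(numeric)

// First element is the label, the rest are (now numeric) features.
val trainingData: RDD[LabeledPoint] =
  numericRows.map(r => LabeledPoint(r.head, Vectors.dense(r.tail)))

// Feature 1 has 3 categories, feature 2 has 5.
val categoricalFeaturesInfo = Map[Int, Int](1 -> 3, 2 -> 5)

val model = DecisionTree.trainClassifier(
  trainingData,
  numClasses = 3,              // labels here are 1, 2, 3 (assumed)
  categoricalFeaturesInfo,
  impurity = "gini",
  maxDepth = 3,
  maxBins = 32)                // must be >= the largest arity
```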