SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?

Views: 33 | Published: 2021/11/14 21:09:50 | Tags: scala, apache-spark, random-forest, decision-tree, apache-spark-mllib


Problem description

I've got LabeledPoint data on which I want to run a decision tree (and later a random forest):

scala> transformedData.collect
res8: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0,...

Using the following code:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.impurity.Gini

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() //change to what?
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(
  trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

In my data I've got two types of features:

Some features are counts of user visits to a given website/domain (the feature is a website/domain and its value is the number of visits).

The rest of the features are declarative variables, either binary or categorical.

Is there a way to create categoricalFeaturesInfo automatically from the LabeledPoint data? I want to check the levels of my declarative variables (type 2) and then use that information to build categoricalFeaturesInfo.

I have a list of the declarative variables:

List(6363,21345,23455,...
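One possible way to derive the map from the data is sketched below. It rests on several assumptions: the declarative variables are already encoded as 0, 1, ..., k-1 (which is what MLlib expects for categorical features), so the arity of each one can be taken as its largest observed value plus one; every listed feature is treated as at least binary; and `buildCategoricalFeaturesInfo` and `categoricalIndices` are hypothetical names introduced here, not part of MLlib.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical helper (not part of MLlib): derives the arity of each
// categorical feature from the data itself.
def buildCategoricalFeaturesInfo(
    data: RDD[LabeledPoint],
    categoricalIndices: Seq[Int]): Map[Int, Int] = {
  // Small lookup set, shipped to the executors inside the closure.
  val wanted = categoricalIndices.toSet

  // Largest value observed for each categorical feature index.
  // Only the active (explicitly stored) entries of each vector are scanned.
  val observedMax: Map[Int, Double] = data
    .flatMap { lp =>
      val sv = lp.features.toSparse
      sv.indices.iterator
        .zip(sv.values.iterator)
        .filter { case (i, _) => wanted.contains(i) }
    }
    .reduceByKey((a, b) => math.max(a, b))
    .collect()
    .toMap

  // Categories are assumed to be encoded as 0, 1, ..., k-1, so k = max + 1.
  // Assumption: every listed feature is at least binary (k >= 2), even if
  // only zeros were observed for it.
  categoricalIndices.map { i =>
    i -> math.max(observedMax.getOrElse(i, 0.0).toInt + 1, 2)
  }.toMap
}
```

The result could then replace the empty map, for example `val categoricalFeaturesInfo = buildCategoricalFeaturesInfo(transformedData, declarativeIdx)`, where `declarativeIdx` stands for the list of declarative-variable indices above. Note that `maxBins` must be at least as large as the largest arity in the map, so the value of 32 may need to be raised.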
