SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?


Question

I've got LabeledPoint data on which I want to run a decision tree (and later a random forest):

scala> transformedData.collect
res8: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0,...

Using this code:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.impurity.Gini

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() //change to what?
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(
  trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

In my data I have two types of features:

  1. Some features are counts of user visits to a given website/domain (the feature is a website/domain and its value is the number of visits).

  2. The rest of the features are declarative variables (binary/categorical).

Is there a way to create categoricalFeaturesInfo automatically from LabeledPoint? I want to check the levels of my declarative variables (type 2) and then use that information to create categoricalFeaturesInfo.

I have a list of the declarative variables:

List(6363,21345,23455,...

Recommended answer

categoricalFeaturesInfo should map from a feature index to the number of categories for that feature. Generally speaking, identifying categorical variables automatically can be expensive, especially if they are heavily mixed with continuous variables. Moreover, depending on your data, it can give both false positives and false negatives. With that in mind, it is better to set these values manually.
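Since the question already has a list of the declarative feature indices, the map can be built from that list instead of being detected. The following is a minimal sketch, not a definitive implementation. It assumes each categorical value is already an integer code 0, 1, ..., k-1, so a feature's arity is its largest observed value plus one. The names CategoricalInfo, update, merge and finish are hypothetical helpers written in plain Scala so they can be checked without a cluster; the Spark wiring in the trailing comment is an untested outline.

```scala
// Sketch: build "feature index -> number of categories" for KNOWN categorical indices.
// Assumption: categorical values are integer codes 0, 1, ..., k-1.
object CategoricalInfo {
  // One sparse row: (active feature indices, values at those indices).
  type SparseRow = (Array[Int], Array[Double])

  /** Fold one sparse row into the running "index -> arity" map. */
  def update(acc: Map[Int, Int], row: SparseRow, categorical: Set[Int]): Map[Int, Int] = {
    val (indices, values) = row
    indices.zip(values).foldLeft(acc) { case (m, (i, v)) =>
      if (categorical.contains(i)) m.updated(i, math.max(m.getOrElse(i, 0), v.toInt + 1))
      else m
    }
  }

  /** Merge two partial maps, keeping the larger arity for each feature. */
  def merge(a: Map[Int, Int], b: Map[Int, Int]): Map[Int, Int] =
    b.foldLeft(a) { case (m, (i, k)) => m.updated(i, math.max(m.getOrElse(i, 0), k)) }

  /** Assumed default: every listed feature gets an arity of at least 2 (binary),
    * including features that were never seen with a non-zero value. */
  def finish(seen: Map[Int, Int], categorical: Set[Int]): Map[Int, Int] =
    categorical.map(i => i -> math.max(seen.getOrElse(i, 2), 2)).toMap
}

// Possible Spark wiring (untested outline), with transformedData: RDD[LabeledPoint]
// and cats: Set[Int] built from the list of declarative variables:
//   val seen = transformedData
//     .map { lp => val sv = lp.features.toSparse; (sv.indices, sv.values) }
//     .aggregate(Map.empty[Int, Int])(CategoricalInfo.update(_, _, cats), CategoricalInfo.merge)
//   val categoricalFeaturesInfo = CategoricalInfo.finish(seen, cats)
```

update and merge have the shapes that RDD.aggregate expects for its sequence and combine operations, so only small per-partition maps travel to the driver. Note that maxBins has to be at least as large as the largest arity in the resulting map.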

If you still want to create categoricalFeaturesInfo automatically, take a look at ml.feature.VectorIndexer. It is not directly applicable in this case, but it should provide a useful code base for building your own solution.
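VectorIndexer decides which features are categorical by counting distinct values: a feature with at most maxCategories distinct values is treated as categorical. A rough plain-Scala sketch of that idea is below. It is not VectorIndexer's actual code; DistinctValueHeuristic, update and guess are hypothetical names, and the sketch assumes that every feature seen in sparse form also takes the implicit value 0.0 somewhere in the data.

```scala
// Sketch of a distinct-value heuristic, loosely modeled on the idea behind VectorIndexer.
object DistinctValueHeuristic {
  // One sparse row: (active feature indices, values at those indices).
  type SparseRow = (Array[Int], Array[Double])

  /** Record one row's values; a per-feature set stops growing once it
    * holds more than maxCategories entries, which bounds memory use. */
  def update(acc: Map[Int, Set[Double]], row: SparseRow, maxCategories: Int): Map[Int, Set[Double]] = {
    val (indices, values) = row
    indices.zip(values).foldLeft(acc) { case (m, (i, v)) =>
      val seen = m.getOrElse(i, Set(0.0)) // assumption: the implicit zero also occurs
      if (seen.size > maxCategories) m else m.updated(i, seen + v)
    }
  }

  /** Guess: features with at most maxCategories distinct values are categorical. */
  def guess(acc: Map[Int, Set[Double]], maxCategories: Int): Map[Int, Int] =
    acc.collect { case (i, vs) if vs.size <= maxCategories => i -> vs.size }
}
```

The guessed arity is a count of distinct values, so the values would still have to be re-indexed to 0, 1, ..., k-1 before training. The heuristic also shows how false positives arise: a visit-count feature that happens to take only one or two distinct values in the data is indistinguishable from a binary variable.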
