SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?


Problem description


I've got a LabeledPoint on which I want to run a decision tree (and later a random forest):

scala> transformedData.collect
res8: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0,...

using code:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.impurity.Gini

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() //change to what?
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(
  trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

In my data I've got two types of features:

  1. some features are counts from user visits on a given website/domain (feature is a website/domain and its value is number of visits)
  2. rest of the features are some declarative variables - binary/categorical

Is there a way to create categoricalFeaturesInfo automatically from LabeledPoint? I want to check the levels of my declarative variables (type 2) and then use this information to create categoricalFeaturesInfo.

I have a list of the declarative variables:

List(6363,21345,23455,...

Solution

categoricalFeaturesInfo should be a map from a feature index to the number of classes for that feature. Generally speaking, identifying categorical variables automatically can be expensive, especially if they are heavily mixed with continuous variables. Moreover, depending on your data, it can produce both false positives and false negatives. With that in mind, it is better to set these values manually.
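Since you already have the list of declarative-variable indices, one option is to count distinct observed values per index yourself. A minimal plain-Scala sketch (not from the answer; toy data and index names are made up, sparse rows are modelled as plain `Map`s rather than MLlib vectors):

```scala
// Toy sparse rows: feature index -> value, standing in for the sparse
// vectors inside each LabeledPoint (hypothetical data for illustration).
val rows: Seq[Map[Int, Double]] = Seq(
  Map(6363 -> 0.0, 21345 -> 2.0, 7744 -> 12.0),
  Map(6363 -> 1.0, 21345 -> 1.0),
  Map(6363 -> 1.0, 21345 -> 0.0, 7744 -> 3.0)
)

// Indices of the declarative (categorical) variables, as in the question's list.
val declarative = Seq(6363, 21345)

// For each declarative index, count the distinct observed values.
// An absent entry in a sparse vector means 0.0, so missing is treated as 0.0.
val categoricalFeaturesInfo: Map[Int, Int] =
  declarative.map { idx =>
    idx -> rows.map(_.getOrElse(idx, 0.0)).toSet.size
  }.toMap

println(categoricalFeaturesInfo)
```

Note that a distinct-value count only equals the number of classes if the categorical values are already coded as 0, 1, ..., k-1, which is what the decision tree expects; otherwise the values would need to be re-indexed first.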

If you still want to create categoricalFeaturesInfo automatically, you can take a look at ml.feature.VectorIndexer. It is not directly applicable in this case, but it should provide a useful code base for building your own solution.
