SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?
I've got an RDD of LabeledPoint on which I want to run a decision tree (and later a random forest):
scala> transformedData.collect
res8: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0,...
using this code:
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.impurity.Gini
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() //change to what?
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(
trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
In my data I've got two types of features:
- some features are counts of user visits to a given website/domain (the feature is a website/domain and its value is the number of visits)
- the rest of the features are declarative variables - binary/categorical
Is there a way to create categoricalFeaturesInfo automatically from LabeledPoint? I want to check the levels of my declarative variables (type 2) and then use this information to create categoricalFeaturesInfo.
I have a list with the declarative variables:
List(6363,21345,23455,...
categoricalFeaturesInfo should map from a feature index to the number of categories for that feature. Generally speaking, identifying categorical variables automatically can be expensive, especially if they are heavily mixed with continuous variables. Moreover, depending on your data, it can give both false positives and false negatives. Keeping that in mind, it is better to set these values manually.
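If the indices of the declarative features are known up front (as in the list from the question), the number of levels can be counted directly. Here is a minimal sketch of that logic using plain Scala collections with made-up indices and rows; on a real RDD[LabeledPoint] the same counting would be done with a distributed aggregation (e.g. flatMap over the sparse vectors followed by reduceByKey):

```scala
// Hypothetical sample rows: each row is a sparse feature vector
// represented as a map from feature index to value.
val rows: Seq[Map[Int, Double]] = Seq(
  Map(0 -> 2.0, 3 -> 1.0),
  Map(0 -> 1.0, 3 -> 0.0),
  Map(0 -> 2.0, 3 -> 2.0)
)

// Indices of the declarative (categorical) features, assumed known up front.
val categoricalIdx = Set(0, 3)

// For each categorical index, count the distinct values it takes.
// Caveat: trainClassifier expects category values encoded as 0 .. k-1,
// so if the raw values are not already such codes they must be re-encoded.
val categoricalFeaturesInfo: Map[Int, Int] =
  categoricalIdx.map { i =>
    i -> rows.flatMap(_.get(i)).distinct.size
  }.toMap

println(categoricalFeaturesInfo) // Map(0 -> 2, 3 -> 3)
```

Note that this only counts values that actually occur in the data, so a rare level that is absent from the sample will be missed.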
If you still want to create categoricalFeaturesInfo automatically, you can take a look at ml.feature.VectorIndexer. It is not directly applicable to this case, but it should provide a useful code base for building your own solution.
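For reference, a fitted VectorIndexerModel exposes categoryMaps: Map[Int, Map[Double, Int]] (feature index to a map from original value to category index). Turning that into the Map[Int, Int] that trainClassifier expects is just a matter of taking sizes. The sketch below simulates categoryMaps with hand-made values rather than fitting an actual indexer:

```scala
// Simulated VectorIndexerModel.categoryMaps: feature index ->
// (original value -> category index). A real model would come from
// something like:
//   new VectorIndexer()
//     .setInputCol("features")
//     .setOutputCol("indexedFeatures")
//     .setMaxCategories(k)
//     .fit(df)
val categoryMaps: Map[Int, Map[Double, Int]] = Map(
  0 -> Map(0.0 -> 0, 1.0 -> 1),
  3 -> Map(0.0 -> 0, 1.0 -> 1, 2.0 -> 2)
)

// trainClassifier wants feature index -> number of categories:
val categoricalFeaturesInfo: Map[Int, Int] =
  categoryMaps.map { case (idx, valueMap) => idx -> valueMap.size }
```

Keep in mind that VectorIndexer decides what is categorical purely by the number of distinct values (maxCategories), so a count feature with few distinct values could be misclassified as categorical, which is exactly the false-positive risk mentioned above.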