Spark MLLib 2.0 Categorical Features in pipeline


Problem Description

I'm trying to build a decision tree based on log files. Some feature sets are large, containing thousands of unique values. I'm trying to use the new idioms of pipelines and DataFrames in Java. I've built a pipeline with several StringIndexer pipeline stages, one for each of the categorical feature columns. Then I use a VectorAssembler to create a features vector. The resulting DataFrame looks perfect to me after the VectorAssembler stage. My pipeline looks approximately like:

StringIndexer -> StringIndexer -> StringIndexer -> VectorAssembler -> DecisionTreeClassifier
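
For illustration, an equivalent PySpark sketch of such a pipeline might look like the following; the column names (domain, path, status, label) and trainingDF are placeholders, not the real log fields:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

categorical_cols = ["domain", "path", "status"]  # placeholder column names
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical_cols]
assembler = VectorAssembler(inputCols=[c + "_idx" for c in categorical_cols], outputCol="features")
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=indexers + [assembler, dt])
model = pipeline.fit(trainingDF)  # trainingDF is a placeholder DataFrame built from the log files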

But I'm getting the following error:

DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 5 has 49 values. Consider removing this and other categorical features with a large number of values, or add more training examples.

I can resolve this issue by using a Normalizer, but then the resulting decision tree is unusable for my needs, as I need to generate a DSL decision tree with the original feature values. I can't manually set maxBins because the whole pipeline is executed together. I would like the resulting decision tree to have the StringIndexer-generated values (e.g. Feature 5 <= 132). Additionally, though less important, I'd like to be able to specify my own names for the features (e.g. 'domain' instead of 'Feature 5').

Answer

The DecisionTree algorithm takes a single maxBins value that determines the number of splits to consider. The default value is 32. maxBins must be greater than or equal to the maximum number of categories of any categorical feature. Since your feature 5 has 49 distinct values, you need to increase maxBins to 49 or greater.
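
Because the DecisionTreeClassifier estimator is configured before the pipeline is fitted, maxBins can also be raised directly when the estimator is created. A minimal sketch, assuming the same labelIndexer, typeIndexer and assembler stages used further down and a hypothetical trainingDF:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol="indexed", featuresCol="features", maxBins=64)  # any value >= 49 covers feature 5
pipeline = Pipeline(stages=[labelIndexer, typeIndexer, assembler, dt])
model = pipeline.fit(trainingDF)  # trainingDF is a hypothetical training DataFrame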

The DecisionTree algorithm has several hyperparameters, and tuning them to your data can improve accuracy. You can do this tuning with Spark's cross-validation framework, which automatically tests a grid of hyperparameters and chooses the best combination.

Here is an example in Python that tests three maxBins values, [49, 52, 55] (along with maxDepth and impurity):

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder

dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")
paramGrid = ParamGridBuilder().addGrid(dt.maxBins, [49, 52, 55]).addGrid(dt.maxDepth, [4, 6, 8]).addGrid(dt.impurity, ["entropy", "gini"]).build()
pipeline = Pipeline(stages=[labelIndexer, typeIndexer, assembler, dt])  # labelIndexer, typeIndexer and assembler are the earlier pipeline stages
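
To actually run this grid, the paramGrid and pipeline are typically passed to a CrossValidator. A minimal sketch follows; the evaluator configuration, numFolds value and trainingDF DataFrame are assumptions for illustration, not part of the original answer:

from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="indexed", predictionCol="prediction", metricName="f1")  # assumed metric
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)
cvModel = cv.fit(trainingDF)  # trainingDF is a hypothetical training DataFrame
bestModel = cvModel.bestModel  # PipelineModel fitted with the best-scoring parameter combination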
