Spark MLLib 2.0 Categorical Features in pipeline

Question

I'm trying to build a decision tree based on log files. Some feature sets are large, containing thousands of unique values. I'm trying to use the new pipeline and data frame idioms in Java. I've built a pipeline with one StringIndexer stage for each of the categorical feature columns. Then I use a VectorAssembler to create a features vector. The resulting data frame looks perfect to me after the VectorAssembler stage. My pipeline looks approximately like this:

StringIndexer -> StringIndexer -> StringIndexer -> VectorAssembler -> DecisionTreeClassifier

But I get the following error:

DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 5 has 49 values. Considering remove this and other categorical features with a large number of values, or add more training examples.

I can resolve this issue by using a Normalizer, but then the resulting decision tree is unusable for my needs, as I need to generate a DSL decision tree with the original feature values. I can't manually set the maxBins because the whole pipeline is executed together. I would like the resulting decision tree to have the StringIndexer-generated values (e.g. Feature 5 <= 132). Additionally, but less importantly, I'd like to be able to specify my own names for the features (e.g. 'domain' instead of 'Feature 5').

Recommended answer

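The DecisionTree algorithm takes a single maxBins value that determines how many candidate splits it considers for each feature. The default value is 32. maxBins must be greater than or equal to the largest number of categories in any categorical feature. Since your feature 5 has 49 distinct values, you need to increase maxBins to 49 or more. maxBins is a parameter of the DecisionTreeClassifier stage itself, so it can be set on that stage before the pipeline is fit.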
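
One way to pick a safe value is to count the distinct values in each categorical column before fitting and take the largest count. The following is a minimal sketch in plain Python, not Spark code: the function name required_max_bins and the sample rows are hypothetical, and on a real data frame the counts would come from a distinct-count aggregation instead.

```python
DEFAULT_MAX_BINS = 32  # default maxBins, as stated in the answer above


def required_max_bins(rows, categorical_columns, default=DEFAULT_MAX_BINS):
    """Return (max_bins, distinct_counts) for the given categorical columns.

    rows is a list of dicts. This helper is hypothetical and only illustrates
    the rule that maxBins must be >= the largest category count.
    """
    distinct_counts = {
        col: len({row[col] for row in rows}) for col in categorical_columns
    }
    largest = max(distinct_counts.values(), default=0)
    # Never go below the default; only raise it when a column needs more bins.
    return max(default, largest), distinct_counts


# Hypothetical log rows: 49 distinct domains, 3 distinct HTTP methods.
rows = [
    {"domain": f"site{i}.example", "method": ["GET", "POST", "PUT"][i % 3]}
    for i in range(49)
]
max_bins, counts = required_max_bins(rows, ["domain", "method"])
print(max_bins, counts)
```

The resulting value could then be passed as maxBins when constructing the DecisionTreeClassifier stage, so the pipeline can still be fit in one call.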

The DecisionTree algorithm has several hyperparameters, and tuning them to your data can improve accuracy. You can do this tuning using Spark's Cross Validation framework, which automatically tests a grid of hyperparameters and chooses the best.

Here is an example in Python that tests three maxBins values, [49, 52, 55]:

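from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder
dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")  # maxDepth is overridden by the grid below
paramGrid = ParamGridBuilder().addGrid(dt.maxBins, [49, 52, 55]).addGrid(dt.maxDepth, [4, 6, 8]).addGrid(dt.impurity, ["entropy", "gini"]).build()
pipeline = Pipeline(stages=[labelIndexer, typeIndexer, assembler, dt])  # indexers and assembler are defined elsewhere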
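
The grid above expands to every combination of the listed values, so cross-validation would evaluate 3 × 3 × 2 = 18 candidate settings per fold. The sketch below reproduces that expansion in plain Python with itertools.product. It only illustrates the idea and is not how ParamGridBuilder is implemented; the expand_grid helper is hypothetical. In Spark, the built grid is typically passed to CrossValidator through its estimatorParamMaps argument, together with an estimator and an evaluator.

```python
from itertools import product


def expand_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


# Same values as the parameter grid in the example above.
grid = {
    "maxBins": [49, 52, 55],
    "maxDepth": [4, 6, 8],
    "impurity": ["entropy", "gini"],
}
candidates = expand_grid(grid)
print(len(candidates))
print(candidates[0])
```

Because every added value multiplies the number of candidates, it may be worth keeping the grid small when the training data is large.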
