h2o DRF unseen categorical values handling

Views: 88 | Published: 2020/11/22 1:11:29 | Tags: random-forest, h2o

Question

The documentation for DRF states:

What happens when you try to predict on a categorical level not seen during training? DRF converts a new categorical level to a NA value in the test set, and then splits left on the NA value during scoring. The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin.
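The behavior in this quote can be illustrated with a toy sketch. It is not H2O's implementation: `encode_level`, `route`, and the example split are hypothetical names and values, chosen only to show an unseen level being mapped to NA and the NA then being sent left at a split.

```python
# Toy illustration of the quoted DRF description; not H2O's actual code.
NA = None  # stand-in for a missing value


def encode_level(value, train_domain):
    """Return the level if it was seen in training, otherwise NA."""
    return value if value in train_domain else NA


def route(value, left_levels, na_goes_left=True):
    """Route one value at a categorical split node.

    `na_goes_left=True` mirrors the DRF quote ("splits left on the NA value");
    whether that holds for every node is exactly what the question asks.
    """
    if value is NA:
        return "left" if na_goes_left else "right"
    return "left" if value in left_levels else "right"


train_domain = {"red", "green", "blue"}  # levels seen during training
left_levels = {"red"}                    # hypothetical split: {red} vs {green, blue}

for raw in ["red", "green", "purple", None]:
    level = encode_level(raw, train_domain)
    print(raw, "->", level, "->", route(level, left_levels))
```

In this sketch the unseen level `"purple"` and an explicit missing value take the same path, which is the only claim the quote makes about scoring.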

Questions:

1. So h2o converts unseen levels to NAs and then treats them the same way as NAs in the training data. But what if there are also no NAs in the training data?
2. Assume my categorical predictor is of enum type and to be understood as non-ordinal. What does "grouped with the outliers in the left-most bin" then mean? If the predictor is non-ordinal there is no "left-most" and there are no "outliers".
3. Let's put questions 1 and 2 aside and focus on the part "The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin". This is in contradiction to this SO answer showing a single DRF tree derived from a MOJO. One can clearly see that NAs go left and right. It also contradicts the answer to another question in the documentation that says "missing values as a separate category [...] can go either left or right", see:

How does the algorithm handle missing values during training? Missing values are interpreted as containing information (i.e., missing for a reason), rather than missing at random. During tree building, split decisions for every node are found by minimizing the loss function and treating missing values as a separate category that can go either left or right.
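A minimal sketch of what "a separate category that can go either left or right" could mean at a single node, assuming a squared-error loss and a fixed categorical split. The function names and data are hypothetical, and this is not how H2O implements split finding.

```python
# Toy illustration: pick the NA direction that minimizes squared error at one node.
# Not H2O's implementation; names and data are made up for this example.


def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_na_direction(rows, left_levels):
    """rows is a list of (level_or_None, target); returns 'left' or 'right'."""
    left = [y for x, y in rows if x is not None and x in left_levels]
    right = [y for x, y in rows if x is not None and x not in left_levels]
    na = [y for x, y in rows if x is None]
    loss_na_left = sse(left + na) + sse(right)
    loss_na_right = sse(left) + sse(right + na)
    return "left" if loss_na_left <= loss_na_right else "right"


rows = [("a", 1.0), ("a", 1.2), ("b", 5.0), ("b", 5.2), (None, 5.1), (None, 4.9)]
direction = best_na_direction(rows, left_levels={"a"})
print(direction)
```

Here the NA rows have targets close to the right-hand group, so sending them right gives the lower loss. With different data the same rule sends them left, which matches the "either left or right" wording rather than "always left".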

The last point is more of a suggestion than a question. The documentation on missing values for GBM says:

What happens when you try to predict on a categorical level not seen during training? Unseen categorical levels are turned into NAs, and thus follow the same behavior as an NA. If there are no NAs in the training data, then unseen categorical levels in the test data follow the majority direction (the direction with the most observations). If there are NAs in the training data, then unseen categorical levels in the test data follow the direction that is optimal for the NAs of the training data.

In contrast to the description of how DRF handles missing values, this seems to be completely consistent. Plus: using the majority path rather than always going left at split points appears to be more natural.
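The GBM rule quoted above can be sketched for a single node as follows. This is a toy under stated assumptions (squared-error loss, a fixed categorical split, ties broken to the left), with hypothetical names throughout; it is not H2O's code.

```python
# Toy illustration of the quoted GBM rule for one split node; not H2O's code.


def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def unseen_level_direction(rows, left_levels):
    """Direction an unseen level (treated as NA) would take at this node.

    rows is a list of (level_or_None, target) training observations.
    """
    left = [y for x, y in rows if x is not None and x in left_levels]
    right = [y for x, y in rows if x is not None and x not in left_levels]
    na = [y for x, y in rows if x is None]
    if not na:
        # No NAs in training: follow the majority direction.
        # Breaking ties to the left is an assumption of this sketch.
        return "left" if len(left) >= len(right) else "right"
    # NAs in training: follow the direction that is optimal for them.
    loss_na_left = sse(left + na) + sse(right)
    loss_na_right = sse(left) + sse(right + na)
    return "left" if loss_na_left <= loss_na_right else "right"


no_na_rows = [("a", 1.0), ("b", 5.0), ("b", 5.2), ("c", 4.8)]
with_na_rows = no_na_rows + [(None, 1.1)]

print(unseen_level_direction(no_na_rows, {"a"}))    # majority-direction case
print(unseen_level_direction(with_na_rows, {"a"}))  # optimal-for-NA case
```

The two calls exercise both branches of the quoted rule: without training NAs the unseen level follows the side with more observations, and with training NAs it follows whichever side fits those NAs best.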
