h2o DRF unseen categorical values handling

Problem description

The documentation for DRF states:

What happens when you try to predict on a categorical level not seen during training? DRF converts a new categorical level to a NA value in the test set, and then splits left on the NA value during scoring. The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin.

Questions:

  1. So h2o converts unseen levels to NAs and then treats them the same way as NAs in the training data. But what if there are also no NAs in the training data?
  2. Assume my categorical predictor is of enum type and to be understood as non-ordinal. What does "grouped with the outliers in the left-most bin" then mean? If the predictor is non-ordinal there is no "left-most" and there are no "outliers".
  3. Let's put questions 1 and 2 aside and focus on the part "The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin". This contradicts this SO answer, which shows a single DRF tree derived from a MOJO where one can clearly see that NAs go both left and right. It also contradicts the answer to another question in the documentation, which says "missing values as a separate category [...] can go either left or right":

How does the algorithm handle missing values during training? Missing values are interpreted as containing information (i.e., missing for a reason), rather than missing at random. During tree building, split decisions for every node are found by minimizing the loss function and treating missing values as a separate category that can go either left or right.

The last point is more of a suggestion than a question. The documentation on missing values for GBM says

What happens when you try to predict on a categorical level not seen during training? Unseen categorical levels are turned into NAs, and thus follow the same behavior as an NA. If there are no NAs in the training data, then unseen categorical levels in the test data follow the majority direction (the direction with the most observations). If there are NAs in the training data, then unseen categorical levels in the test data follow the direction that is optimal for the NAs of the training data.

In contrast to the description of how DRF handles missing values, this seems to be completely consistent. Plus: using the majority path rather than always going left at split points appears to be more natural.

Recommended answer

The sentence you pointed to, which seemed to contradict other portions of the docs, is in fact outdated. I have made a Jira ticket to update the FAQ with the correct answer, which is what you see in the GBM missing values section: the missing value handling is the same for GBM and DRF.
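To make the rule quoted from the GBM section concrete, here is a minimal pure-Python sketch of how a single categorical split node could learn an NA direction and then route an unseen level. It is not H2O's implementation: `Node`, `fit_na_direction`, `route`, and `sse` are hypothetical names invented for this illustration, squared error stands in for whatever loss the real algorithm minimizes, and sending ties to the left is an arbitrary choice.

```python
from dataclasses import dataclass


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


@dataclass
class Node:
    left_levels: frozenset  # known levels sent to the left child
    na_goes_left: bool      # NA direction learned at training time


def fit_na_direction(levels, targets, left_levels):
    """Pick the NA direction for one fixed split; None stands for NA."""
    pairs = list(zip(levels, targets))
    left = [t for lv, t in pairs if lv is not None and lv in left_levels]
    right = [t for lv, t in pairs if lv is not None and lv not in left_levels]
    na = [t for lv, t in pairs if lv is None]
    if not na:
        # No NAs in training: follow the majority direction.
        return len(left) >= len(right)
    # NAs in training: send them to the side that gives the lower loss.
    loss_if_left = sse(left + na) + sse(right)
    loss_if_right = sse(left) + sse(right + na)
    return loss_if_left <= loss_if_right


def route(node, value, known_levels):
    """Return 'left' or 'right' for a value at scoring time."""
    if value is None or value not in known_levels:
        # An unseen level is treated exactly like an NA.
        return "left" if node.na_goes_left else "right"
    return "left" if value in node.left_levels else "right"


known = {"a", "b", "c"}
left_levels = frozenset({"a"})

# Case 1: no NAs in training; one row goes left, three go right.
node_no_na = Node(
    left_levels,
    fit_na_direction(["a", "b", "b", "c"], [1.0, 5.0, 6.0, 5.5], left_levels),
)

# Case 2: NAs in training whose targets resemble the left child.
node_with_na = Node(
    left_levels,
    fit_na_direction(
        ["a", "b", "b", "c", None, None],
        [1.0, 5.0, 6.0, 5.5, 1.2, 0.8],
        left_levels,
    ),
)

print(route(node_no_na, "z", known))    # unseen level, majority direction
print(route(node_with_na, "z", known))  # unseen level, learned NA direction
```

In the first case the unseen level `"z"` follows the majority (right) branch. In the second it follows the branch that was better for the training NAs (left), so it does not always go left.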

As a side note, the enum data type is internally encoded as numeric values. You can learn more about the categorical encodings H2O can use here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html. For example, after the strings are mapped to integers for Enum, you can split {0, 1, 2, 3, 4, 5} as {0, 4, 5} and {1, 2, 3}.
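The subset split in that example can be sketched in a few lines of Python. This is an illustration only: the level names and the sorted index assignment are assumptions made for the demo, and `goes_left` is a hypothetical helper, not part of the H2O API.

```python
# Hypothetical enum column with six levels, mapped to integers 0..5.
domain = sorted(["blue", "green", "red", "white", "black", "yellow"])
level_to_int = {level: i for i, level in enumerate(domain)}

# In this sketch the split is a set membership test on the integer code,
# not a threshold, so no ordering of the levels is implied.
left_codes = {0, 4, 5}
right_codes = {1, 2, 3}


def goes_left(level):
    """True if the level's integer code belongs to the left subset."""
    return level_to_int[level] in left_codes


for level in domain:
    side = "left" if goes_left(level) else "right"
    print(level, level_to_int[level], side)
```

Because the two subsets are arbitrary sets of codes, "left" and "right" describe only which child a level is sent to, not a position on an ordered scale.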

Or take a look at how h2o-3 does binning for categoricals here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/histograms_and_binning.html
