H2O目标平均编码器“帧以相同顺序发送"错误 [英] H2O Target Mean Encoder "frames are being sent in the same order" ERROR

查看:50
本文介绍了H2O目标平均编码器“帧以相同顺序发送"错误的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在按照 H2O 示例在 Sparking Water(sparking water 2.4.2 和 H2O 3.22.04)中运行目标均值编码.它在以下所有段落中运行良好

I am following the H2O example to run target mean encoding in Sparking Water (sparking water 2.4.2 and H2O 3.22.04). It runs well in all the following paragraph

from h2o.targetencoder import TargetEncoder

# change label to factor
input_df_h2o['label'] = input_df_h2o['label'].asfactor()

# add fold column for Target Encoding
input_df_h2o["cv_fold_te"] = input_df_h2o.kfold_column(n_folds = 5, seed = 54321)

# find all categorical features
cat_features = [k for (k,v) in input_df_h2o.types.items() if v in ('string')]
# convert string to factor
for i in cat_features:
    input_df_h2o[i] = input_df_h2o[i].asfactor()

# target mean encode
targetEncoder = TargetEncoder(x= cat_features, y = y, fold_column = "cv_fold_te", blending_avg=True)
targetEncoder.fit(input_df_h2o)

但是当我开始使用用于拟合目标编码器的相同数据集来运行转换代码时(请参阅下面的代码):

But when I start to use the same data set used to fit Target Encoder to run the transform code (see code below):

ext_input_df_h2o = targetEncoder.transform(frame=input_df_h2o,
                                    holdout_type="kfold", # mean is calculating on out-of-fold data only; loo means leave one out
                                    is_train_or_valid=True,
                                    noise = 0, # determines if random noise should be added to the target average
                                    seed=54321)

我会有错误喜欢

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-6773422589366407956.py", line 331, in <module>
    exec(code)
  File "<stdin>", line 5, in <module>
  File "/usr/lib/envs/env-1101-ver-1619-a-4.2.9-py-3.5.3/lib/python3.5/site-packages/h2o/targetencoder.py", line 97, in transform
    assert self._encodingMap.map_keys['string'] == self._teColumns
AssertionError

我在其源代码中找到了代码但如何解决这个问题?它与用于运行 fit 的表相同.

I found the code in its source code http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/targetencoder.html but how to fix this issue? It is the same table used to run the fit.

推荐答案

问题是因为您正在尝试对多个分类特征进行编码.我认为这是 H2O 的一个错误,但您可以解决将转换器放入循环遍历所有分类名称的问题.

The issue is because you are trying encoding multiple categorical features. I think that is a bug of H2O, but you can solve putting the transformer in a for loop that iterate over all categorical names.

import numpy as np
import pandas as pd
import h2o
from h2o.targetencoder import TargetEncoder
h2o.init()

df = pd.DataFrame({
    'x_0': ['a'] * 5 + ['b'] * 5,
    'x_1': ['c'] * 9 + ['d'] * 1,
    'x_2': ['a'] * 3 + ['b'] * 7,
    'y_0': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})

hf = h2o.H2OFrame(df)
hf['cv_fold_te'] = hf.kfold_column(n_folds=2, seed=54321)
hf['y_0'] = hf['y_0'].asfactor()
cat_features = ['x_0', 'x_1', 'x_2']

for item in cat_features:
    target_encoder = TargetEncoder(x=[item], y='y_0', fold_column = 'cv_fold_te')
    target_encoder.fit(hf)
    hf = target_encoder.transform(frame=hf, holdout_type='kfold',
                                  seed=54321, noise=0.0)
hf

这篇关于H2O目标平均编码器“帧以相同顺序发送"错误的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆