H2O Target Mean Encoder "frames are being sent in the same order" ERROR


Problem description

I am following the H2O example to run target mean encoding in Sparkling Water (Sparkling Water 2.4.2 and H2O 3.22.04). The following code runs without errors:

from h2o.targetencoder import TargetEncoder

# change label to factor
input_df_h2o['label'] = input_df_h2o['label'].asfactor()

# add fold column for Target Encoding
input_df_h2o["cv_fold_te"] = input_df_h2o.kfold_column(n_folds = 5, seed = 54321)

# find all categorical features
cat_features = [k for (k, v) in input_df_h2o.types.items() if v == 'string']
# convert string to factor
for i in cat_features:
    input_df_h2o[i] = input_df_h2o[i].asfactor()

# target mean encode
targetEncoder = TargetEncoder(x= cat_features, y = y, fold_column = "cv_fold_te", blending_avg=True)
targetEncoder.fit(input_df_h2o)

But when I run the transform code on the same data set that was used to fit the Target Encoder (see the code below):

ext_input_df_h2o = targetEncoder.transform(frame=input_df_h2o,
                                    holdout_type="kfold", # mean is calculated on out-of-fold data only; "loo" means leave-one-out
                                    is_train_or_valid=True,
                                    noise = 0, # determines if random noise should be added to the target average
                                    seed=54321)

I get an error like this:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-6773422589366407956.py", line 331, in <module>
    exec(code)
  File "<stdin>", line 5, in <module>
  File "/usr/lib/envs/env-1101-ver-1619-a-4.2.9-py-3.5.3/lib/python3.5/site-packages/h2o/targetencoder.py", line 97, in transform
    assert self._encodingMap.map_keys['string'] == self._teColumns
AssertionError

I found the assertion in the source code at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/targetencoder.html, but how do I fix this issue? The frame passed to transform is the same one used to run the fit.

Recommended answer

The issue arises because you are trying to encode multiple categorical features at once. I think this is a bug in H2O, but you can work around it by putting the encoder in a for loop that iterates over the categorical column names, fitting and transforming one column at a time.
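One possible reading of the failing assertion in the traceback, assuming both sides of the `==` are plain Python lists of column names: list equality is order-sensitive, so the same names in a different order compare unequal, while a single-column list cannot be reordered. The sketch below does not call H2O; the column names and the reordered list are made up for illustration, and whether the encoding map really returns its keys in a different order is an assumption.

```python
# Hypothetical column names; neither list comes from a real H2O call.
te_columns = ['x_0', 'x_1', 'x_2']      # order passed to the encoder
returned_keys = ['x_2', 'x_0', 'x_1']   # same names, different order (assumed)

# List equality compares element by element, so order matters.
same_order = returned_keys == te_columns

# Comparing sorted copies ignores order and only checks the names.
same_names = sorted(returned_keys) == sorted(te_columns)

# With a single column there is only one possible order.
single_ok = ['x_0'] == ['x_0']

print(same_order, same_names, single_ok)
```

If this reading is right, it would explain why encoding one column per encoder avoids the assertion, as the code from the answer does next.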

import numpy as np
import pandas as pd
import h2o
from h2o.targetencoder import TargetEncoder
h2o.init()

df = pd.DataFrame({
    'x_0': ['a'] * 5 + ['b'] * 5,
    'x_1': ['c'] * 9 + ['d'] * 1,
    'x_2': ['a'] * 3 + ['b'] * 7,
    'y_0': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})

hf = h2o.H2OFrame(df)
hf['cv_fold_te'] = hf.kfold_column(n_folds=2, seed=54321)
hf['y_0'] = hf['y_0'].asfactor()
cat_features = ['x_0', 'x_1', 'x_2']

for item in cat_features:
    target_encoder = TargetEncoder(x=[item], y='y_0', fold_column = 'cv_fold_te')
    target_encoder.fit(hf)
    hf = target_encoder.transform(frame=hf, holdout_type='kfold',
                                  seed=54321, noise=0.0)
hf
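
For intuition about what `holdout_type='kfold'` produces for each column in the loop above, here is a minimal pure-Python sketch of out-of-fold target mean encoding. It is not H2O's implementation: it applies no blending and no noise, the function name `kfold_target_mean` and the fold assignment are hypothetical, and falling back to the global mean when a category has no out-of-fold rows is an assumption of this sketch.

```python
from collections import defaultdict

def kfold_target_mean(categories, targets, folds):
    """Return, for each row, the target mean of its category computed
    only from rows that belong to the other folds."""
    fold_sum = defaultdict(float)   # target sum per (category, fold)
    fold_cnt = defaultdict(int)     # row count per (category, fold)
    cat_sum = defaultdict(float)    # target sum per category, all folds
    cat_cnt = defaultdict(int)      # row count per category, all folds
    for c, t, f in zip(categories, targets, folds):
        fold_sum[(c, f)] += t
        fold_cnt[(c, f)] += 1
        cat_sum[c] += t
        cat_cnt[c] += 1

    # Assumed fallback when a category has no rows outside the current fold.
    prior = sum(targets) / len(targets)

    encoded = []
    for c, f in zip(categories, folds):
        n = cat_cnt[c] - fold_cnt[(c, f)]   # out-of-fold row count
        s = cat_sum[c] - fold_sum[(c, f)]   # out-of-fold target sum
        encoded.append(s / n if n > 0 else prior)
    return encoded

# Same toy data as the answer, with a hand-written two-fold assignment.
x_0 = ['a'] * 5 + ['b'] * 5
y_0 = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
fold = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

encoded = kfold_target_mean(x_0, y_0, fold)
print(encoded)
```

Each row's encoding never uses that row's own fold, which is what keeps the target from leaking into the feature during cross-validation.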
