What is the logic of the extra columns in TensorFlow categorical encoding?

Question

I am following the official TensorFlow tutorial for preprocessing layers, and I am not sure I understand why I end up with these extra columns after the categorical encoding.

Here is a stripped-down minimal reproducible example (including the data):

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import pathlib

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url, extract=True, cache_dir='.')
df = pd.read_csv(csv_file)

# In the original dataset "4" indicates the pet was not adopted.
df['target'] = np.where(df['AdoptionSpeed']==4, 0, 1)
# Drop un-used columns.
df = df.drop(columns=['AdoptionSpeed', 'Description'])

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  ds = ds.prefetch(batch_size)
  return ds

batch_size = 5
ds = df_to_dataset(df, batch_size=batch_size)
[(train_features, label_batch)] = ds.take(1)

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a StringLookup layer which will turn strings into integer indices
  if dtype == 'string':
    index = preprocessing.StringLookup(max_tokens=max_tokens)
  else:
    index = preprocessing.IntegerLookup(max_values=max_tokens)

  # Prepare a Dataset that only yields our feature
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Create a CategoryEncoding layer for our integer indices.
  encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())
  #encoder = preprocessing.CategoryEncoding(max_tokens=2)

  # Prepare a Dataset that only yields our feature.
  feature_ds = feature_ds.map(index)

  # Learn the space of possible indices.
  encoder.adapt(feature_ds)

  # Apply one-hot encoding to our indices. The lambda function captures the
  # layer so we can use them, or include them in the functional model later.
  return lambda feature: encoder(index(feature))

So, after running

type_col = train_features['Type']
layer = get_category_encoding_layer('Type', ds, 'string')
layer(type_col)

I get the result:

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], dtype=float32)>

which is indeed similar to what is shown in the tutorial.

Notice that this is a binary classification problem (Cat/Dog):

np.unique(type_col)
# array([b'Cat', b'Dog'], dtype=object)

So, what is the logic of the 2 extra columns after the categorical encoding shown in the result above? What do they represent, and why are there 2 of them (and not, say, 1, or 3, or more)?

(I am perfectly aware that, should I wish for a simple one-hot encoding, I could simply use to_categorical(), but this is not the question here.)

Recommended answer

As already implied in the question, categorical encoding is somewhat richer than simple one-hot encoding. To see what these two columns represent, it suffices to add a diagnostic print somewhere inside the get_category_encoding_layer() function, after the call to index.adapt():

print(index.get_vocabulary())

Then the result of the last commands will be:

['', '[UNK]', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], dtype=float32)>

The hint should hopefully be clear: the extra two columns here represent the empty value '' and unknown values '[UNK]', respectively, which could be present in future (unseen) data.

This is actually determined by the default arguments, not of CategoryEncoding, but of the preceding StringLookup; from the docs:

mask_token=''
oov_token='[UNK]'
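
To make the index layout concrete, here is a minimal pure-Python sketch that mimics the behaviour described above. It is only an illustration, not TensorFlow's implementation: the helper names build_vocabulary, lookup and one_hot are hypothetical, and the known tokens are passed in by hand in the order shown in the printed vocabulary rather than learned from data.

```python
# Illustrative sketch only: plain-Python imitation of "lookup, then one-hot".
# build_vocabulary, lookup and one_hot are hypothetical helpers, not TensorFlow APIs.

def build_vocabulary(tokens, mask_token='', oov_token='[UNK]'):
    """Reserve leading slots for the mask and OOV tokens, then append known tokens."""
    reserved = []
    if mask_token is not None:
        reserved.append(mask_token)
    if oov_token is not None and oov_token not in reserved:
        reserved.append(oov_token)
    return reserved + list(tokens)


def lookup(value, vocabulary, oov_token='[UNK]'):
    """Return the index of value; unseen values fall back to the OOV slot."""
    if value in vocabulary:
        return vocabulary.index(value)
    return vocabulary.index(oov_token)


def one_hot(index, depth):
    """Return a list of length depth with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


# Two known categories plus the two reserved slots give four columns.
vocab = build_vocabulary(['Dog', 'Cat'])
encoded = [one_hot(lookup(v, vocab), len(vocab)) for v in ['Dog', 'Cat', 'Bird']]

print(vocab)
for row in encoded:
    print(row)
```

In this sketch 'Dog' and 'Cat' land in columns 2 and 3, as in the tensor above, while the unseen value 'Bird' falls into column 1, the slot reserved for '[UNK]'. Column 0 stays reserved for the empty mask value.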

You can end up with a somewhat tighter encoding (only 1 extra column instead of 2) by asking for oov_token='' instead of oov_token='[UNK]' and disabling the mask token with mask_token=None; replace the call to StringLookup in the get_category_encoding_layer() function with

    index = preprocessing.StringLookup(oov_token='', mask_token=None, max_tokens=max_tokens)

after which, the result will be:

['', 'Dog', 'Cat']
<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]], dtype=float32)>

i.e. with only 3 columns (without a dedicated one for '[UNK]'). AFAIK, this is the lowest you can go: attempting to set both mask_token and oov_token to None will result in an error.
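
As a rough illustration of what the tighter layout means for unseen data, the following pure-Python sketch imitates an encoding with a single reserved slot. The encode helper is hypothetical and only mirrors the vocabulary printed above; it is not how the Keras layers are implemented.

```python
# Illustrative sketch only: one reserved slot (the OOV token '') and no mask slot.
# encode is a hypothetical helper, not a TensorFlow API.

def encode(values, known_tokens, oov_token=''):
    """One-hot encode values against [oov_token] + known_tokens."""
    vocabulary = [oov_token] + list(known_tokens)
    rows = []
    for value in values:
        # Anything not in the vocabulary shares index 0, the OOV slot.
        index = vocabulary.index(value) if value in vocabulary else 0
        rows.append([1.0 if i == index else 0.0 for i in range(len(vocabulary))])
    return vocabulary, rows


vocabulary, rows = encode(['Dog', 'Cat', 'Bird'], ['Dog', 'Cat'])

print(vocabulary)
for row in rows:
    print(row)
```

Here the unseen value 'Bird' is encoded in column 0, so the single extra column still gives future, unknown categories somewhere to go.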
