How to create embeddings for a column that is a list of categorical values

Question

I am having trouble deciding how to create embeddings for a categorical feature in my DNN model. The feature consists of a variable-length set of tags.

The feature looks like this:

column = [['Adventure','Animation','Comedy'],
          ['Adventure','Comedy'],
          ['Adventure','Children','Comedy']]

I would like to do this with TensorFlow, and I know the tf.feature_column module should work; I just don't know which variant to use.

Thanks!

Recommended answer

First, pad the feature lists so that every row has the same length.

import itertools
import numpy as np

column = np.array(list(itertools.zip_longest(*column, fillvalue='UNK'))).T
print(column)

[['Adventure' 'Animation' 'Comedy']
 ['Adventure' 'Comedy' 'UNK']
 ['Adventure' 'Children' 'Comedy']]
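
The padding step can also be sketched in plain Python, together with deriving the vocabulary from the data instead of typing it by hand. The names pad_tags, build_vocabulary, raw_column, and padded are hypothetical helpers introduced for illustration; they are not part of the original answer or of TensorFlow.

```python
def pad_tags(rows, fill="UNK"):
    """Right-pad every tag list with `fill` so all rows share one length."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


def build_vocabulary(rows):
    """Collect the distinct tags in first-seen order."""
    seen = {}
    for row in rows:
        for tag in row:
            seen.setdefault(tag, len(seen))
    return list(seen)


# Same data as in the question, under a separate (hypothetical) name.
raw_column = [['Adventure', 'Animation', 'Comedy'],
              ['Adventure', 'Comedy'],
              ['Adventure', 'Children', 'Comedy']]

padded = pad_tags(raw_column)
vocabulary = build_vocabulary(raw_column)
print(padded)
print(vocabulary)
```

Building the vocabulary from the unpadded rows keeps the fill value out of it, which matters for the next step: the fill value should stay out of vocabulary so that it does not receive an embedding of its own.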

Then you can use tf.feature_column.embedding_column to create embeddings for the categorical feature. The input to embedding_column must be a CategoricalColumn created by one of the categorical_column_* functions.

import tensorflow as tf

# If the vocabulary is large and stored in a file, use
# tf.feature_column.categorical_column_with_vocabulary_file instead.
cat_fc = tf.feature_column.categorical_column_with_vocabulary_list(
    'cat_data',  # key identifying the input feature
    ['Adventure', 'Animation', 'Comedy', 'Children'],  # vocabulary list
    dtype=tf.string,
    default_value=-1)  # ID assigned to out-of-vocabulary values such as 'UNK'

cat_column = tf.feature_column.embedding_column(
    categorical_column=cat_fc,
    dimension=5,  # size of each embedding vector
    combiner='mean')  # how multiple tags in one row are reduced

categorical_column_with_vocabulary_list ignores 'UNK' because it is not in the vocabulary list, so the padding does not affect the result. dimension specifies the size of the embedding, and combiner specifies how to reduce multiple entries in a single row; 'mean' is the default in embedding_column.
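
To make the effect of the 'mean' combiner concrete, here is a minimal NumPy sketch of the idea: look up one vector per known tag and average them, skipping anything outside the vocabulary. It is an illustration of the reduction, not TensorFlow's implementation; the names table, index, and mean_embedding are hypothetical, the table holds random values in place of trained weights, and returning zeros for a row with no known tags is an assumption of this sketch.

```python
import numpy as np

vocabulary = ['Adventure', 'Animation', 'Comedy', 'Children']
index = {tag: i for i, tag in enumerate(vocabulary)}

# Hypothetical embedding table: one 5-dimensional row per vocabulary entry.
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocabulary), 5))


def mean_embedding(tags):
    """Average the embedding rows of the in-vocabulary tags of one example."""
    ids = [index[t] for t in tags if t in index]  # 'UNK' is skipped here
    if not ids:
        # Assumption of this sketch: no known tags gives a zero vector.
        return np.zeros(table.shape[1])
    return table[ids].mean(axis=0)


padded_row = ['Adventure', 'Comedy', 'UNK']
unpadded_row = ['Adventure', 'Comedy']
print(np.allclose(mean_embedding(padded_row), mean_embedding(unpadded_row)))
```

Because the fill value is dropped before averaging, the padded and unpadded rows produce the same vector, which is why padding with an out-of-vocabulary token is safe with this combiner.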

Result:

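# This uses the TensorFlow 1.x graph/session API.
tensor = tf.feature_column.input_layer({'cat_data': column}, [cat_column])

with tf.Session() as session:
    session.run(tf.global_variables_initializer())  # initializes the embedding weights
    session.run(tf.tables_initializer())  # initializes the vocabulary lookup table
    print(session.run(tensor))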

[[-0.694761   -0.0711766   0.05720187  0.01770079 -0.09884425]
 [-0.8362482   0.11640486 -0.01767573 -0.00548441 -0.05738768]
 [-0.71162754 -0.03012567  0.15568805  0.00752804 -0.1422816 ]]
