How to create embeddings for a column that is a list of categorical values

Question

I am having some trouble deciding how to create embeddings for a categorical feature in my DNN model. The feature consists of a variable-length set of tags.

The feature looks like this:

column = [['Adventure', 'Animation', 'Comedy'],
          ['Adventure', 'Comedy'],
          ['Adventure', 'Children', 'Comedy']]

I would like to do this with TensorFlow, so I know the tf.feature_column module should work; I just don't know which column type to use.

Thanks!

Recommended answer

First, you need to pad your feature lists to the same length.

import itertools
import numpy as np

column = np.array(list(itertools.zip_longest(*column, fillvalue='UNK'))).T
print(column)

[['Adventure' 'Animation' 'Comedy']
 ['Adventure' 'Comedy' 'UNK']
 ['Adventure' 'Children' 'Comedy']]
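
For readers who find the transposed zip_longest call hard to follow, the same padding can be sketched in plain Python. The helper name pad_rows is hypothetical (it is not part of the original answer); it right-pads each row with the fill value, which should give the same rows as the NumPy version above, only as nested lists rather than an array.

```python
def pad_rows(rows, fill='UNK'):
    """Right-pad every row with `fill` so that all rows share the longest length."""
    # Hypothetical helper for illustration; not part of the original answer.
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


column = [['Adventure', 'Animation', 'Comedy'],
          ['Adventure', 'Comedy'],
          ['Adventure', 'Children', 'Comedy']]

padded = pad_rows(column)
print(padded[1])  # ['Adventure', 'Comedy', 'UNK']
```

The nested list can be wrapped in np.array(...) if an array is needed downstream.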

Then you can use tf.feature_column.embedding_column to create embeddings for a categorical feature. The input to embedding_column must be a CategoricalColumn created by one of the categorical_column_* functions.

import tensorflow as tf  # TensorFlow 1.x API

# For a large vocabulary stored in a file, use
# tf.feature_column.categorical_column_with_vocabulary_file instead.
cat_fc = tf.feature_column.categorical_column_with_vocabulary_list(
    'cat_data',  # key identifying the input feature
    ['Adventure', 'Animation', 'Comedy', 'Children'],  # vocabulary list
    dtype=tf.string,
    default_value=-1)  # out-of-vocabulary values such as 'UNK' map to -1

cat_column = tf.feature_column.embedding_column(
    categorical_column=cat_fc,
    dimension=5,  # size of each embedding vector
    combiner='mean')  # average the embeddings of all tags in a row

categorical_column_with_vocabulary_list ignores 'UNK' because it is not in the vocabulary list: it is mapped to default_value (-1) and skipped. dimension sets the size of the embedding, and combiner specifies how to reduce multiple entries in a single row; 'mean' is the default in embedding_column.
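
To make the lookup-then-combine behaviour concrete, here is a minimal pure-Python sketch of what a 'mean' combiner does conceptually. It is not TensorFlow's implementation: the embedding table is a hypothetical randomly initialised list of vectors, the helper names to_ids and mean_embedding are invented for illustration, and returning a zero vector for a row with no known tags is an assumption of this sketch.

```python
import random

VOCAB = ['Adventure', 'Animation', 'Comedy', 'Children']
DIM = 5

# Hypothetical embedding table: one randomly initialised vector per vocabulary entry.
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def to_ids(tags, default=-1):
    """Map each tag to its vocabulary index, or `default` if it is out of vocabulary."""
    index = {tag: i for i, tag in enumerate(VOCAB)}
    return [index.get(tag, default) for tag in tags]


def mean_embedding(tags):
    """Average the vectors of the in-vocabulary tags, skipping negative ids."""
    ids = [i for i in to_ids(tags) if i >= 0]
    if not ids:
        return [0.0] * DIM  # assumption of this sketch: empty rows give a zero vector
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(DIM)]


row = ['Adventure', 'Comedy', 'UNK']
print(to_ids(row))  # [0, 2, -1]
print(mean_embedding(row))
```

In this sketch the padded row ['Adventure', 'Comedy', 'UNK'] yields the average of just the 'Adventure' and 'Comedy' vectors, which is why the padding value does not distort the result.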

Result (the values depend on the random initialization of the embedding, so yours will differ):

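# Build the dense input tensor from the padded feature (TensorFlow 1.x graph mode).
tensor = tf.feature_column.input_layer({'cat_data': column}, [cat_column])

with tf.Session() as session:
    session.run(tf.global_variables_initializer())  # initialize the embedding weights
    session.run(tf.tables_initializer())  # initialize the vocabulary lookup table
    print(session.run(tensor))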

[[-0.694761   -0.0711766   0.05720187  0.01770079 -0.09884425]
 [-0.8362482   0.11640486 -0.01767573 -0.00548441 -0.05738768]
 [-0.71162754 -0.03012567  0.15568805  0.00752804 -0.1422816 ]]
