How to create embeddings for a column that is a list of categorical values
Question
I am having some trouble deciding how to create embeddings for a categorical feature in my DNN model. The feature consists of a variable-length list of tags.
The feature looks like this:
column = [['Adventure', 'Animation', 'Comedy'],
          ['Adventure', 'Comedy'],
          ['Adventure', 'Children', 'Comedy']]
I would like to do this with TensorFlow. I know the tf.feature_column module should work, I just don't know which column type to use.
Thanks!
Recommended answer
First, you need to pad your feature lists to the same length.
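import itertools
import numpy as np
import tensorflow as tf  # TensorFlow 1.x; used in the snippets that follow

# Pad every row to the length of the longest one, using 'UNK' as the filler.
column = np.array(list(itertools.zip_longest(*column, fillvalue='UNK'))).T
print(column)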
[['Adventure' 'Animation' 'Comedy']
['Adventure' 'Comedy' 'UNK']
['Adventure' 'Children' 'Comedy']]
Then you can use tf.feature_column.embedding_column to create embeddings for the categorical feature. The input to embedding_column must be a CategoricalColumn created by one of the categorical_column_* functions.
# If you have a large vocabulary stored in a file, you can use
# tf.feature_column.categorical_column_with_vocabulary_file instead.
cat_fc = tf.feature_column.categorical_column_with_vocabulary_list(
    'cat_data',  # key identifying the input feature
    ['Adventure', 'Animation', 'Comedy', 'Children'],  # vocabulary list
    dtype=tf.string,
    default_value=-1)
cat_column = tf.feature_column.embedding_column(
    categorical_column=cat_fc,
    dimension=5,
    combiner='mean')
categorical_column_with_vocabulary_list will ignore 'UNK', since 'UNK' is not in the vocabulary list. dimension specifies the dimension of the embedding, and combiner specifies how to reduce multiple entries in a single row; 'mean' is the default in embedding_column.
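To make the roles of default_value and combiner more concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not TensorFlow's implementation: the names lookup_ids, mean_combine and embedding_table are hypothetical, the table is filled with random numbers rather than trained weights, and returning a zero vector for a row with no known tags is an assumption of this sketch.

```python
import random

vocabulary = ['Adventure', 'Animation', 'Comedy', 'Children']
dimension = 5

# Hypothetical embedding table: one random vector per vocabulary entry.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(dimension)]
                   for _ in vocabulary]


def lookup_ids(tags, default_value=-1):
    """Map each tag to its vocabulary index, or default_value if unknown."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    return [index.get(tag, default_value) for tag in tags]


def mean_combine(ids):
    """Average the embedding vectors of the valid ids, skipping negative ids."""
    valid = [i for i in ids if i >= 0]
    if not valid:
        # Assumption of this sketch: a row with no known tags maps to zeros.
        return [0.0] * dimension
    return [sum(embedding_table[i][d] for i in valid) / len(valid)
            for d in range(dimension)]


row = ['Adventure', 'Comedy', 'UNK']
ids = lookup_ids(row)       # 'UNK' is out of vocabulary, so it becomes -1
vector = mean_combine(ids)  # mean of the 'Adventure' and 'Comedy' vectors
```

In this sketch the padded 'UNK' entry contributes nothing to the row's vector, which is why padding with a value outside the vocabulary does not distort the mean.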
Result:
tensor = tf.feature_column.input_layer({'cat_data': column}, [cat_column])
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    print(session.run(tensor))
[[-0.694761 -0.0711766 0.05720187 0.01770079 -0.09884425]
[-0.8362482 0.11640486 -0.01767573 -0.00548441 -0.05738768]
[-0.71162754 -0.03012567 0.15568805 0.00752804 -0.1422816 ]]