Feature Columns Embedding lookup


Question

I have been working with the datasets and feature_columns in tensorflow (https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html). I see they have categorical features and a way to create embedding features from categorical features. But when working on NLP tasks, how do we create a single embedding lookup?

For example, consider a text classification task. Every data point would have many textual columns, but they would not be separate categories. How do we create and use a single embedding lookup for all these columns?

Below is an example of how I am currently using the embedding features. I build a categorical feature for each column and use it to create an embedding. The problem is that the embedding for the same word could be different in different columns.

def create_embedding_features(key, vocab_list=None, embedding_size=20):
    cat_feature = tf.feature_column.categorical_column_with_vocabulary_list(
        key=key,
        vocabulary_list=vocab_list)
    embedding_feature = tf.feature_column.embedding_column(
        categorical_column=cat_feature,
        dimension=embedding_size)
    return embedding_feature

le_features_embd = [create_embedding_features(f, vocab_list=vocab_list)
                    for f in feature_keys]

Answer

I think you have a misunderstanding. For a text classification task, if your input is a piece of text (a sentence), you should treat the entire sentence as a single feature column. Thus every data point has only a single textual column, NOT many columns. The value in this column is usually a combined embedding of all the tokens. That is how we convert a var-length sparse feature (an unknown number of text tokens) into one dense feature (e.g., a fixed 256-dimensional float vector).
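The "combined embedding" idea can be illustrated without TensorFlow: map each token to a vector and average the vectors (the `mean` combiner, which is the default for `embedding_column` on multi-valued inputs), so a sentence of any length yields one fixed-size dense vector. The tiny vocabulary and 4-dimensional embedding values below are made up purely for illustration; in the real column they are learned parameters.

```python
# Toy illustration: var-length token list -> fixed-size dense vector.
# Embedding values are made up; embedding_column would learn them.
embeddings = {
    "this":    [0.1, 0.2, 0.3, 0.4],
    "product": [0.5, 0.6, 0.7, 0.8],
    "is":      [0.9, 1.0, 1.1, 1.2],
}

def combined_embedding(tokens, table, dim=4):
    """Average the embeddings of all known tokens (the 'mean' combiner)."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

sentence = ["this", "product", "is"]
dense = combined_embedding(sentence, embeddings)
# The result always has 4 dimensions, however many tokens the sentence has.
```

The point is that the sentence enters the model as one dense vector, not as one column per word.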

Let's start with a _CategoricalColumn:

cat_column_with_vocab = tf.feature_column.categorical_column_with_vocabulary_list(
    key='my-text',
    vocabulary_list=vocab_list)
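Conceptually, this column maps each raw string token to its integer index in the vocabulary. A toy sketch of that lookup (with the real API, out-of-vocabulary tokens get `default_value=-1` unless you configure `num_oov_buckets`):

```python
# Toy sketch of what categorical_column_with_vocabulary_list does:
# each token becomes its index in the vocabulary list.
vocab_list = ["this", "product", "is", "for sale", "within", "us"]
token_to_id = {token: i for i, token in enumerate(vocab_list)}

def lookup_ids(tokens, table, default=-1):
    """Map raw string tokens to integer vocabulary ids (-1 for OOV)."""
    return [table.get(t, default) for t in tokens]

ids = lookup_ids(["product", "is", "unseen-token"], token_to_id)
```

These integer ids are what the embedding column then turns into dense vectors.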

Note that if your vocabulary is huge, you should use categorical_column_with_vocabulary_file.

We create an embedding column by using an initializer that either reads from a checkpoint (if we have a pre-trained embedding) or randomizes:

embedding_initializer = None
if has_pretrained_embedding:
  embedding_initializer = tf.contrib.framework.load_embedding_initializer(
      ckpt_path=xxxx)
else:
  embedding_initializer = tf.random_uniform_initializer(-1.0, 1.0)
embed_column = tf.feature_column.embedding_column(
    categorical_column=cat_column_with_vocab,
    dimension=256,   # this is your pre-trained embedding dimension
    initializer=embedding_initializer,
    trainable=False)

Suppose you have another dense feature, price:

price_column = tf.feature_column.numeric_column('price')

Create the feature columns:

columns = [embed_column, price_column]

Build the model:

features = tf.parse_example(...,
    features=tf.feature_column.make_parse_example_spec(columns))
dense_tensor = tf.feature_column.input_layer(features, columns)
for units in [128, 64, 32]:
  dense_tensor = tf.layers.dense(dense_tensor, units, tf.nn.relu)
prediction = tf.layers.dense(dense_tensor, 1)

By the way, for tf.parse_example to work, this assumes your input data is a tf.Example like this (text protobuf):

features {
  feature {
    key: "price"
    value { float_list {
      value: 29.0
    }}
  }
  feature {
    key: "my-text"
    value { bytes_list {
      value: "this"
      value: "product"
      value: "is"
      value: "for sale"
      value: "within"
      value: "us"
    }}
  }
}

That is, I assume you have two feature types: one is the product price, and the other is the text description of the product. Your vocabulary list would be a superset of

["this", "product", "is", "for sale", "within", "us"].

