Getting free text features into Tensorflow Canned Estimators with Dataset API via feature_columns

Question

I'm trying to build a model that gives reddit_score = f('subreddit', 'comment').

Mainly this is an example I can then build on for a work project.

My code is here.

My problem is that I see that canned estimators, e.g. DNNLinearCombinedRegressor, must have feature_columns that are part of the FeatureColumn class.

I have my vocab file, and I know that if I were to just limit to the first word of a comment I could do something like:

tf.feature_column.categorical_column_with_vocabulary_file(
        key='comment',
        vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
        )

But if I'm passing in, say, the first 10 words from a comment, then I'm not sure how to go from a string like "this is a pre padded 10 word comment xyzpadxyz xyzpadxyz" to a feature_column such that I can then build an embedding to pass to the deep features in a wide and deep model.

It seems like it must be something really obvious or simple, but I can't for the life of me find any existing examples with this particular setup (canned wide and deep, Dataset API, and a mix of features, e.g. subreddit and a raw text feature like comment).

I was even thinking about doing the vocab integer lookup myself, such that the comment feature I pass in would be something like [23,45,67,12,1,345,7,99,999,999], and then maybe I could get it in via numeric_column with a shape and do something with it from there. But this feels a bit odd.
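
For reference, that fallback could look roughly like the sketch below. The comment_ids key, the fixed length of 10, and the use of 999 as a padding id are hypothetical, not from the original post.

import tensorflow as tf

# Hypothetical sketch of the manual-lookup idea: each comment is
# pre-converted (outside the graph) to a fixed-length list of vocab ids,
# which can then be fed in as a plain numeric feature of shape (10,).
comment_ids = tf.feature_column.numeric_column(key='comment_ids', shape=(10,))

# The input_fn would then need to yield features such as:
# {'subreddit': 'news', 'comment_ids': [23, 45, 67, 12, 1, 345, 7, 99, 999, 999]}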

Answer

Adding an answer as per the approach from @Lak's post, but adapted a little for the Dataset API.

# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
def read_dataset(prefix, mode, batch_size):

    def _input_fn():

        def decode_csv(value_column):

            columns = tf.decode_csv(value_column, field_delim='|', record_defaults=DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))

            # Split the raw comment string into words, then pad and
            # truncate so every example is exactly MAX_DOCUMENT_LENGTH
            # tokens long (padded with PADWORD).
            words = tf.string_split([features['comment']])
            words = tf.sparse_tensor_to_dense(words, default_value=PADWORD)
            words = tf.pad(words, [[0, 0], [0, MAX_DOCUMENT_LENGTH]])
            features['comment_words'] = tf.slice(words, [0, 0], [-1, MAX_DOCUMENT_LENGTH])

            label = features.pop(LABEL_COLUMN)

            return features, label

        # Use prefix to create file path
        file_path = '{}/{}*{}*'.format(INPUT_DIR, prefix, PATTERN)

        # Create list of files that match pattern
        file_list = tf.gfile.Glob(file_path)

        # Create dataset from file list
        dataset = (tf.data.TextLineDataset(file_list)  # Read text file
                    .map(decode_csv))  # Transform each elem by applying decode_csv fn

        tf.logging.info("...dataset.output_types={}".format(dataset.output_types))
        tf.logging.info("...dataset.output_shapes={}".format(dataset.output_shapes))

        if mode == tf.estimator.ModeKeys.TRAIN:

            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)

        else:

            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)

        return dataset.make_one_shot_iterator().get_next()

    return _input_fn
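
As a usage sketch (not part of the original answer), the returned input function plugs straight into the estimator's train/evaluate calls; the 'train'/'eval' file prefixes, batch size, and step count below are illustrative assumptions.

# Illustrative wiring only; `estimator` is the DNNLinearCombinedRegressor
# built from get_wide_deep() below, and the prefixes/sizes are assumptions.
estimator.train(
    input_fn=read_dataset('train', tf.estimator.ModeKeys.TRAIN, batch_size=512),
    max_steps=1000)

estimator.evaluate(
    input_fn=read_dataset('eval', tf.estimator.ModeKeys.EVAL, batch_size=512))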

Then, in the function below, we can reference the comment_words field we created as part of decode_csv():

# Define feature columns
def get_wide_deep():

    EMBEDDING_SIZE = 10

    # Define column types
    subreddit = tf.feature_column.categorical_column_with_vocabulary_list('subreddit', ['news', 'ireland', 'pics'])

    comment_embeds = tf.feature_column.embedding_column(
        categorical_column = tf.feature_column.categorical_column_with_vocabulary_file(
            key='comment_words',
            vocabulary_file='{}/vocab.csv-00000-of-00001'.format(INPUT_DIR),
            vocabulary_size=100
            ),
        dimension = EMBEDDING_SIZE
        )

    # Sparse columns are wide, have a linear relationship with the output
    wide = [ subreddit ]

    # Continuous columns are deep, have a complex relationship with the output
    deep = [ comment_embeds ]

    return wide, deep
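
Finally, a minimal sketch of how these columns would plug into the canned estimator; model_dir and the hidden-unit sizes here are illustrative assumptions.

wide, deep = get_wide_deep()

estimator = tf.estimator.DNNLinearCombinedRegressor(
    model_dir='trained_model',       # illustrative output path
    linear_feature_columns=wide,     # wide/sparse side: subreddit
    dnn_feature_columns=deep,        # deep/dense side: comment embedding
    dnn_hidden_units=[64, 16])       # illustrative layer sizes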
