tf.SequenceExample with multidimensional arrays


Problem description

In TensorFlow, I want to save a multidimensional array to a TFRecord. For example:

[[1, 2, 3], [1, 2], [3, 2, 1]]

As the task I am trying to solve is sequential, I am trying to use TensorFlow's tf.train.SequenceExample(), and when writing the data I am successful in writing it to a TFRecord file. However, when I try to load the data from the TFRecord file using tf.parse_single_sequence_example, I am greeted with a large number of cryptic errors:

W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Name: , Key: input_characters, Index: 1.  Number of int64 values != expected.  values size: 6 but output shape: []
E tensorflow/core/client/tensor_c_api.cc:485] Name: , Key: input_characters, Index: 1.  Number of int64 values != expected.  values size: 6 but output shape: []

The function I am using to try to load my data is below:

import tensorflow as tf

def read_and_decode_single_example(filename):

    filename_queue = tf.train.string_input_producer([filename],
                                                    num_epochs=None)

    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    context_features = {
        "length": tf.FixedLenFeature([], dtype=tf.int64)
    }

    sequence_features = {
        "input_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64),
        "output_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }

    context_parsed, sequence_parsed = tf.parse_single_sequence_example(
        serialized=serialized_example,
        context_features=context_features,
        sequence_features=sequence_features
    )

    context = tf.contrib.learn.run_n(context_parsed, n=1, feed_dict=None)
    print(context)

The function that I am using to save the data is here:

# http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/
from itertools import izip_longest  # Python 2; on Python 3 use itertools.zip_longest

def make_example(input_sequence, output_sequence):
    """
    Makes a single example from Python lists that follows the
    format of tf.train.SequenceExample.
    """

    example_sequence = tf.train.SequenceExample()

    # Total length across all words.
    sequence_length = sum(len(word) for word in input_sequence)
    example_sequence.context.feature["length"].int64_list.value.append(sequence_length)

    input_characters = example_sequence.feature_lists.feature_list["input_characters"]
    output_characters = example_sequence.feature_lists.feature_list["output_characters"]

    for input_character, output_character in izip_longest(input_sequence,
                                                          output_sequence):

        # extend() takes a whole list at once, therefore it replaces append().
        if input_character is not None:
            input_characters.feature.add().int64_list.value.extend(input_character)

        if output_character is not None:
            output_characters.feature.add().int64_list.value.extend(output_character)

    return example_sequence

Any help would be welcomed.

Recommended answer

I had the same problem. I think that it is entirely solvable, but you have to decide on the output format and then figure out how you're going to use it.

First: what is your error?

The error message is telling you that what you are trying to read doesn't fit into the feature size that you specified. So where did you specify it? Right here:

sequence_features = {
    "input_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64),
    "output_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64)
}

This says "my input_characters is a sequence of single values", but this is not true; what you have is a sequence of sequences of single values, hence the error.

Second: what can you do?

If you instead use:

a = [[1,2,3], [2,3,1], [3,2,1]] 
sequence_features = {
    "input_characters": tf.FixedLenSequenceFeature([3], dtype=tf.int64),
    "output_characters": tf.FixedLenSequenceFeature([3], dtype=tf.int64)
}

you will not have an error with your code, because you have specified that each element of the top-level sequence is 3 elements long.
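A quick way to catch this before writing, independent of TensorFlow, is to check that every row of the nested list really has the width you declared (a plain-Python sketch; `rows_fixed_length` is a hypothetical helper, not part of any library):

```python
def rows_fixed_length(sequence, width):
    # True only if every row has exactly `width` elements, which is what
    # FixedLenSequenceFeature([width]) requires at parse time.
    return all(len(row) == width for row in sequence)

print(rows_fixed_length([[1, 2, 3], [2, 3, 1], [3, 2, 1]], 3))  # True
print(rows_fixed_length([[1, 2, 3], [1, 2], [3, 2, 1]], 3))     # False
```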

Alternatively, if you do not have fixed-length sequences, then you're going to have to use a different type of feature:

sequence_features = {
    "input_characters": tf.VarLenFeature(tf.int64),
    "output_characters": tf.VarLenFeature(tf.int64)
}

The VarLenFeature tells the parser that the length is unknown before reading. Unfortunately, this means that your input_characters can no longer be read as a dense vector in one step. Instead, it will be a SparseTensor by default. You can turn this into a dense tensor with tf.sparse_tensor_to_dense, e.g.:

input_densified = tf.sparse_tensor_to_dense(sequence_parsed['input_characters'])
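Conceptually, what that conversion does can be sketched in plain Python (a toy illustration, not the TensorFlow implementation): each stored value is scattered to its (row, column) index, and every unfilled cell gets the default value.

```python
def sparse_to_dense(indices, values, dense_shape, default_value=0):
    # Build a dense grid filled with default_value, then place each stored
    # value at its (row, col) position.
    rows, cols = dense_shape
    dense = [[default_value] * cols for _ in range(rows)]
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense

# The ragged input [[1,2,3], [2,3], [3,2,1]] stored sparsely:
dense = sparse_to_dense(
    indices=[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)],
    values=[1, 2, 3, 2, 3, 3, 2, 1],
    dense_shape=(3, 3),
)
# dense == [[1, 2, 3], [2, 3, 0], [3, 2, 1]]
```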

As mentioned in the article that you've been looking at, if your data does not always have the same length, you will have to have a "not_really_a_word" word in your vocabulary, which you use as the default index. E.g., let's say you have index 0 mapping to the "not_really_a_word" word; then using your

a = [[1,2,3],  [2,3],  [3,2,1]]

Python list will end up being a

array((1,2,3),  (2,3,0),  (3,2,1))

tensor.
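The padding step itself is easy to sketch in plain Python (`pad_rows` is a hypothetical helper; index 0 is assumed to be the "not_really_a_word" entry):

```python
def pad_rows(rows, pad_index=0):
    # Pad every row with pad_index up to the length of the longest row,
    # so the ragged list becomes rectangular.
    width = max(len(row) for row in rows)
    return [row + [pad_index] * (width - len(row)) for row in rows]

print(pad_rows([[1, 2, 3], [2, 3], [3, 2, 1]]))
# [[1, 2, 3], [2, 3, 0], [3, 2, 1]]
```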

Be warned: I'm not certain that back-propagation "just works" for SparseTensors like it does for dense tensors. The wildml article talks about padding each sequence with 0s and masking the loss for the "not_actually_a_word" word (see "SIDE NOTE: BE CAREFUL WITH 0'S IN YOUR VOCABULARY/CLASSES" in their article). This seems to suggest that the first method will be easier to implement.

Note that this is different from the case described here, where each example is a sequence of sequences. To my understanding, the reason this kind of method is not well supported is that it is an abuse of the case it is meant to support: loading fixed-size embeddings directly.

I will assume that the very next thing you want to do is turn those numbers into word embeddings. You can turn a list of indices into a list of embeddings with tf.nn.embedding_lookup.
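tf.nn.embedding_lookup is essentially a row gather: each index selects one row of the embedding matrix. A plain-Python sketch of the idea (the matrix values below are made up for illustration):

```python
# Toy embedding matrix: row i is the 2-dimensional embedding of vocabulary index i.
embeddings = [
    [0.0, 0.0],  # index 0: the "not_really_a_word" padding entry
    [0.1, 0.2],  # index 1
    [0.3, 0.4],  # index 2
    [0.5, 0.6],  # index 3
]

def lookup(table, indices):
    # Gather rows by index, preserving the nested structure of `indices`.
    return [[table[i] for i in row] for row in indices]

vectors = lookup(embeddings, [[1, 2, 3], [2, 3, 0], [3, 2, 1]])
# vectors[1][2] is the padding embedding [0.0, 0.0]
```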
