What to do when Seq2Seq network repeats words over and over in output?


Problem description

So, I've been working on a project for a while. We have very little data, and I know the results would get much better if we were able to put together a much larger dataset. That aside, my issue at the moment is that when I give the model a sentence as input, my outputs look like this:

contactid contactid contactid contactid

contactid contactid contactid contactid

A single word is focused on and repeated over and over again. What can I do to overcome this hurdle?

Things I have tried:

  1. Double-checked that I am adding the start/stop tokens and that the tokens sit correctly at the top of their vocab file; I am sharing the vocab (a small sanity-check sketch follows this list).
  2. I found some mentions that this can be caused by poor word embeddings. To check that I looked at TensorBoard, and sure enough the PCA showed a very dense cluster of points. Given that, I grabbed Facebook's publicly trained word vectors and loaded them as the embedding. Trained again, and this time the TensorBoard PCA showed a much better picture.
  3. Switched my training scheduler from the basic one to SampledScheduling, so that a training output is occasionally replaced with the ground truth.
  4. Switched my decoder to use the beam search decoder, figuring this might give more robust responses if the word choices are close together in the intermediate feature space.
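
Regarding point 1, here is a minimal sanity check for the shared vocab layout; it is a sketch, not from the original post. The token strings and the file path are assumptions, while the indices 0/1/2 mirror the UNK default and the src_eos_id=1 / tgt_eos_id=2 defaults hard-coded in the ModelInputs and decoder code below.

# Sanity-check sketch (token names '<unk>', '<s>', '</s>' and the path are assumptions).
SPECIAL_TOKENS = ['<unk>', '<s>', '</s>']   # expected at vocab lines 0, 1, 2

def check_vocab_head(path, specials=SPECIAL_TOKENS):
    with open(path) as f:
        head = [next(f).strip() for _ in range(len(specials))]
    if head != specials:
        raise ValueError('%s should start with %s but starts with %s'
                         % (path, specials, head))

check_vocab_head('./data_files/shared.vocab')  # hypothetical path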

For certain my perplexity is steadily decreasing.

Here is my dataset preparation code:

class ModelInputs(object):
"""Factory to construct various input hooks and functions depending on mode """

def __init__(
    self, vocab_files, batch_size,
    share_vocab=True, src_eos_id=1, tgt_eos_id=2
):
    self.batch_size = batch_size
    self.vocab_files = vocab_files
    self.share_vocab = share_vocab
    self.src_eos_id = src_eos_id
    self.tgt_eos_id = tgt_eos_id

def get_inputs(self, file_path, num_infer=None, mode=tf.estimator.ModeKeys.TRAIN):
    self.mode = mode
    if self.mode == tf.estimator.ModeKeys.TRAIN:
        return self._training_input_hook(file_path)
    if self.mode == tf.estimator.ModeKeys.EVAL:
        return self._validation_input_hook(file_path)
    if self.mode == tf.estimator.ModeKeys.PREDICT:
        if num_infer is None:
            raise ValueError('If performing inference must supply number of predictions to be made.')
        return self._infer_input_hook(file_path, num_infer)

def _prepare_data(self, dataset, out=False):
    prep_set = dataset.map(lambda string: tf.string_split([string]).values)
    prep_set = prep_set.map(lambda words: (words, tf.size(words)))
    if out == True:
        return prep_set.map(lambda words, size: (self.vocab_tables[1].lookup(words), size))
    return prep_set.map(lambda words, size: (self.vocab_tables[0].lookup(words), size))

def _batch_data(self, dataset, src_eos_id, tgt_eos_id):
    batched_set = dataset.padded_batch(
            self.batch_size,
            padded_shapes=((tf.TensorShape([None]), tf.TensorShape([])), (tf.TensorShape([None]), tf.TensorShape([]))),
            padding_values=((src_eos_id, 0), (tgt_eos_id, 0))
    )
    return batched_set

def _batch_infer_data(self, dataset, src_eos_id):
    batched_set = dataset.padded_batch(
        self.batch_size,
        padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])),
        padding_values=(src_eos_id, 0)
    )
    return batched_set

def _create_vocab_tables(self, vocab_files, share_vocab=False):
    if vocab_files[1] is None and share_vocab == False:
        raise ValueError('If share_vocab is set to false must provide target vocab. (src_vocab_file, \
                target_vocab_file)')

    src_vocab_table = lookup_ops.index_table_from_file(
        vocab_files[0],
        default_value=UNK_ID
    )

    if share_vocab:
        tgt_vocab_table = src_vocab_table
    else:
        tgt_vocab_table = lookup_ops.index_table_from_file(
            vocab_files[1],
            default_value=UNK_ID
        )

    return src_vocab_table, tgt_vocab_table

def _prepare_iterator_hook(self, hook, scope_name, iterator, file_path, name_placeholder):
    if self.mode == tf.estimator.ModeKeys.TRAIN or self.mode == tf.estimator.ModeKeys.EVAL:
        feed_dict = {
                name_placeholder[0]: file_path[0],
                name_placeholder[1]: file_path[1]
        }
    else:
        feed_dict = {name_placeholder: file_path}

    with tf.name_scope(scope_name):
        hook.iterator_initializer_func = \
                lambda sess: sess.run(
                    iterator.initializer,
                    feed_dict=feed_dict,
                )

def _set_up_train_or_eval(self, scope_name, file_path):
    hook = IteratorInitializerHook()
    def input_fn():
        with tf.name_scope(scope_name):
            with tf.name_scope('sentence_markers'):
                src_eos_id = tf.constant(self.src_eos_id, dtype=tf.int64)
                tgt_eos_id = tf.constant(self.tgt_eos_id, dtype=tf.int64)
            self.vocab_tables = self._create_vocab_tables(self.vocab_files, self.share_vocab)
            in_file = tf.placeholder(tf.string, shape=())
            in_dataset = self._prepare_data(tf.contrib.data.TextLineDataset(in_file).repeat(None))
            out_file = tf.placeholder(tf.string, shape=())
            out_dataset = self._prepare_data(tf.contrib.data.TextLineDataset(out_file).repeat(None))
            dataset = tf.contrib.data.Dataset.zip((in_dataset, out_dataset))
            dataset = self._batch_data(dataset, src_eos_id, tgt_eos_id)
            iterator = dataset.make_initializable_iterator()
            next_example, next_label = iterator.get_next()
            self._prepare_iterator_hook(hook, scope_name, iterator, file_path, (in_file, out_file))
            return next_example, next_label

    return (input_fn, hook)

def _training_input_hook(self, file_path):
    input_fn, hook = self._set_up_train_or_eval('train_inputs', file_path)

    return (input_fn, hook)

def _validation_input_hook(self, file_path):
    input_fn, hook = self._set_up_train_or_eval('eval_inputs', file_path)

    return (input_fn, hook)

def _infer_input_hook(self, file_path, num_infer):
    hook = IteratorInitializerHook()

    def input_fn():
        with tf.name_scope('infer_inputs'):
            with tf.name_scope('sentence_markers'):
                src_eos_id = tf.constant(self.src_eos_id, dtype=tf.int64)
            self.vocab_tables = self._create_vocab_tables(self.vocab_files, self.share_vocab)
            infer_file = tf.placeholder(tf.string, shape=())
            dataset = tf.contrib.data.TextLineDataset(infer_file)
            dataset = self._prepare_data(dataset)
            dataset = self._batch_infer_data(dataset, src_eos_id)
            iterator = dataset.make_initializable_iterator()
            next_example, seq_len = iterator.get_next()
            self._prepare_iterator_hook(hook, 'infer_inputs', iterator, file_path, infer_file)
            return ((next_example, seq_len), None)

    return (input_fn, hook)
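
For context, a minimal usage sketch (not from the original post) of how the (input_fn, hook) pairs returned above would typically be wired into a tf.estimator.Estimator. The file paths and my_model_fn are placeholders (a sketch of my_model_fn itself follows the model code further down), and IteratorInitializerHook is assumed to be the usual SessionRunHook that calls iterator_initializer_func once the session is created.

# Usage sketch; paths and my_model_fn are placeholders.
inputs = ModelInputs(('./data_files/shared.vocab', None), batch_size=32)

train_input_fn, train_hook = inputs.get_inputs(
    ('./data_files/train.src', './data_files/train.tgt'),
    mode=tf.estimator.ModeKeys.TRAIN)

estimator = tf.estimator.Estimator(model_fn=my_model_fn, model_dir='./model')
# The hook initializes the dataset iterator once the session exists.
estimator.train(input_fn=train_input_fn, hooks=[train_hook], steps=10000)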

Here is my model:

class Seq2Seq():

def __init__(
    self, batch_size, inputs,
    outputs, inp_vocab_size, tgt_vocab_size,
    embed_dim, mode, time_major=False,
    enc_embedding=None, dec_embedding=None, average_across_batch=True,
    average_across_timesteps=True, vocab_path=None, embedding_path='./data_files/wiki.simple.vec'
):
    embed_np = self._get_embedding(embedding_path)
    if not enc_embedding:
        self.enc_embedding = tf.contrib.layers.embed_sequence(
            inputs,
            inp_vocab_size,
            embed_dim,
            trainable=True,
            scope='embed',
            initializer=tf.constant_initializer(value=embed_np, dtype=tf.float32)
        )
    else:
        self.enc_embedding = enc_embedding
    if mode == tf.estimator.ModeKeys.TRAIN or mode == tf.estimator.ModeKeys.EVAL:
        if not dec_embedding:
            embed_outputs = tf.contrib.layers.embed_sequence(
                outputs,
                tgt_vocab_size,
                embed_dim,
                trainable=True,
                scope='embed',
                reuse=True
            )
            with tf.variable_scope('embed', reuse=True):
                dec_embedding = tf.get_variable('embeddings')
            self.embed_outputs = embed_outputs
            self.dec_embedding = dec_embedding

        else:
            self.dec_embedding = dec_embedding
    else:
        with tf.variable_scope('embed', reuse=True):
            self.dec_embedding = tf.get_variable('embeddings')

    if mode == tf.estimator.ModeKeys.PREDICT and vocab_path is None:
        raise ValueError('If mode is predict, must supply vocab_path')
    self.vocab_path = vocab_path
    self.inp_vocab_size = inp_vocab_size
    self.tgt_vocab_size = tgt_vocab_size
    self.average_across_batch = average_across_batch
    self.average_across_timesteps = average_across_timesteps
    self.time_major = time_major
    self.batch_size = batch_size
    self.mode = mode

def _get_embedding(self, embedding_path):
    model = KeyedVectors.load_word2vec_format(embedding_path)
    vocab = model.vocab
    vocab_len = len(vocab)
    return np.array([model.word_vec(k) for k in vocab.keys()])

def _get_lstm(self, num_units):
    return tf.nn.rnn_cell.BasicLSTMCell(num_units)

def encode(self, num_units, num_layers, seq_len, cell_fw=None, cell_bw=None):
    if cell_fw and cell_bw:
        fw_cell = cell_fw
        bw_cell = cell_bw
    else:
        fw_cell = self._get_lstm(num_units)
        bw_cell = self._get_lstm(num_units)
    encoder_outputs, bi_encoder_state = tf.nn.bidirectional_dynamic_rnn(
        fw_cell,
        bw_cell,
        self.enc_embedding,
        sequence_length=seq_len,
        time_major=self.time_major,
        dtype=tf.float32
    )
    c_state = tf.concat([bi_encoder_state[0].c, bi_encoder_state[1].c], axis=1)
    h_state = tf.concat([bi_encoder_state[0].h, bi_encoder_state[1].h], axis=1)
    encoder_state = tf.contrib.rnn.LSTMStateTuple(c=c_state, h=h_state)
    return tf.concat(encoder_outputs, -1), encoder_state

def _train_decoder(self, decoder_cell, out_seq_len, encoder_state, helper):
    if not helper:
        helper = tf.contrib.seq2seq.ScheduledEmbeddingTrainingHelper(
            self.embed_outputs,
            out_seq_len,
            self.dec_embedding,
            0.3,
        )
        # helper = tf.contrib.seq2seq.TrainingHelper(
        #     self.dec_embedding,
        #     out_seq_len,
        # )
    projection_layer = layers_core.Dense(self.tgt_vocab_size, use_bias=False)
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell,
        helper,
        encoder_state,
        output_layer=projection_layer
    )
    return decoder

def _predict_decoder(self, cell, encoder_state, beam_width, length_penalty_weight):
    tiled_encoder_state = tf.contrib.seq2seq.tile_batch(
        encoder_state, multiplier=beam_width
    )
    with tf.name_scope('sentence_markers'):
        sos_id = tf.constant(1, dtype=tf.int32)
        eos_id = tf.constant(2, dtype=tf.int32)
    start_tokens = tf.fill([self.batch_size], sos_id)
    end_token = eos_id
    projection_layer = layers_core.Dense(self.tgt_vocab_size, use_bias=False)
    emb = tf.squeeze(self.dec_embedding)
    decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=cell,
        embedding=self.dec_embedding,
        start_tokens=start_tokens,
        end_token=end_token,
        initial_state=tiled_encoder_state,
        beam_width=beam_width,
        output_layer=projection_layer,
        length_penalty_weight=length_penalty_weight
    )
    return decoder

def decode(
    self, num_units, out_seq_len,
    encoder_state, cell=None, helper=None,
    beam_width=None, length_penalty_weight=None
):
    with tf.name_scope('Decode'):
        if cell:
            decoder_cell = cell
        else:
            decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(2*num_units)
        if self.mode != estimator.ModeKeys.PREDICT:
            decoder = self._train_decoder(decoder_cell, out_seq_len, encoder_state, helper)
        else:
            decoder = self._predict_decoder(decoder_cell, encoder_state, beam_width, length_penalty_weight)
        outputs = tf.contrib.seq2seq.dynamic_decode(
            decoder,
            maximum_iterations=20,
            swap_memory=True,
        )
        outputs = outputs[0]
        if self.mode != estimator.ModeKeys.PREDICT:
            return outputs.rnn_output, outputs.sample_id
        else:
            return outputs.beam_search_decoder_output, outputs.predicted_ids

def prepare_predict(self, sample_id):
    rev_table = lookup_ops.index_to_string_table_from_file(
        self.vocab_path, default_value=UNK)
    predictions = rev_table.lookup(tf.to_int64(sample_id))
    return tf.estimator.EstimatorSpec(
        predictions=predictions,
        mode=tf.estimator.ModeKeys.PREDICT
    )

def prepare_train_eval(
    self, t_out,
    out_seq_len, labels, lr,
    train_op=None, loss=None
):
    if not loss:
        weights = tf.sequence_mask(
            out_seq_len,
            dtype=t_out.dtype
        )
        loss = tf.contrib.seq2seq.sequence_loss(
            t_out,
            labels,
            weights,
            average_across_batch=self.average_across_batch,
        )

    if not train_op:
        train_op = tf.contrib.layers.optimize_loss(
            loss,
            tf.train.get_global_step(),
            optimizer='SGD',
            learning_rate=lr,
            summaries=['loss', 'learning_rate']
        )

    return tf.estimator.EstimatorSpec(
        mode=self.mode,
        loss=loss,
        train_op=train_op,
    )
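
For completeness, a rough sketch (again not from the post, and only one of many ways to wire it) of a model_fn that ties ModelInputs and Seq2Seq together; everything in params is a placeholder, and the assumption that features/labels arrive as (ids, length) tuples follows the input pipeline above.

# model_fn sketch; hyper-parameters in params are placeholders.
def my_model_fn(features, labels, mode, params):
    inputs, in_len = features
    outputs, out_len = labels if labels is not None else (None, None)

    model = Seq2Seq(
        params['batch_size'], inputs, outputs,
        params['inp_vocab_size'], params['tgt_vocab_size'],
        params['embed_dim'], mode, vocab_path=params.get('vocab_path'))

    # Bidirectional encoder; the decoder cell is 2*num_units wide to match
    # the concatenated forward/backward state.
    _, encoder_state = model.encode(params['num_units'], 1, in_len)
    logits, sample_id = model.decode(
        params['num_units'], out_len, encoder_state,
        beam_width=params.get('beam_width', 10), length_penalty_weight=0.0)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return model.prepare_predict(sample_id)
    return model.prepare_train_eval(logits, out_len, outputs, params['lr'])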

Recommended answer

This kind of repetition is called "text degeneration".

There is a great paper from 2019 that analyses this phenomenon: The Curious Case of Neural Text Degeneration by Ari Holtzman et al. from the Allen Institute for Artificial Intelligence.

The repetition may come from the type of text search (text sampling) used on the decoder side. Many people implement this simply by taking the most probable next word proposed by the model (argmax over the softmax of the last layer) or by so-called beam search. In fact, beam search is the industry standard today.
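
To make the difference concrete, here is a toy sketch (numpy only, made-up probabilities) of greedy argmax decoding versus sampling from the predicted distribution at a single step:

import numpy as np

# Toy next-token distribution (made-up numbers, purely illustrative).
vocab = ['contactid', 'the', 'hello', '</s>']
probs = np.array([0.40, 0.30, 0.20, 0.10])

greedy_token = vocab[int(np.argmax(probs))]                   # always 'contactid'
sampled_token = vocab[np.random.choice(len(vocab), p=probs)]  # varies run to run
print(greedy_token, sampled_token)

Greedy search (and, to a lesser degree, beam search) always commits to the highest-probability continuation, which is what makes the degenerate loops shown below possible; sampling breaks the loop at the cost of some randomness.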

This is the beam search example from the paper:

Continuation (BeamSearch, b=10):

"The unicorns were able to communicate with each other, they said unicorns. A statement of the unicorns. Professor of the Department of Los Angeles, the most important place in the world, recognized by the whole world to become a world to become a world to become a world to become a world to become a world to become a world to become a world to become a world to become a world…"

As you can see, there is a lot of repetition.

According to the paper, this curious case may be explained by the fact that each repeated sequence of words has a higher probability than the sequence without the next repetition.
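
A tiny illustration of that argument with made-up numbers (not from the paper): if the probability the model assigns to "one more repetition" keeps growing and stays above every alternative continuation, a likelihood-maximising search will keep repeating.

# Hypothetical per-step probabilities, only to illustrate the feedback loop.
p_repeat_again = [0.55, 0.65, 0.75, 0.82]  # P(repeat the phrase once more)
p_best_alternative = 0.20                  # P(best non-repeating continuation)

for n, p in enumerate(p_repeat_again, 1):
    print('after %d repetitions: P(repeat)=%.2f -> beam keeps repeating: %s'
          % (n, p, p > p_best_alternative))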

The paper proposes some workarounds based on how the decoder samples words. It definitely requires more study, but this is the best explanation we have today.
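
One concrete family of such workarounds is truncated sampling; the paper itself proposes nucleus (top-p) sampling. Below is a minimal numpy sketch of the idea, not the paper's reference implementation.

import numpy as np

def nucleus_sample(probs, p=0.9, rng=np.random):
    # Keep the smallest set of tokens whose cumulative probability >= p,
    # renormalise, and sample only from that set.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

# Made-up distribution over a 5-token vocabulary.
print(nucleus_sample(np.array([0.5, 0.2, 0.15, 0.1, 0.05]), p=0.9))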

The other possibility is that your model still needs more training. In many cases I have seen similar behaviour when I had a big training set and the model still couldn't generalise well over the whole diversity of the data. To test this hypothesis, try training on a smaller dataset and see whether it generalises (produces meaningful results).
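
For that test, a quick sketch (file paths are placeholders) of carving out a small training subset by truncating the parallel text files before running the pipeline above:

# Take the first n lines of each parallel file (paths hypothetical).
def head(src, dst, n=5000):
    with open(src) as fin, open(dst, 'w') as fout:
        for i, line in enumerate(fin):
            if i >= n:
                break
            fout.write(line)

head('./data_files/train.src', './data_files/train_small.src')
head('./data_files/train.tgt', './data_files/train_small.tgt')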

But even if your model generalises well enough, that doesn't mean you will never see the repetition pattern. Unless you change the sampling pattern of the decoder, it is a common scenario.
