reverse word embeddings in keras - python

Problem description

I am trying to make a chatbot in keras. I am assigning every word in the vocabulary its own ID. One training sample looks like this:

[0 0 0 0 0 0 32 328 2839 13 192 1 ] -> [23 3289 328 2318 12 0 0 0 0 0 0 0]

Then I am using the Embedding layer in Keras to embed these IDs into vectors of size 32. Then I'm using LSTM layers as the hidden layers. The problem is that my output is a list of embedded IDs, like this:

[ 0.16102183 0.1238187 0.1159694 0.13688719 0.12964118 0.12848872 0.13515817 0.13582146 0.16919741 0.15453722 ... ]

How can I convert these embeddings back to the words in my original vocabulary?
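
A common way to map a vector that lives in the embedding space back to a word is a nearest-neighbour search over the Embedding layer's learned weight matrix. The sketch below only illustrates that idea and is not from the original post: nearest_word, model and vocabulary are assumed names, model.layers[0] is taken to be the Embedding layer, and the vector being decoded is assumed to have the embedding dimensionality (32 here).

import numpy as np

def nearest_word(output_vec, model, vocabulary):
    # weight matrix of shape (vocab_size, embedding_dim) learned by the Embedding layer
    embeddings = model.layers[0].get_weights()[0]
    # cosine similarity between output_vec and every row of the embedding matrix
    sims = embeddings.dot(output_vec) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(output_vec) + 1e-8)
    # the row with the highest similarity is the closest vocabulary entry
    best_id = int(np.argmax(sims))
    id_to_word = {i: w for w, i in vocabulary.items()}
    return id_to_word.get(best_id, '<UNK>')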

Here is my code:

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential, load_model
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

import os

import numpy as np
import cPickle as pickle


class Chatbot(object):

    def __init__(self, h_layers=1):
        # self.name = name
        self.h_layers = h_layers
        self.seq2seq = None
        self.max_length = 0
        self.vocabulary = {}

    @staticmethod
    def load(model_name):
        with open('models/{}/chatbot_object.pkl'.format(model_name), 'rb') as pickle_file:
            obj = pickle.load(pickle_file)
        obj.seq2seq = load_model('models/{}/seq2seq.h5'.format(model_name))
        return obj

    def train(self, x_train, y_train):
        count_vect = CountVectorizer()
        count_vect.fit(x_train)
        count_vect.fit(y_train)

        self.vocabulary = count_vect.vocabulary_
        self.vocabulary.update({'<START>': len(self.vocabulary),
                                '<END>': len(self.vocabulary) + 1,
                                '<PAD>': len(self.vocabulary) + 2,
                                '<UNK>': len(self.vocabulary) + 3})

        for i in range(len(x_train)):
            x_train[i] = ['<START>'] + [w.lower() for w in word_tokenize(x_train[i])] + ['<END>']
        for i in range(len(y_train)):
            y_train[i] = ['<START>'] + [w.lower() for w in word_tokenize(y_train[i])] + ['<END>']

        for sample in x_train:
            if len(sample) > self.max_length:
                self.max_length = len(sample)
        for sample in y_train:
            if len(sample) > self.max_length:
                self.max_length = len(sample)

        for i in range(len(x_train)):
            x_train[i] = [self.vocabulary[w] for w in x_train[i] if w in self.vocabulary]
        for i in range(len(y_train)):
            y_train[i] = [self.vocabulary[w] for w in y_train[i] if w in self.vocabulary]

        x_train = sequence.pad_sequences(x_train, maxlen=self.max_length, value=self.vocabulary['<PAD>'])
        y_train = sequence.pad_sequences(y_train, maxlen=self.max_length, padding='post',
                                         value=self.vocabulary['<PAD>'])

        x_train = np.asarray(x_train)
        y_train = np.asarray(y_train)

        embedding_vector_length = 32

        self.seq2seq = Sequential()
        self.seq2seq.add(Embedding(len(self.vocabulary), embedding_vector_length, input_length=self.max_length))

        for _ in range(self.h_layers):
            self.seq2seq.add(LSTM(self.max_length, return_sequences=True))

        self.seq2seq.add(LSTM(self.max_length))
        self.seq2seq.compile(loss='cosine_proximity', optimizer='adam', metrics=['accuracy'])
        self.seq2seq.fit(x_train[:100], y_train[:100], epochs=5, batch_size=32)

    def save(self, filename):
        if filename not in os.listdir('models'):
            os.system('mkdir models/{}'.format(filename))
        self.seq2seq.save('models/{}/seq2seq.h5'.format(filename))
        self.seq2seq = None
        with open('models/{}/chatbot_object.pkl'.format(filename), 'wb') as pickle_file:
            pickle.dump(self, pickle_file)

    def respond(self, text):
        tokens = ['<START>'] + [w.lower() for w in word_tokenize(text)] + ['<END>']
        for i in range(len(tokens)):
            if tokens[i] in self.vocabulary:
                tokens[i] = self.vocabulary[tokens[i]]
            else:
                tokens[i] = self.vocabulary['<PAD>']
        x = sequence.pad_sequences([tokens], maxlen=self.max_length, value=self.vocabulary['<PAD>'])
        prediction = self.seq2seq.predict(x, batch_size=1)
        return prediction[0]

Answer


I could not find the answer to this either, so I wrote a lookup function.

def lookup(tokenizer, vec, returnIntNotWord=True):
    # (word, index) pairs sorted by index, so position i holds the word with index i + 1
    twordkey = [(k, tokenizer.word_index[k]) for k in sorted(tokenizer.word_index, key=tokenizer.word_index.get)]
    oneHotVec = []  # captures the indices of the words
    engVec = []  # holds (word, index) pairs; only returned when returnIntNotWord is False
    for eachRow, _ in enumerate(vec):
        for index, _ in enumerate(vec[0]):
            if vec[eachRow][index] == 1:
                oneHotVec.append(index)
    for index in oneHotVec:
        engVec.append(twordkey[index])
    if returnIntNotWord:
        return oneHotVec
    else:
        return engVec

tokenizer is the Keras Tokenizer.
vec is a 2D array of one-hot encoded labels.
returnIntNotWord is explained in the comments above.
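
A hypothetical usage sketch of lookup (the tokenizer, texts and label matrix below are placeholders, not from the original answer; the label columns are built so that column j corresponds to the word with index j + 1, matching how lookup indexes twordkey):

from keras.preprocessing.text import Tokenizer
import numpy as np

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['hello hello hello world world keras'])
# word_index is {'hello': 1, 'world': 2, 'keras': 3}

labels = np.zeros((2, len(tokenizer.word_index)))
labels[0, tokenizer.word_index['hello'] - 1] = 1  # row 0 encodes 'hello'
labels[1, tokenizer.word_index['keras'] - 1] = 1  # row 1 encodes 'keras'

print(lookup(tokenizer, labels))                          # [0, 2]
print(lookup(tokenizer, labels, returnIntNotWord=False))  # [('hello', 1), ('keras', 3)]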
