LSTM network on pre-trained word embeddings (gensim)
Problem description
I am new to deep learning. I am trying to build a very basic LSTM network on word-embedding features. I have written the following code for the model, but I am unable to run it.
from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Model

max_sequence_size = 14  # time steps (words) per sample
classes_num = 2

# The LSTM expects input of shape (batch, max_sequence_size, 300)
LSTM_word_1 = LSTM(100, activation='relu', recurrent_dropout=0.25, dropout=0.25)
lstm_word_input_1 = Input(shape=(max_sequence_size, 300))
lstm_word_out_1 = LSTM_word_1(lstm_word_input_1)
merged_feature_vectors = Dense(50, activation='sigmoid')(Dropout(0.2)(lstm_word_out_1))
predictions = Dense(classes_num, activation='softmax')(merged_feature_vectors)
my_model = Model(inputs=[lstm_word_input_1], outputs=predictions)
my_model.summary()
The error I am getting is ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (3019, 300). While searching, I found that people have used Flatten(), which compresses all the 2-D features (3019, 300) for the dense layer, but I am still unable to fix the issue.
While explaining, kindly let me know how the dimensions work out.
As requested:
My X_training had dimension issues, so I am providing the code below to clear up the confusion:
import numpy as np

def makeFeatureVec(words, model, num_features):
    # Average all of the word vectors in a given paragraph.
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    # index2word lists the words in the model's vocabulary
    # (renamed index_to_key in gensim 4). Convert it to a set, for speed
    index2word_set = set(model.wv.index2word)
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords += 1
            featureVec = np.add(featureVec, model.wv[word])
    # Divide by the number of words to get the average; skip the division
    # when no word was found, to avoid dividing by zero
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec
I think the following code returns a 2-D numpy array, since that is how I initialize it:
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for counter, review in enumerate(reviews):
        # Report progress every 1000 reviews
        if counter % 1000 == 0:
            print("Question %d of %d" % (counter, len(reviews)))
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
    return reviewFeatureVecs

def getCleanReviews(reviews):
    # Tokenize each question into a list of words, dropping stop words
    clean_reviews = []
    for review in reviews["question"]:
        clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, remove_stopwords=True))
    return clean_reviews
My objective is just to use a pre-trained gensim model with an LSTM on some comments that I have.
trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features )
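For readers following the dimensions, here is a small NumPy sketch of how the shapes work out. It uses random stand-in vectors rather than a real gensim model, and the sizes (5 comments, 14 words, 300 features) and variable names are assumptions for illustration only. Averaging over the words of each comment removes the sequence axis, which is why the result is 2-D while Input(shape=(max_sequence_size, 300)) expects one 300-dimensional vector per time step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_comments, max_sequence_size, num_features = 5, 14, 300  # illustrative sizes

# One 300-d vector per word per comment: the 3-D layout an LSTM input expects.
per_word = rng.normal(size=(n_comments, max_sequence_size, num_features)).astype("float32")

# Averaging over the word axis (axis=1) leaves one vector per comment: 2-D.
averaged = per_word.mean(axis=1)

print(per_word.shape)  # (5, 14, 300)
print(averaged.shape)  # (5, 300)
```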
Recommended answer
You should try using an Embedding layer before the LSTM layer. Also, since you have pre-trained vectors of 300 dimensions for your 3019 comments, you can initialize the weights of the embedding layer with that matrix.
from keras.layers import Input, Embedding, LSTM

inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[trainDataVecs])(inp_layer)
x = LSTM(50, dropout=0.1)(x)
Here, maxlen is the maximum length of your comments, max_features is the number of unique words (the vocabulary size) of your dataset, and embed_size is the dimensionality of your vectors, which is 300 in your case.
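As a rough illustration of where these numbers could come from, the sketch below derives maxlen and max_features from a few tokenized comments and turns them into the zero-padded integer matrix that an Input((maxlen,)) layer would receive. The sample comments, the word_index mapping, and the choice to reserve index 0 for padding (hence the +1) are assumptions made for this example, not part of the original answer.

```python
import numpy as np

# Hypothetical tokenized comments (stand-ins for the output of getCleanReviews).
comments = [["lstm", "network", "fails"], ["shape", "error"], ["lstm", "shape"]]

# Index 0 is reserved for padding, so real words start at 1.
vocab = sorted({word for comment in comments for word in comment})
word_index = {word: i + 1 for i, word in enumerate(vocab)}

maxlen = max(len(comment) for comment in comments)
max_features = len(word_index) + 1  # +1 for the padding index

# Integer matrix of shape (n_comments, maxlen), zero-padded on the right.
X = np.zeros((len(comments), maxlen), dtype="int32")
for row, comment in enumerate(comments):
    ids = [word_index[word] for word in comment]
    X[row, :len(ids)] = ids

print(maxlen, max_features, X.shape)  # 3 6 (3, 3)
```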
Note that the shape of trainDataVecs should be (max_features, embed_size), so if you have pre-trained word vectors loaded into trainDataVecs, this should work.
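One way such a matrix might be assembled is sketched below, with one row per vocabulary word rather than one row per comment (the averaged array in the question has shape (3019, 300), one row per comment). The vectors dictionary is a hypothetical stand-in for a lookup into a trained gensim model, embed_size is shrunk to 4 for readability, and index 0 is assumed to be the padding index. The last lines show the shape arithmetic: looking up each integer index yields a 3-D array, which is what the LSTM consumes.

```python
import numpy as np

embed_size = 4  # 300 in the question; kept small here for readability

# Hypothetical word -> vector lookup standing in for a trained gensim model.
rng = np.random.default_rng(0)
word_index = {"lstm": 1, "shape": 2, "error": 3}  # 0 is reserved for padding
vectors = {word: rng.normal(size=embed_size).astype("float32") for word in word_index}

max_features = len(word_index) + 1
embedding_matrix = np.zeros((max_features, embed_size), dtype="float32")
for word, idx in word_index.items():
    if word in vectors:  # words missing from the model keep an all-zero row
        embedding_matrix[idx] = vectors[word]

# An embedding layer is a row lookup:
# (n_comments, maxlen) -> (n_comments, maxlen, embed_size).
X = np.array([[1, 2, 0], [3, 0, 0]], dtype="int32")
embedded = embedding_matrix[X]

print(embedding_matrix.shape)  # (4, 4)
print(embedded.shape)          # (2, 3, 4)
```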