keras understanding Word Embedding Layer


Question

From the page I got the code below:

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

  1. I looked at encoded_docs and noticed that the words done and work both have a one_hot encoding of 2. Why is that? Is it because "unicity of word to index mapping [is] non-guaranteed", as that page states?
  2. I got the embeddings with embeddings = model.layers[0].get_weights()[0]. Why do we get an embedding matrix of size 50 in that case? Even though two words share the same one_hot number, do they have different embeddings?
  3. How can I tell which embedding belongs to which word, i.e. done vs. work?
  4. I also found the code below on that page, which could help with finding the embedding of each word, but I don't know how to create word_to_index.

word_to_index is a mapping (i.e. a dict) from words to their indices, e.g. love: 69:

words_embeddings = {w: embeddings[idx] for w, idx in word_to_index.items()}

Please confirm that my understanding of the Param # counts below is correct.

The first layer has 400 parameters because the vocabulary size is 50 and the embedding has 8 dimensions, so 50 * 8 = 400.

The last layer has 33 parameters because each sentence has at most 4 words, so the flattened input has 4 * 8 = 32 embedding values, plus 1 for the bias: 33 in total.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten_3 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
=================================================================
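A quick sanity check of those Param # counts (a minimal sketch; the numbers come straight from the layer sizes above):

vocab_size = 50      # rows in the embedding matrix
embedding_dim = 8    # columns in the embedding matrix
max_length = 4       # words per (padded) document

embedding_params = vocab_size * embedding_dim        # 50 * 8 = 400
dense_params = max_length * embedding_dim * 1 + 1    # 32 weights + 1 bias = 33
print(embedding_params, dense_params)                # 400 33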

  5. Finally, if my point 1 above is correct, is there a better way to build the embedding layer model.add(Embedding(vocab_size, 8, input_length=max_length)) without doing the one-hot coding encoded_docs = [one_hot(d, vocab_size) for d in docs]?

++++++++++++++++++++++++++++++++++ update - providing the updated code

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])


from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

#this creates the dictionary
#IMPORTANT: MUST HAVE ALL DATA - including Test data
#IMPORTANT2: This method should be called only once!!!
tokenizer.fit_on_texts(docs)

#this transforms the texts in to sequences of indices
encoded_docs2 = tokenizer.texts_to_sequences(docs)

encoded_docs2

max_length = 4
padded_docs2 = pad_sequences(encoded_docs2, maxlen=max_length, padding='post')
max_index = array(padded_docs2).reshape((-1,)).max()



# define the model
model = Sequential()
model.add(Embedding(max_index+1, 8, input_length=max_length))# you cannot use just max_index 
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs2, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs2, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

embeddings = model.layers[0].get_weights()[0]  # learned embedding matrix, shape (max_index + 1, 8)

embedding_for_word_14 = embeddings[14]         # row 14 = embedding of the word with index 14
index = tokenizer.texts_to_sequences([['well']])[0][0]  # integer index assigned to 'well'
tokenizer.document_count                       # number of documents the tokenizer was fit on
tokenizer.word_index                           # dict mapping each word to its integer index

Answer

1 - Yes, word unicity is not guaranteed; see the docs:

  • From one_hot: "This is a wrapper to the hashing_trick function..."
  • From hashing_trick: "Two or more words may be assigned to the same index, due to possible collisions by the hashing function. The probability of a collision is in relation to the dimension of the hashing space and the number of distinct objects."
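
A minimal sketch of how you could see this for yourself (the exact integers depend on the Python hash seed, so they can differ between runs; in the question's run both words happened to map to 2):

from keras.preprocessing.text import one_hot

vocab_size = 50
# one_hot hashes each word into [1, vocab_size); distinct words can land on the same index
print(one_hot('done', vocab_size))
print(one_hot('work', vocab_size))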

It would be better to use a Tokenizer for this (see question 4).

It is very important to remember that you should include all words at once when creating the indices. You cannot call the function to create a dictionary with 2 words, then again with 2 more words, and so on; that would produce a very wrong dictionary.
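
A minimal sketch of that rule, assuming a hypothetical train/test split (the point is a single fit_on_texts call covering every text the tokenizer will ever have to encode):

from keras.preprocessing.text import Tokenizer

# hypothetical split, for illustration only
train_docs = ['Well done!', 'Good work']
test_docs = ['poor work']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_docs + test_docs)  # one call -> one consistent word -> index mapping

# both splits are then encoded with the same mapping
train_seqs = tokenizer.texts_to_sequences(train_docs)
test_seqs = tokenizer.texts_to_sequences(test_docs)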

2 - The embeddings have size 50 x 8 because that is what was defined in the embedding layer:

Embedding(vocab_size, 8, input_length=max_length)
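
Regarding question 4: with the Tokenizer-based code from the update, the word_to_index mapping already exists as tokenizer.word_index, so the per-word embeddings can be collected like this (a sketch built on the updated code above; index 0 is reserved for padding, which is why the embedding matrix has max_index + 1 rows):

embeddings = model.layers[0].get_weights()[0]   # shape (max_index + 1, 8)

word_to_index = tokenizer.word_index            # maps each word to its integer index (starting at 1)
words_embeddings = {w: embeddings[idx] for w, idx in word_to_index.items()}

print(words_embeddings['done'])
print(words_embeddings['work'])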
