BERT sentence embeddings: how to obtain sentence embeddings vector
Question
I'm using the module bert-for-tf2 to wrap a BERT model as a Keras layer in TensorFlow 2.0. I've followed your guide for implementing the BERT model as a Keras layer.
I'm trying to extract embeddings from a sentence; in my case, the sentence is "Hello".
I have a question about the output of the model prediction; I've written this model:
model_word_embedding = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,), dtype='int32', name='input_ids'),
    bert_layer
])
model_word_embedding.build(input_shape=(None, 4))
Then I want to extract the embeddings for the sentence written above:
sentences = ["Hello"]
predict = model_word_embedding.predict(sentences)
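Note that the Input layer above declares int32 ids of length 4, so the model presumably expects token ids rather than raw strings. A minimal sketch of that conversion follows, assuming a toy whitespace tokenizer and a made-up vocabulary (TOY_VOCAB and encode are hypothetical names for illustration; a real setup would use the WordPiece tokenizer and vocab file that come with the BERT checkpoint):

```python
# Hypothetical toy vocabulary; a real BERT checkpoint ships its own vocab file.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4}
MAX_SEQ_LEN = 4  # matches Input(shape=(4,)) in the model above


def encode(sentence, vocab=TOY_VOCAB, max_seq_len=MAX_SEQ_LEN):
    """Lowercase, split on whitespace, add [CLS]/[SEP], map to ids, pad."""
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    tokens = tokens[:max_seq_len]  # naive truncation (may drop [SEP])
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return ids + [vocab["[PAD]"]] * (max_seq_len - len(ids))


input_ids = [encode(s) for s in ["Hello"]]
print(input_ids)  # [[2, 4, 3, 0]]
```

With ids shaped like this, position 0 of every row is the [CLS] token, which is what the rest of the discussion relies on.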
The object predict contains 4 arrays of 768 elements each:
print(predict)
print(len(predict))
print(len(predict[0][0]))
...
[[[-0.02768866 -0.7341324 1.9084396 ... -0.65953904 0.26496622
1.1610721 ]
[-0.19322394 -1.3134469 0.10383344 ... 1.1250225 -0.2988368
-0.2323082 ]
[-1.4576151 -1.4579685 0.78580517 ... -0.8898649 -1.1016986
0.6008501 ]
[ 1.41647 -0.92478925 -1.3651332 ... -0.9197768 -1.5469263
0.03305872]]]
4
768
I know that each of those 4 arrays represents my original sentence, but I want to obtain a single array as the embedding of my original sentence. So, my question is: how can I obtain the embedding for a sentence?
In BERT source code I read this:
For classification tasks, the first vector (corresponding to [CLS]) is used as the "sentence vector." Note that this only makes sense because the entire model is fine-tuned.
So do I have to take the first array from the prediction output, since it represents my sentence vector?
Thanks for your support.
Recommended answer
We should use the [CLS] vector from the last hidden states as the sentence embedding from BERT. According to the BERT paper, [CLS] represents the encoded sentence, with dimension 768. The following figure shows the use of [CLS] in more detail. Suppose you have 2000 sentences:
# input_ids holds all sentences, padded to max_len.
last_hidden_states = model(input_ids)
# Keep only the [CLS] vector (position 0) of each sentence.
features = last_hidden_states[0][:, 0, :].numpy()
features.shape
# (2000, 768)
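To make the indexing concrete, here is a small sketch that uses a random NumPy array as a stand-in for the encoder output (the values are not real BERT activations, and the names last_hidden_state, single and sentence_vector are chosen here for illustration). It assumes the usual (sentences, token positions, hidden size) layout, with [CLS] at position 0:

```python
import numpy as np

# Stand-in for the encoder output: random values, NOT real BERT activations.
rng = np.random.default_rng(0)
num_sentences, max_len, hidden_size = 2000, 4, 768
last_hidden_state = rng.standard_normal(
    (num_sentences, max_len, hidden_size), dtype=np.float32
)

# Axis 0: sentence, axis 1: token position, axis 2: hidden unit.
# Position 0 holds [CLS], so keep it for every sentence.
features = last_hidden_state[:, 0, :]
print(features.shape)  # (2000, 768)

# Same idea for a single sentence of shape (1, 4, 768), as in the question.
single = last_hidden_state[:1]
sentence_vector = single[0, 0]
print(sentence_vector.shape)  # (768,)
```

Under this layout, the 4 arrays in the question's output would be the 4 token positions of one sentence, and the first of them is the [CLS] vector. As the quoted BERT source comment warns, that vector is most meaningful as a sentence representation once the model has been fine-tuned.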