How to find the closest word to a vector using BERT


Question


I am trying to get the textual representation (or the closest word) of a given word embedding using BERT. Basically, I am trying to get functionality similar to gensim's:

>>> your_word_vector = array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
>>> model.most_similar(positive=[your_word_vector], topn=1)

So far, I have been able to generate contextual word embeddings using bert-as-service, but I can't figure out how to get the closest words to this embedding. I have used the pre-trained BERT model (uncased_L-12_H-768_A-12) and haven't done any fine-tuning.
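For context, this is roughly how I obtain the per-token embedding (a sketch of my setup, not exact code; it assumes the bert-serving server was started with -pooling_strategy NONE so that encode returns one vector per token, and the token index is illustrative):

# sketch: bert-as-service client side; server started with
#   bert-serving-start -model_dir uncased_L-12_H-768_A-12 -pooling_strategy NONE
from bert_serving.client import BertClient

bc = BertClient()
token_vectors = bc.encode(['It is an investment bank.'])[0]  # shape (max_seq_len, 768)
your_word_vector = token_vectors[5]  # illustrative position of the token of interest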

Solution

TL;DR

Following Jindtrich's answer, I implemented a context-aware nearest-neighbor searcher. The full code is available in my GitHub gist.

It takes a BERT-like model (I use bert-embeddings) and a corpus of sentences (I took a small one from here), processes each sentence, and stores the contextual token embeddings in an efficiently searchable data structure (I use KDTree, but feel free to choose FAISS, HNSW, or whatever).

Examples

The model is constructed as follows:

# preparing the model
storage = ContextNeighborStorage(sentences=all_sentences, model=bert)
storage.process_sentences()
storage.build_search_index()
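Here bert is any callable that maps a list of sentences to (tokens, embeddings) pairs, and all_sentences is a plain list of strings. A minimal sketch of that setup, assuming the bert-embedding package with its defaults and a placeholder corpus file:

# setup sketch; the package defaults and the corpus path are assumptions
from bert_embedding import BertEmbedding

bert = BertEmbedding()  # 12-layer uncased BERT by default

with open('corpus.txt', encoding='utf-8') as f:  # placeholder corpus file
    all_sentences = [line.strip() for line in f if line.strip()]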

Then it can be queried for contextually most similar words, like

# querying the model
distances, neighbors, contexts = storage.query(
    query_sent='It is a power bank.', query_word='bank', k=5)

In this example, the nearest neighbor would be the word "bank" in the sentence "Finally, there’s a second version of the Duo that incorporates a 2000mAH power bank, the Flip Power World.".

If, however, we look for the same word in another context, like

distances, neighbors, contexts = storage.query(
    query_sent='It is an investment bank.', query_word='bank', k=5)

then the nearest neighbor will be in the sentence "The bank also was awarded a 5-star, Superior Bauer rating for Dec. 31, 2017, financial data."

If we don't want to retrieve the word "bank" or its derivatives, we can filter them out:

distances, neighbors, contexts = storage.query(
     query_sent='It is an investment bank.', query_word='bank', k=5, filter_same_word=True)

and then the nearest neighbor will be the word "finance" in the sentence "Cahal is Vice Chairman of Deloitte UK and Chairman of the Advisory Corporate Finance business from 2014 (previously led the business from 2005).".

Application in NER

One of the cool applications of this approach is interpretable named entity recognition. We can fill the search index with IOB-labeled examples, and then use retrieved examples to infer the right label for the query word.

For example, the nearest neighbor of "Amazon" in "Bezos announced that its two-day delivery service, Amazon Prime, had surpassed 100 million subscribers worldwide." is found in "Expanded third-party integration including Amazon Alexa, Google Assistant, and IFTTT.".

But for "The Atlantic has sufficient wave and tidal energy to carry most of the Amazon's sediments out to sea, thus the river does not form a true delta" the nearest neighbor is "And, this year our stories are the work of traveling from Brazil’s Iguassu Falls to a chicken farm in Atlanta".

So if these neighbors were labeled, we could infer that in the first context "Amazon" is an ORGanization, but in the second one it is a LOCation.
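A minimal sketch of that inference step (iob_label_of is a hypothetical lookup that returns the IOB tag of a given token occurrence in a given corpus sentence; how you store those labels is up to you):

from collections import Counter

# majority-vote label inference over the retrieved neighbors (sketch);
# iob_label_of(sentence, token) is a hypothetical lookup into your labeled corpus
def infer_label(storage, iob_label_of, query_sent, query_word, k=5):
    distances, neighbors, contexts = storage.query(
        query_sent=query_sent, query_word=query_word, k=k)
    votes = Counter(iob_label_of(sent, tok) for tok, sent in zip(neighbors, contexts))
    return votes.most_common(1)[0][0]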

The code

Here is the class that does this work:

import numpy as np
from sklearn.neighbors import KDTree
from tqdm.auto import tqdm


class ContextNeighborStorage:
    def __init__(self, sentences, model):
        self.sentences = sentences
        self.model = model

    def process_sentences(self):
        # run the model over the whole corpus and flatten the per-token embeddings
        # into parallel lists (sentence id, token position, token, vector)
        result = self.model(self.sentences)

        self.sentence_ids = []
        self.token_ids = []
        self.all_tokens = []
        all_embeddings = []
        for i, (toks, embs) in enumerate(tqdm(result)):
            for j, (tok, emb) in enumerate(zip(toks, embs)):
                self.sentence_ids.append(i)
                self.token_ids.append(j)
                self.all_tokens.append(tok)
                all_embeddings.append(emb)
        all_embeddings = np.stack(all_embeddings)
        # we normalize embeddings so that Euclidean distance is equivalent to cosine distance
        self.normed_embeddings = (all_embeddings.T / (all_embeddings**2).sum(axis=1) ** 0.5).T

    def build_search_index(self):
        # this takes some time
        self.indexer = KDTree(self.normed_embeddings)

    def query(self, query_sent, query_word, k=10, filter_same_word=False):
        # embed the query sentence, locate the embedding of query_word in it,
        # and return the k nearest stored tokens with their distances and source sentences
        toks, embs = self.model([query_sent])[0]

        found = False
        for tok, emb in zip(toks, embs):
            if tok == query_word:
                found = True
                break
        if not found:
            raise ValueError('The query word {} is not a single token in sentence {}'.format(query_word, toks))
        emb = emb / sum(emb**2)**0.5

        if filter_same_word:
            initial_k = max(k, 100)
        else:
            initial_k = k
        di, idx = self.indexer.query(emb.reshape(1, -1), k=initial_k)
        distances = []
        neighbors = []
        contexts = []
        for i, index in enumerate(idx.ravel()):
            token = self.all_tokens[index]
            if filter_same_word and (query_word in token or token in query_word):
                continue
            distances.append(di.ravel()[i])
            neighbors.append(token)
            contexts.append(self.sentences[self.sentence_ids[index]])
            if len(distances) == k:
                break
        return distances, neighbors, contexts
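If the corpus is large, the KDTree can be swapped for FAISS, as noted above. A sketch of a flat (exact) FAISS index as an alternative to build_search_index, assuming the faiss-cpu package (IndexFlatL2 returns squared Euclidean distances, so the neighbor ordering matches KDTree on these normalized embeddings):

import numpy as np
import faiss  # assumption: pip install faiss-cpu

# drop-in alternative to build_search_index (sketch)
def build_faiss_index(normed_embeddings):
    dim = normed_embeddings.shape[1]  # 768 for uncased_L-12_H-768_A-12
    index = faiss.IndexFlatL2(dim)
    index.add(np.ascontiguousarray(normed_embeddings, dtype=np.float32))
    return index

# querying mirrors KDTree: di, idx = index.search(emb.reshape(1, -1).astype(np.float32), k)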
