Any way to extract the exhaustive vocabulary of the google universal sentence encoder large?

Question

I have some sentences for which I am creating an embedding and it works great for similarity searching unless there are some truly unusual words in the sentence.

In that case, the truly unusual words actually carry the most similarity information of any words in the sentence, but all of that information is lost during embedding because the word is apparently not in the model's vocabulary.

I'd like to get a list of all of the words known by the GUSE embedding model so that I can mask those known words out of my sentence, leaving only the "novel" words.
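For concreteness, a minimal sketch of that masking step, assuming the vocabulary has already been extracted into a Python set (novel_words is a hypothetical helper, and its naive lowercasing and punctuation stripping only approximate the model's real preprocessing):

def novel_words(sentence, vocabulary):
    # Strip surrounding punctuation before the lookup; GUSE's own
    # tokenizer is more involved, so treat this as an approximation.
    tokens = (tok.strip('!?.,;:"()') for tok in sentence.lower().split())
    return [tok for tok in tokens if tok and tok not in vocabulary]

# e.g. novel_words("I love to use Xapian!", vocabulary) would return
# ['xapian'] if "xapian" is not in the vocabulary.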

I can then do an exact word search for those novel words in my target corpus and keep my similar-sentence searching usable.

e.g. "I love to use Xapian!" gets embedded as "I love to use UNK".

If I just do a keyword search for "Xapian" instead of a semantic similarity search, I'll get much more relevant results than I would using GUSE and vector KNN.

Any ideas on how I can extract the vocabulary known/used by GUSE?

Answer

I'm assuming you have tensorflow & tensorflow_hub installed, and you have already downloaded the model.

IMPORTANT: I'm assuming you're looking at https://tfhub.dev/google/universal-sentence-encoder/4! There's no guarantee the object graph looks the same for different versions, it's likely that modifications will be needed.

Find its location on disk - it's somewhere under /tmp/tfhub_modules unless you set the TFHUB_CACHE_DIR environment variable (Windows/Mac have different locations). The path should contain a file called saved_model.pb, which is the model, serialized using Protocol Buffers.
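If you'd rather resolve that path programmatically, tensorflow_hub can do it for you (a sketch; hub.resolve is available in recent tensorflow_hub releases and downloads the module first if it isn't cached yet):

import tensorflow_hub as hub

# Resolve the module handle to its local cache directory.
model_path = hub.resolve("https://tfhub.dev/google/universal-sentence-encoder/4")
print(model_path)  # e.g. /tmp/tfhub_modules/<hash>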

Unfortunately, the dictionary is serialized inside the model's Protocol Buffers file and not as an external asset, so we'll have to load the model and get the variable from it.

The strategy is to use tensorflow's code to deserialize the file, and then travel down the serialized object tree all the way to the dictionary.

import importlib

MODEL_PATH = 'path/to/model/dir' # e.g. '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'

# Use the tensorflow internal Protobuf loader. A regular import statement will fail.
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')

saved_model = loader_impl.parse_saved_model(MODEL_PATH)

# reach into the object graph to get the tensor
graph = saved_model.meta_graphs[0].graph_def
function = graph.library.function
# function[5] is the text preprocessor's hash table; its node_def holds
# exactly two nodes, the second of which carries the vocabulary tensor
node_type, node_value = function[5].node_def
# if you print(node_type) you'll see it's called "text_preprocessor/hash_table"
# as well as get insight into this branch of the object graph we're looking at
words_tensor = node_value.attr.get("value").tensor

word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # -> 400004
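As a quick sanity check (a sketch; whichever entries happen to sit at the start of the tensor will be printed):

print(word_list[:10])          # peek at the first few entries
vocabulary = set(word_list)    # O(1) membership tests for masking
print('xapian' in vocabulary)  # presumably False, per the question's example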

Some useful resources:

  1. A GitHub issue about changing the vocabulary
  2. A Tensorflow Google Group thread linked from the issue

Extra Notes

Despite what the GitHub issue may lead you to think, the 400k words here are not the GloVe 400k vocabulary. You can verify this by downloading the GloVe 6B embeddings (file link), extracting glove.6B.50d.txt, and then using the following code to compare the two dictionaries:

with open('/path/to/glove.6B.50d.txt') as f:
    # the first whitespace-separated token on each line is the word itself
    glove_vocabulary = set(line.strip().split(maxsplit=1)[0] for line in f)

USE_vocabulary = set(word_list) # from above

print(len(USE_vocabulary - glove_vocabulary)) # -> 281150
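The same set arithmetic gives the other directions of the comparison, if you're curious:

print(len(USE_vocabulary & glove_vocabulary))  # words the two vocabularies share
print(len(glove_vocabulary - USE_vocabulary))  # GloVe-only words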

Inspecting the different vocabularies is interesting in and of itself, e.g. why does GloVe have an entry for '287.9'?
