Proper way to add new vectors for OOV words


Problem description


I'm using some domain-specific language which has a lot of OOV words as well as some typos. I have noticed Spacy will just assign an all-zero vector to these OOV words, so I'm wondering what's the proper way to handle this. I'd appreciate clarification on all of these points if possible:

  1. What exactly does the pre-train command do? Honestly, I cannot seem to parse the explanation from the website correctly:

Pre-train the "token to vector" (tok2vec) layer of pipeline components, using an approximate language-modeling objective. Specifically, we load pretrained vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which match the pretrained ones

Isn't the tok2vec the part that generates the vectors? So shouldn't this command then change the produced vectors? What does it mean loading pretrained vectors and then train a component to predict these vectors? What's the purpose of doing this?

What does the --use-vectors flag do? What does the --init-tok2vec flag do? Is this included by mistake in the documentation?

  2. It seems pretrain is not what I'm looking for, as it doesn't change the vectors for a given word. What would be the easiest way to generate a new set of vectors that includes my OOV words but still contains the general knowledge of the language?

  3. As far as I can see, Spacy's pretrained models use fastText vectors. The fastText website mentions:

A nice feature is that you can also query for words that did not appear in your data! Indeed words are represented by the sum of its substrings. As long as the unknown word is made of known substrings, there is a representation of it!

But it seems Spacy does not use this feature. Is there a way to still make use of this for OOV words?

Thanks a lot

Solution

I think there is some confusion about the different components - I'll try to clarify:

  1. The tokenizer does not produce vectors. It's just a component that segments texts into tokens. In spaCy, it's rule-based and not trainable, and doesn't have anything to do with vectors. It looks at whitespace and punctuation to determine which are the unique tokens in a sentence.
  2. An nlp model in spaCy can have predefined (static) word vectors that are accessible on the Token level. Every token with the same Lexeme gets the same vector. Some tokens/lexemes may indeed be OOV, like misspellings. If you want to redefine/extend all vectors used in a model, you can use something like init-model.
  3. The tok2vec layer is a machine learning component that learns how to produce suitable (dynamic) vectors for tokens. It does this by looking at lexical attributes of the token, but may also include the static vectors of the token (cf item 2). This component is generally not used by itself, but is part of another component, such as an NER. It will be the first layer of the NER model, and it can be trained as part of training the NER, to produce vectors that are suitable for your NER task.
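To make item 2 concrete: static vectors behave like a fixed lookup table, where every occurrence of the same lexeme gets the same row and anything missing from the table falls back to zeros. A minimal pure-Python sketch (no spaCy; the table and words are made up for illustration):

```python
# A toy stand-in for a model's static vector table (item 2).
# Real spaCy tables are large and loaded from disk; this is illustrative only.
STATIC_VECTORS = {
    "cat": [0.1, 0.3, -0.2],
    "dog": [0.2, 0.1, -0.1],
}
DIM = 3

def static_vector(word):
    # OOV lexemes get an all-zero vector, which is exactly the
    # behaviour the question observes in spaCy.
    return STATIC_VECTORS.get(word, [0.0] * DIM)

print(static_vector("cat"))   # a real row from the table
print(static_vector("catt"))  # a misspelling: all zeros
```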

In spaCy v2, you can first train a tok2vec component with pretrain, and then use this component for a subsequent train command. Note that all settings need to be the same across both commands, for the layers to be compatible.
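In spaCy v2, that two-step workflow looks roughly like the following (the corpus paths, output directories, and the `en_vectors_web_lg` vectors package are placeholders; `model999.bin` stands for whichever checkpoint pretrain wrote last):

```shell
# Step 1: pretrain a tok2vec layer on raw text, using an existing
# vectors model as the prediction target.
python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrain_out

# Step 2: train e.g. an NER pipeline, initialising its tok2vec layer
# from the pretrained weights; settings must match step 1.
python -m spacy train en ./ner_model train.json dev.json \
    --pipeline ner \
    --vectors en_vectors_web_lg \
    --init-tok2vec ./pretrain_out/model999.bin
```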

To answer your questions:

Isn't the tok2vec the part that generates the vectors?

If you mean the static vectors, then no. The tok2vec component produces new vectors (possibly with a different dimension) on top of the static vectors, but it won't change the static ones.

What does it mean loading pretrained vectors and then train a component to predict these vectors? What's the purpose of doing this?

The purpose is to get a tok2vec component that is already pretrained from external vector data. The external vector data already embeds some "meaning" or "similarity" of the tokens, and this is, so to say, transferred into the tok2vec component, which learns to produce the same similarities. The point is that this new tok2vec component can then be used and further fine-tuned in the subsequent train command (cf item 3).
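As a toy illustration of that objective (pure NumPy, not spaCy's actual architecture): treat the pretrained vectors as regression targets and fit a component to predict them from token features. Here a single linear map stands in for the CNN spaCy actually pretrains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend pretrained vectors (e.g. fastText-style) for 5 tokens, 4-d:
# these are the targets the component must learn to predict.
pretrained = rng.normal(size=(5, 4))

# Pretend per-token input features (lexical attributes etc.), 6-d.
features = rng.normal(size=(5, 6))

# The "component" here is just a linear map; spaCy pretrains a CNN,
# but the objective is the same: predict the pretrained vectors.
weights, *_ = np.linalg.lstsq(features, pretrained, rcond=None)
predicted = features @ weights

# After fitting, the component reproduces the pretrained vectors.
print(np.allclose(predicted, pretrained))
```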

Is there a way to still make use of this for OOV words?

It really depends on what your "use" is. As https://stackoverflow.com/a/57665799/7961860 mentions, you can set the vectors yourself, or you can implement a user hook which will decide on how to define token.vector.
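One way to "set the vectors yourself" in the fastText spirit is to compose an OOV word's vector from character n-gram vectors, so misspellings that share substrings with known words get similar, non-zero vectors. A self-contained sketch, where the hash-derived n-gram vectors are stand-ins for real trained subword embeddings (in spaCy v2 you could plug a function like this in via a user hook on token.vector):

```python
import hashlib

DIM = 8

def ngram_vector(ngram):
    # Deterministic pseudo-vector from a hash; in real fastText these
    # would be trained subword embeddings looked up by bucket.
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]

def subword_vector(word, n=3):
    # fastText-style: pad the word with boundary markers, take its
    # character n-grams, and average their vectors. As long as the
    # substrings are known, an OOV word gets a non-zero vector.
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    vec = [0.0] * DIM
    for g in grams:
        vec = [a + b for a, b in zip(vec, ngram_vector(g))]
    return [v / len(grams) for v in vec]

# A misspelling shares most n-grams with the correct word, so it gets
# a meaningful vector instead of all zeros.
print(subword_vector("misspeling")[:3])
```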

I hope this helps. I can't really recommend the best approach for you to follow, without understanding why you want the OOV vectors / what your use-case is. Happy to discuss further in the comments!
