UnicodeDecodeError error when loading word2vec


Question

Full description

I am starting to work with word embeddings and have found a great deal of information about them. So far I understand that I can either train my own word vectors or use pre-trained ones, such as Google's or Wikipedia's. Those are available for English and are of no use to me, since I am working with texts in Brazilian Portuguese. I therefore went looking for pre-trained word vectors in Portuguese and found Hirosan's List of Pretrained Word Embeddings, which led me to Kyubyong's WordVectors, from which I learned about Rami Al-Rfou's Polyglot. After downloading both, I have been trying, unsuccessfully, to simply load the word vectors.

Short description

I can't load pre-trained word vectors; I am trying WordVectors and Polyglot.

Downloads

  • Kyubyong's pre-trained word2vector format word vectors for Portuguese;
  • Polyglot's pre-trained word vectors for Portuguese;

Loading attempts

Kyubyong's WordVectors

First attempt: using Gensim, as suggested by Hirosan:

from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)

which returns the error:

[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The downloaded zip also contains other files, but all of them return similar errors.

Polyglot

First attempt: following Al-Rfou's instructions:

import pickle
import numpy
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(pol_path, 'rb'))

which returns the error:

File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
    words, embeddings = pickle.load(open(polyglot_path, "rb"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)
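
An 'ascii' codec error raised inside pickle.load is the usual symptom of reading, from Python 3, a pickle that was written by Python 2. The following is a minimal sketch under that assumption (the thread does not confirm how polyglot-pt.pkl was produced); load_python2_pickle is a hypothetical helper name, not part of Polyglot.

```python
import pickle


def load_python2_pickle(path):
    """Load a pickle that is assumed to have been written by Python 2."""
    with open(path, "rb") as f:
        # encoding="latin1" maps every byte 0-255 to one code point, so
        # Python 2 byte strings decode without raising UnicodeDecodeError.
        return pickle.load(f, encoding="latin1")
```

With the path from the attempt above, the call would be words, embeddings = load_python2_pickle(pol_path). If any loaded words look garbled, they may have been UTF-8 byte strings, in which case word.encode("latin1").decode("utf-8") may recover the original text.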

Second attempt: using Polyglot's word embedding load function.

First, we have to install polyglot via pip:

pip install polyglot

Now we can import it:

from polyglot.mapping import Embedding
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(pol_path)

which returns the error:

File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Additional information

I am using Python 3 on macOS High Sierra.

Solution

Kyubyong's WordVectors

As pointed out by Aneesh Joshi, the correct way to load Kyubyong's model is to call the native load function of Word2Vec.

from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)

Although I am more than grateful for Aneesh Joshi's solution, Polyglot seems to be a better model for working with Portuguese. Any ideas about that one?

Recommended answer

For Kyubyong's pre-trained word2vector .bin file: it may have been saved using gensim's save function.

"Load the model with load(). Not load_word2vec_format (that's for the C-tool compatibility)."

model = Word2Vec.load(fname)
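
The byte 0x80 at position 0 in the traceback is consistent with this explanation: pickles written with protocol 2 or later start with that byte, whereas a file in the word2vec C format starts with an ASCII header line giving the vocabulary size and the vector size. A minimal sketch of a check based on those two facts follows; guess_vector_file_format is a hypothetical helper, not a gensim function, and it cannot recognise pickles written with protocol 0 or 1.

```python
import re


def guess_vector_file_format(path):
    """Guess how a vector file was written by inspecting its first bytes."""
    with open(path, "rb") as f:
        head = f.read(64)
    # Pickle protocol 2 and later begins with the PROTO opcode, byte 0x80.
    if head[:1] == b"\x80":
        return "pickle"
    # The word2vec C format begins with "<vocab_size> <vector_size>\n".
    if re.match(rb"\d+ \d+\r?\n", head):
        return "word2vec"
    return "unknown"
```

A result of "pickle" suggests trying Word2Vec.load, while "word2vec" suggests KeyedVectors.load_word2vec_format.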

Let me know if that works.

Reference: Gensim mailing list
