'utf-8' decode error when loading a word2vec model
Question
I have to use a word2vec model whose vocabulary contains a large number of Chinese characters. The model was trained by my coworkers using Java and is saved as a bin file.
I installed gensim and tried to load the model, but the following error occurred:
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data
I tried loading the model in both Python 2.7 and 3.5, and it failed the same way in each. How can I load the model in gensim? Thanks.
Recommended answer
The model was trained in Java on a corpus containing a large amount of Chinese text, and I could not figure out the encoding of the original corpus. The error can be worked around as described in the gensim FAQ, by calling load_word2vec_format with a flag that ignores character decoding errors:
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True, unicode_errors='ignore')
However, I don't know whether ignoring the encoding errors affects the loaded vocabulary.
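To get a feel for what ignoring decode errors does, here is a minimal sketch using only Python's built-in bytes decoding. As far as I understand, gensim hands the unicode_errors value to that same decoding step when it reads each word from the file, but treat that as an assumption. The sample word and the way it is truncated are made up for illustration; they are not taken from the actual bin file.

```python
# A hypothetical two-character vocabulary entry ("\u6c49\u5b57" is the word
# for "Chinese characters"), encoded as UTF-8: 3 bytes per character.
word = "\u6c49\u5b57"
raw = word.encode("utf-8")

# Simulate a token whose last character was cut off mid-sequence,
# leaving one complete character followed by two stray bytes.
truncated = raw[:-1]


def try_strict(data):
    """Decode strictly; return the text, or the error reason on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


# Strict decoding fails with the same kind of message as in the question.
strict_result = try_strict(truncated)

# errors="ignore" silently drops the undecodable bytes.
ignored = truncated.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible U+FFFD marker where bytes were lost.
replaced = truncated.decode("utf-8", errors="replace")

print(strict_result)
print(repr(ignored))
print(repr(replaced))
```

So with 'ignore', a word with broken bytes is loaded under a shortened key instead of raising an error. That suggests the vectors themselves are unaffected, but some vocabulary entries may end up with names that differ from the original tokens, and 'replace' at least makes such entries easy to spot.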