Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
Problem description
I am working on a Kaggle assignment and using the gensim package for word2vec. I can create the model and store it to disk, but when I try to load the file back I get the error below.
-HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
Traceback (most recent call last):
File "prog_w2v.py", line 7, in <module>
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
header = utils.to_unicode(fin.readline())
File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
return unicode(text, encoding, errors=errors)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
I found a similar question, but I was unable to solve the problem with it. My prog_w2v.py is as follows.
import gensim
import time
start = time.time()
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
end = time.time()
print end-start," seconds"
I am trying to generate the model using the code here. The program takes about half an hour to generate the model, so I cannot rerun it many times to debug it.
Recommended answer
You are not loading the file correctly. You should use load() instead of load_word2vec_format().
load_word2vec_format() is meant for models trained with the original C word2vec code and saved in that tool's format (binary=True selects its binary variant). You, however, trained the model in Python and did not save it in that format, so the loader fails while trying to decode the first line of the file as a UTF-8 header. You can simply use the following code and it should work:
models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
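The 0x80 byte in the error message fits this explanation. Pickle streams written with protocol 2 or later begin with the byte 0x80, whereas a file in the C word2vec format begins with a plain-text header line holding two integers (vocabulary size and vector size). The sketch below checks which case applies before choosing a loader. It assumes the file was written by the model's save() method, which is understood to pickle the object. The helper name sniff_model_format is hypothetical and is not part of gensim.

```python
def sniff_model_format(path):
    """Guess how a saved model file should be loaded by peeking at its start.

    Hypothetical helper, not part of gensim. Returns one of:
      "pickle"   - starts with 0x80, the pickle PROTO opcode (protocol >= 2)
      "word2vec" - starts with a "<vocab_size> <vector_size>" header line
      "unknown"  - anything else
    """
    with open(path, "rb") as f:
        # Read at most 100 bytes, stopping early at the first newline.
        head = f.readline(100)

    if head[:1] == b"\x80":
        return "pickle"

    fields = head.split()
    if len(fields) == 2 and all(field.isdigit() for field in fields):
        return "word2vec"

    return "unknown"
```

If sniff_model_format('300features_40minwords_10context.txt') returns "pickle", load() is the matching call. If it returns "word2vec", load_word2vec_format() is the one to try.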