错误:“ utf8”编解码器无法解码位置0的字节0x80:无效的起始字节 [英] Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

查看:326
本文介绍了错误:“ utf8”编解码器无法解码位置0的字节0x80:无效的起始字节的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试执行以下 kaggleassignmnet 。我正在使用gensim包来使用word2vec。我能够创建模型并将其存储到磁盘。但是,当我尝试重新加载文件时,出现以下错误。

I am trying to do the following kaggle assignmnet. I am using gensim package to use word2vec. I am able to create the model and store it to disk. But when I am trying to load the file back I am getting the error below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py 
Traceback (most recent call last):
  File "prog_w2v.py", line 7, in <module>
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
    header = utils.to_unicode(fin.readline())
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

我发现了类似的问题。但是我无法解决问题。我的prog_w2v.py如下。

I find similar question. But I was unable to solve the problem. My prog_w2v.py is as below.

import gensim
import time
start = time.time()    
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) 
end = time.time()   
print end-start,"   seconds"

我正在尝试使用在此处编码。该程序大约需要半小时才能生成模型。因此,我无法多次运行它来调试它。

I am trying to generate the model using code here. The program takes about half an hour to generate the model. Hence I am unable to run it many times to debug it.

推荐答案

您没有正确加载文件。您应该使用load()而不是load_word2vec_format()。
当您使用C代码训练模型并将模型保存为二进制格式时,将使用后者。但是,您不是以二进制格式保存模型,而是使用python对其进行训练。因此,您可以简单地使用以下代码,它应该可以正常工作:

You are not loading the file correctly. You should use load() instead of load_word2vec_format(). The latter is used when you train a model using the C code, and save the model in a binary format. However you are not saving the model in a binary format, and are training it using python. So you can simply use the following code and it should work:

models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')

这篇关于错误:“ utf8”编解码器无法解码位置0的字节0x80:无效的起始字节的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆