Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Views: 326 | Published: 2020/10/1 0:33:53 | Tags: python character-encoding gensim word2vec kaggle


Problem description

I am working through the following Kaggle assignment, using the gensim package for word2vec. I can create the model and store it to disk, but when I try to load the file back I get the error below.

-HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
Traceback (most recent call last):
  File "prog_w2v.py", line 7, in <module>
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
    header = utils.to_unicode(fin.readline())
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I found a similar question, but it did not help me solve the problem. My prog_w2v.py is as follows.

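import time

import gensim

start = time.time()
# This is the call that raises the UnicodeDecodeError shown above.
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
end = time.time()
# The parenthesized form prints the same text under Python 2.7 and Python 3.
print("%.2f seconds" % (end - start))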

I generated the model using the code here. The program takes about half an hour to generate the model, so I cannot run it many times to debug it.

Recommended answer

You are not loading the file correctly. Use load() instead of load_word2vec_format(). The latter is for models stored in the format of the original C word2vec tool, for example a model trained with the C code and saved in its binary format. You trained your model in Python and did not save it in that format, so load_word2vec_format() cannot parse the file. Byte 0x80 at position 0 is what a pickled file (pickle protocol 2 or later) starts with, which is consistent with a file written by gensim's own save(). The following should work:

models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
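Because retraining takes about half an hour, it may help to check which loader a file needs by looking at its first bytes instead of rerunning the whole program. The sketch below is only an illustration and is not part of gensim: guess_model_format and build_samples are hypothetical helpers. It assumes that a file written by save() is a pickle (so it starts with byte 0x80), and that a file in the C word2vec format starts with a header line holding two integers, the vocabulary size and the vector size. It uses only the standard library and stand-in files, not a real model.

```python
import os
import pickle
import tempfile

# First byte of any pickle written with protocol 2 or newer.
PICKLE_PROTO_OPCODE = b"\x80"


def guess_model_format(path):
    """Hypothetical helper: guess how a model file was written from its first line."""
    with open(path, "rb") as fin:
        first_line = fin.readline()
    if first_line.startswith(PICKLE_PROTO_OPCODE):
        # Looks like a pickle, so (by assumption) a candidate for load().
        return "pickle"
    try:
        fields = first_line.decode("utf-8").split()
    except UnicodeDecodeError:
        return "unknown"
    # The C word2vec format starts with "<vocab_size> <vector_size>".
    if len(fields) == 2 and all(field.isdigit() for field in fields):
        return "word2vec"
    return "unknown"


def build_samples():
    """Write two tiny stand-in files (not real models) and return their paths."""
    tmpdir = tempfile.mkdtemp()
    pickled_path = os.path.join(tmpdir, "model.pkl")
    with open(pickled_path, "wb") as fout:
        pickle.dump({"note": "stand-in for a pickled model"}, fout, protocol=2)
    w2v_path = os.path.join(tmpdir, "vectors.txt")
    with open(w2v_path, "wb") as fout:
        fout.write(b"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
    return pickled_path, w2v_path


if __name__ == "__main__":
    pickled_path, w2v_path = build_samples()
    print(guess_model_format(pickled_path))
    print(guess_model_format(w2v_path))
    # Decoding the pickled header as UTF-8 fails the same way as the traceback.
    with open(pickled_path, "rb") as fin:
        try:
            fin.readline().decode("utf-8")
        except UnicodeDecodeError as exc:
            print(exc)
```

If the helper reports "pickle" for 300features_40minwords_10context.txt, that supports loading it with load(). A result of "word2vec" would instead point to load_word2vec_format().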
