UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

Problem description

I am trying to read Twitter data from a JSON file using Python 2.7.12.

This is the code I used:

    import json
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    def get_tweets_from_file(file_name):
        tweets = []
        with open(file_name, 'rw') as twitter_file:
            for line in twitter_file:
                if line != '\n':
                    line = line.encode('ascii', 'ignore')
                    tweet = json.loads(line)
                    if u'info' not in tweet.keys():
                        tweets.append(tweet)
        return tweets

The result I get:

    Traceback (most recent call last):
      File "twitter_project.py", line 100, in <module>
        main()                  
      File "twitter_project.py", line 95, in main
        tweets = get_tweets_from_dir(src_dir, dest_dir)
      File "twitter_project.py", line 59, in get_tweets_from_dir
        new_tweets = get_tweets_from_file(file_name)
      File "twitter_project.py", line 71, in get_tweets_from_file
        line = line.encode('ascii', 'ignore')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I went through all the answers to similar questions and came up with this code, and it worked last time. I have no idea why it isn't working now. I would appreciate any help!

Recommended answer

It doesn't help that you have sys.setdefaultencoding('utf-8'), which confuses things further. It's a nasty hack and you need to remove it from your code. See https://stackoverflow.com/a/34378962/1554386 for more information.

The error happens because line is a byte string (a Python 2 str) and you're calling encode() on it. encode() only makes sense on a Unicode string, so Python first tries to convert line to Unicode using the default encoding, which in your case is UTF-8 (because of the setdefaultencoding call) but should be ASCII. Either way, 0x80 is not valid ASCII and is not a valid UTF-8 start byte, so the implicit decode fails.

0x80 is valid in some character sets. In windows-1252/cp1252 it's € (the euro sign).
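To make the point about 0x80 concrete, here is a minimal sketch that spells out the decode step Python 2 performs implicitly. The helper name decode_or_reason and the sample bytes are made up for illustration; only bytes.decode() and UnicodeDecodeError.reason come from Python itself.

```python
def decode_or_reason(raw, codec):
    """Return (text, None) on success, or (None, reason) if decoding fails."""
    try:
        return raw.decode(codec), None
    except UnicodeDecodeError as exc:
        return None, exc.reason


raw = b'caf\x80'  # hypothetical data containing the offending byte 0x80

# Try the two codecs involved in the traceback, then windows-1252.
for codec in ('ascii', 'utf-8', 'cp1252'):
    text, reason = decode_or_reason(raw, codec)
    if reason is None:
        print('{0}: decoded {1!r}'.format(codec, text))
    else:
        print('{0}: failed ({1})'.format(codec, reason))
```

Only the cp1252 attempt succeeds, which suggests (but does not prove) that the file was written on a system using that code page.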

The trick here is to know the encoding of your data all the way through your code. At the moment, you're leaving too much to chance. Unicode string types are a handy Python feature: they let you decode encoded strings once and forget about the encoding until you need to write or transmit the data.
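As a rough illustration of that "decode on input, encode on output" pattern, the sketch below uses a hypothetical function named shout and assumes the input bytes are windows-1252. The processing step (upper-casing) is a stand-in for whatever your code does with the text.

```python
def shout(raw_bytes, source_encoding):
    """Decode at the input boundary, work on text, encode at the output boundary."""
    text = raw_bytes.decode(source_encoding)  # bytes -> Unicode text
    text = text.upper()                       # all processing happens on text
    return text.encode('utf-8')               # text -> bytes only when writing out


# Hypothetical input: 0x80 is the euro sign in windows-1252.
result = shout(b'price: 5\x80', 'cp1252')
print(repr(result))
```

Between the decode and the encode, the code never has to know which encoding the bytes arrived in.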

Use the io module to open the file in text mode and decode it as it is read: no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252.

    import io
    import json

    with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
        for line in twitter_file:
            # line is now a <type 'unicode'>, already decoded from windows-1252
            tweet = json.loads(line)
The io module also provides universal newlines. This means that \n, \r, and \r\n line endings are all recognized and translated to \n, so you don't have to check for them yourself.
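Putting the pieces together, the function from the question might end up looking roughly like this. It is a sketch, not a drop-in fix: it assumes the file is windows-1252 with one JSON object per line, and the encoding parameter is an addition that was not in the original code.

```python
import io
import json


def get_tweets_from_file(file_name, encoding='windows-1252'):
    """Return the tweets in file_name, skipping blank lines and 'info' records."""
    tweets = []
    with io.open(file_name, 'r', encoding=encoding) as twitter_file:
        for line in twitter_file:
            if not line.strip():
                continue  # blank line, whatever its original line ending was
            tweet = json.loads(line)  # line is already Unicode text
            if u'info' not in tweet:
                tweets.append(tweet)
    return tweets
```

Note that the reload(sys), sys.setdefaultencoding('utf-8'), and line.encode('ascii', 'ignore') lines are gone: the text is decoded once by io.open and json.loads accepts it as is.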
