UnicodeDecodeError on Python 2.7


Question

I'm running a Twitter sentiment analysis on a dataset of 1.6 million tweets. My PC could not handle the computation, so my professor told me to use the university server.

I just realized that the server runs Python 2.7, which does not let me use the encoding parameter when reading the file for the csv reader.

Every time I get the UnicodeDecodeError, I have to manually remove the offending tweet from the dataset; otherwise I can't continue. I have gone through the related questions on this site, but none of them resolved the problem.

I just want to skip any line that raises the error, since the dataset is large enough for a good analysis without it.

import codecs
import csv

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8", errors='ignore')

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8", errors='replace') for s in row]
    def __iter__(self):
        return self

def extraction(file, textCol, sentimentCol):
    "The function reads the tweets"
    #fp = open(file, "r",encoding="utf8")
    fp = open(file, "r")
    tweetreader = UnicodeReader(fp)
    #tweetreader = csv.reader( fp, delimiter=',', quotechar='"', escapechar='\\' )
    tweets = []
    for row in tweetreader:
        # It takes the column in which the tweets and the sentiment are
        if row[sentimentCol] == 'positive' or row[sentimentCol] == '4':
            tweets.append([remove_stopwords(row[textCol]), 'positive'])
        elif row[sentimentCol] == 'negative' or row[sentimentCol] == '0':
            tweets.append([remove_stopwords(row[textCol]), 'negative'])
        elif row[sentimentCol] == 'irrilevant' or row[sentimentCol] == '2' or row[sentimentCol] == 'neutral':
            tweets.append([remove_stopwords(row[textCol]), 'neutral'])

    tweets = filterWords(tweets)
    fp.close()
    return tweets

Error:

Traceback (most recent call last):
  File "sentimentAnalysis_v4.py", line 165, in <module>
    newTweets = extraction("sentiment2.csv",5,0)
  File "sentimentAnalysis_v4.py", line 47, in extraction
    for row in tweetreader:
  File "sentimentAnalysis_v4.py", line 29, in next
    row = self.reader.next()
  File "sentimentAnalysis_v4.py", line 19, in next
    return self.reader.next().encode("utf-8", errors='ignore')
  File "/usr/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/usr/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd9 in position 48: invalid continuation byte

Recommended answer

If your input data is malformed, I would not use codecs to do the reading here.

Use the newer io.open() function and specify an error handling strategy; 'replace' should do:

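import io

class ForgivingUTF8Recoder:
    # Takes a filename rather than an open file. UnicodeReader has to
    # construct this class in place of UTF8Recoder for it to take effect.
    def __init__(self, filename, encoding):
        # errors='replace' turns undecodable bytes into U+FFFD instead of raising
        self.reader = io.open(filename, newline='', encoding=encoding, errors='replace')
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8", errors='ignore')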

I set newline to '' so that the csv module gets to handle newlines inside values correctly.
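To see what the error handling strategy changes, here is a minimal sketch that writes a small file containing the same kind of stray 0xd9 byte as in the traceback and reads it back twice. The sample data and the variable names are made up for illustration; only io.open() and its encoding, errors and newline parameters come from the answer.

```python
import io
import os
import tempfile

# Hypothetical sample: the second line holds a stray 0xd9 byte, which is
# not valid UTF-8 when followed by a space.
raw = b'4,good day\n0,caf\xd9 closed\n'
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as handle:
    handle.write(raw)

# Strict decoding (the default) raises on the bad byte.
try:
    with io.open(path, encoding='utf-8', errors='strict', newline='') as strict_file:
        strict_file.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for the bad byte and keeps reading.
with io.open(path, encoding='utf-8', errors='replace', newline='') as forgiving_file:
    lines = list(forgiving_file)

os.remove(path)
```

With 'replace', every line is still delivered; the damage is confined to the one character that could not be decoded.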

Instead of passing in an open file, just pass in the filename:

tweetreader = UnicodeReader(file)
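For comparison, the same idea can be sketched as a single generator function instead of the two classes. This is not the answer's code: read_rows and PY2 are hypothetical names, and the version check is an assumption added so the sketch runs on both Python 2.7 and Python 3.

```python
import csv
import io
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2

def read_rows(filename, encoding='utf-8', **kwds):
    """Yield each CSV row as a list of text strings.

    Bytes that cannot be decoded are replaced with U+FFFD.
    """
    with io.open(filename, newline='', encoding=encoding, errors='replace') as f:
        if PY2:
            # Python 2's csv module works on byte strings:
            # re-encode each line, parse it, then decode each cell.
            encoded = (line.encode('utf-8') for line in f)
            for row in csv.reader(encoded, **kwds):
                yield [cell.decode('utf-8') for cell in row]
        else:
            # Python 3's csv module works on text directly.
            for row in csv.reader(f, **kwds):
                yield row

# Demo on a throwaway file with one quoted field and one bad byte.
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as handle:
    handle.write(b'4,"good, day"\n0,caf\xd9 closed\n')
rows = list(read_rows(path))
os.remove(path)
```

A generator keeps only one row in memory at a time, which may matter for a file with 1.6 million tweets.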

This won't let you skip faulty lines. Instead, it handles them by replacing characters that cannot be decoded with U+FFFD REPLACEMENT CHARACTER; you can still look for that character in your columns if you want to skip the whole row.
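Skipping those rows could look like the sketch below. The names skip_damaged_rows and REPLACEMENT_CHAR are hypothetical, and the sample rows are made up; the function assumes each row is a list of unicode strings, as UnicodeReader returns.

```python
REPLACEMENT_CHAR = u'\ufffd'

def skip_damaged_rows(rows):
    """Yield only the rows in which no cell contains U+FFFD."""
    for row in rows:
        if any(REPLACEMENT_CHAR in cell for cell in row):
            continue  # at least one byte in this row failed to decode
        yield row

# Made-up rows standing in for what the reader would produce.
sample = [
    [u'4', u'good day'],
    [u'0', u'caf\ufffd closed'],
    [u'2', u'ok'],
]
clean = list(skip_damaged_rows(sample))
```

In extraction() this would wrap the reader, for example `for row in skip_damaged_rows(tweetreader):`. Note that a tweet which legitimately contains U+FFFD would be dropped as well.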
