Re-encode Unicode stream as Ascii ignoring errors


Problem description


I'm trying to take a Unicode file stream, which contains odd characters, and wrap it with a stream reader that will convert it to Ascii, ignoring or replacing all characters that can't be encoded.

My stream looks like:

"EventId","Rate","Attribute1","Attribute2","(。・ω・。)ノ"
...


My attempt to alter the stream on the fly looks like this:

import chardet, io, codecs

with open(self.csv_path, 'rb') as rawdata:
    detected = chardet.detect(rawdata.read(1000))

detectedEncoding = detected['encoding']
with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = codecs.getreader('ascii')(csv_file, errors='ignore')
    log( csv_ascii_stream.read() )

The resulting line from log is: UnicodeEncodeError: 'ascii' codec can't encode characters in position 36-40: ordinal not in range(128), even though I used errors='ignore'


I would like the resulting stream (when read) to come out like this:

"EventId","Rate","Attribute1","Attribute2","(?????)?"
...

or (with 'ignore' instead of 'replace'): "EventId","Rate","Attribute1","Attribute2","()"

Why is the exception still happening?


I've seen plenty of problems/solutions for decoding strings, but my challenge is to change the stream as it's being read (using .next()), because the file is potentially too large to be loaded into memory all at once using .read()

Recommended answer


You're mixing up the encode and decode sides.


For decoding, you're doing fine. You open it as binary data, chardet the first 1K, then reopen in text mode using the detected encoding.


But then you're trying to further decode that already-decoded data as ASCII, by using codecs.getreader. That function returns a StreamReader, which decodes data from a stream. That isn't going to work. You need to encode that data to ASCII.
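To see the distinction concretely, here is a minimal sketch (not from the original answer, with made-up sample data) showing that a StreamReader decodes bytes into text, so its errors='ignore' applies only to that decode step:

```python
import codecs
import io

# getreader wraps a *byte* stream and decodes it into text; errors='ignore'
# applies to this decoding step, not to any later encoding.
raw = io.BytesIO(u'caf\u00e9'.encode('utf-8'))
reader = codecs.getreader('ascii')(raw, errors='ignore')
text = reader.read()  # the two UTF-8 bytes for é are not valid ASCII and are dropped
```

Handing such a reader an already-decoded text stream, as in the question, is going in the wrong direction entirely.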


But it's not clear why you're using a codecs stream decoder or encoder in the first place, when all you want to do is encode a single chunk of text in one go so you can log it. Why not just call the encode method?

log(csv_file.read().encode('ascii', 'ignore'))
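As a quick illustration (using the question's header line as sample data), here is how 'ignore' and 'replace' differ when encoding to ASCII:

```python
# The header line from the question; the characters between the last pair of
# quotes are the only ones that can't be represented in ASCII.
line = u'"EventId","Rate","Attribute1","Attribute2","(\u3002\u30fb\u03c9\u30fb\u3002)\u30ce"'
ignored = line.encode('ascii', 'ignore')    # unencodable characters are dropped
replaced = line.encode('ascii', 'replace')  # each unencodable character becomes b'?'
```

The 'replace' variant produces the "(?????)?" output shown in the question; 'ignore' drops those characters entirely.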


If you want something that you can use as a lazy iterable of lines, you could build something fully general, but it's a lot simpler to just do something like the UTF8Recoder example in the csv docs:

import codecs

class AsciiRecoder:
    """Iterate over a byte stream f in the given encoding, yielding each
    line re-encoded to ASCII with unencodable characters dropped."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("ascii", "ignore")

Or, more simply:

with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = (line.encode('ascii', 'ignore') for line in csv_file)
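For illustration, the generator expression reads and encodes one line at a time rather than loading the whole file; this sketch substitutes io.StringIO with made-up data for the opened file:

```python
import io

# Hypothetical stand-in for csv_file; any text-mode stream behaves the same.
csv_file = io.StringIO(u'"EventId","Rate","(\u3002\u30fb\u03c9\u30fb\u3002)\u30ce"\n"1","2.5","x"\n')
csv_ascii_stream = (line.encode('ascii', 'ignore') for line in csv_file)

first = next(csv_ascii_stream)  # only this one line has been read and encoded so far
rest = list(csv_ascii_stream)
```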
