Python: process a CSV file to remove Unicode characters greater than 3 bytes

Question

I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove unicode characters that are greater than 3 bytes. (Sending this to Mechanical Turk, and it's an Amazon restriction.)

I've tried to use the top (amazing) answer in this question (How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?). I assume that I can just iterate through the csv row-by-row, and wherever I spot unicode characters of >3 bytes, replace them with a replacement character.

# -*- coding: utf-8 -*-
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
ifile  = open('sourcefile.csv', 'rU')
reader = csv.reader(ifile, dialect=csv.excel_tab)
ofile  = open('outputfile.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

#skip header row
next(reader, None)

for row in reader:
    writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c).encode('utf8')) for c in row])

ifile.close()
ofile.close()

I'm currently getting this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)

So this does iterate properly through some rows, but stops when it gets to the strange unicode characters.

I'd really appreciate some pointers; I'm completely confused. I've tried replacing 'utf8' with 'latin1', and changing unicode(c).encode to unicode(c).decode, and I keep getting this same error.

Recommended answer

Your input is still encoded data, not Unicode values. You'd need to decode to unicode values first, but you didn't specify an encoding to use. You then need to encode again back to encoded values to write back to the output CSV:

writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c, 'utf8')).encode('utf8')
                 for c in row])
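
To make the decode, substitute, encode round trip concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the question targets Python 2.7), so the byte string is decoded with an explicit .decode('utf-8') rather than unicode(c, 'utf8'). The helper name limit_to_bmp_bytes and the sample string are hypothetical and only for illustration.

```python
import re

# Same character class as in the answer: anything outside the Basic
# Multilingual Plane (and any lone surrogate) is matched.
re_pattern = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def limit_to_bmp_bytes(raw):
    """Decode UTF-8 bytes, replace non-BMP characters, re-encode to UTF-8."""
    text = raw.decode('utf-8')                # bytes -> text
    cleaned = re_pattern.sub('\uFFFD', text)  # 4-byte characters -> U+FFFD
    return cleaned.encode('utf-8')            # text -> bytes

# Hypothetical sample: "café" followed by an emoji that needs 4 bytes in UTF-8.
sample = 'caf\u00e9 \U0001F600'.encode('utf-8')
result = limit_to_bmp_bytes(sample)

# Every remaining character now fits in at most 3 UTF-8 bytes.
longest = max(len(ch.encode('utf-8')) for ch in result.decode('utf-8'))
```

The replacement character U+FFFD itself encodes to 3 bytes in UTF-8, so the output stays within the limit.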

Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
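
The failure can be reproduced in isolation. The sketch below is Python 3 (an assumption, since the question uses Python 2.7), where the implicit ASCII decode that unicode(c) performs in Python 2 has to be written out explicitly. The sample bytes are hypothetical; they were chosen because they start with 0xea, the byte named in the traceback.

```python
# UTF-8 encoding of U+AC00; the first byte is 0xea, as in the error message.
raw = b'\xea\xb0\x80'

ascii_error = None
try:
    # Roughly what unicode(c) does in Python 2 when no codec is given.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Naming the codec that matches the data succeeds.
decoded = raw.decode('utf-8')
```

Here ascii_error holds a message of the same shape as the one in the question, while decoded is the single character U+AC00.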

If you use your file objects as context managers, there is no need to manually close them:

import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def limit_to_BMP(value, patt=re_pattern):
    return patt.sub(u'\uFFFD', unicode(value, 'utf8')).encode('utf8')

with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # header is not added to output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)

I moved the replacement action to a separate function too, and used a generator expression to produce all rows on demand for the writer.writerows() function.
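
For readers on Python 3, a rough equivalent might look like the sketch below. It is an adaptation, not part of the original answer: in Python 3 the csv module works on text, so the decoding and encoding happen in the file objects and the helper only does the substitution. The function clean_csv and the in-memory sample data are hypothetical names introduced for illustration.

```python
import csv
import io
import re

re_pattern = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def limit_to_bmp(value, patt=re_pattern):
    # Values are already text in Python 3; only the substitution remains.
    return patt.sub('\uFFFD', value)

def clean_csv(src, dst):
    """Copy tab-separated rows from src to dst as fully quoted CSV."""
    reader = csv.reader(src, dialect=csv.excel_tab)
    writer = csv.writer(dst, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # header is not added to the output
    # Generator expression: rows are cleaned on demand as writerows() consumes them.
    writer.writerows([limit_to_bmp(c) for c in row] for row in reader)

# In-memory stand-ins for the source and output files.
src = io.StringIO('id\ttext\n1\thello \U0001F600\n', newline='')
dst = io.StringIO(newline='')
clean_csv(src, dst)
output = dst.getvalue()
```

With real files, the same function could be called on objects opened as open('sourcefile.csv', encoding='utf-8', newline='') and open('outputfile.csv', 'w', encoding='utf-8', newline=''), assuming the source file is UTF-8 encoded.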
