Is there any way for Pandas' read_csv C engine to ignore or replace Unicode parsing errors?
Question
Most questions about reading strings from disk in Python involve codec issues. In contrast, I have a CSV file that simply contains garbage data. Here's how to create an example:
b = bytearray(b'a,b,c\n1,2,qwe\n10,-20,asdf')
b[10] = 0xff
b[11] = 0xff
with open('foo.csv', 'wb') as fid:
    fid.write(b)
Note that the second row, third column contains two 0xFF bytes, which are not valid in UTF-8 and are just a small amount of garbage data.
import pandas as pd
df = pd.read_csv('foo.csv') # fails
Naturally, I get an error:
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
...
File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I can, however, read this file successfully if I use Pandas' Python CSV engine:
df2 = pd.read_csv('foo.csv', engine='python') # success
In this case, each invalid byte is replaced with U+FFFD, the REPLACEMENT CHARACTER that Unicode uses to stand in for data that could not be decoded.
Question: is there any way for Pandas' C CSV engine to do the same thing as the Python engine here?
Recommended answer
The replacement of invalid characters you see with the Python engine corresponds to the errors='replace' mode used when decoding a bytes-like object.
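As a minimal sketch of what the error modes do, the snippet below decodes the three bytes that make up the damaged field in the sample file (FF FF followed by "e"). It uses only the built-in bytes.decode; the variable names are illustrative.

```python
# The two garbage bytes from the example file, followed by the valid "e".
raw = b"\xff\xffe"

# Strict decoding (the default) raises, as in the C engine traceback.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops the undecodable bytes instead.
ignored = raw.decode("utf-8", errors="ignore")

print(strict_failed, repr(replaced), repr(ignored))
```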
You could read the CSV using an arbitrary single-byte encoding (latin-1, for example, which accepts every byte value) and then transcode columns with this error mode, either by passing a converter to read_csv or by using the series.str.encode/decode methods. This is quite cumbersome, though, since you have to identify the specific set of columns to fix.
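A rough sketch of the per-column transcoding step, assuming the file was read with latin-1 as the single-byte encoding. The helper name recode_utf8_replace is hypothetical, not part of pandas.

```python
def recode_utf8_replace(value: str) -> str:
    """Undo a latin-1 decode and redo it as UTF-8 with errors='replace'."""
    # latin-1 maps every byte 0x00-0xFF to the code point of the same value,
    # so encoding back to latin-1 recovers the original bytes exactly.
    return value.encode("latin-1").decode("utf-8", errors="replace")


# What a latin-1 read would produce for the bytes FF FF 65 in the sample file.
as_latin1 = b"\xff\xffe".decode("latin-1")

# Transcoding turns the two stray bytes into two U+FFFD characters.
fixed = recode_utf8_replace(as_latin1)
print(repr(as_latin1), repr(fixed))
```

With the sample file, where the text lives in column c, a helper like this could be passed as converters={'c': recode_utf8_replace} together with encoding='latin-1' in the read_csv call, or applied afterwards with df['c'].map(recode_utf8_replace). Either way, the columns to fix have to be listed by hand.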
For a global effect, since read_csv does not (yet) support an errors parameter, you can pre-open the file with the Python built-in open, which does support it, and pass the resulting file object to read_csv.
with open('foo.csv', encoding='utf-8', errors='replace') as fid:
    df = pd.read_csv(fid)