Is there any way for Pandas' read_csv C engine to ignore or replace Unicode parsing errors?

Question

Most questions about reading strings from disk in Python involve codec issues. In contrast, I have a CSV file that simply has garbage data in it. Here's how to create an example:

b = bytearray(b'a,b,c\n1,2,qwe\n10,-20,asdf')
b[10] = 0xff
b[11] = 0xff
with open('foo.csv', 'wb') as fid:
    fid.write(b)

Note that the third column of the second line (the first data row) contains two 0xFF bytes, which are not valid UTF-8 text, just a small amount of garbage data.

When I try to read this file with:

import pandas as pd
df = pd.read_csv('foo.csv') # fails

I naturally get an error:

  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  ...
  File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

However, I can successfully read this file if I use Pandas' Python CSV engine:

df2 = pd.read_csv('foo.csv', engine='python') # success

In this case, the invalid bytes are replaced with U+FFFD, the Unicode replacement character used to represent undecodable input (its UTF-8 encoding is the byte sequence EF BF BD).

Question: is there any way for Pandas' C CSV engine to do the same thing as the Python engine here?

Recommended answer

The replacement of invalid characters you see with the Python engine corresponds to the errors='replace' mode used when decoding a bytes-like object.
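To make the error mode concrete, here is a minimal sketch using only the standard bytes.decode method. The variable names are illustrative and not part of the original question.

```python
# Garbage bytes like those in the third column of foo.csv
raw = b'\xff\xffe'

# Strict decoding (the default) raises, which is what the C engine ran into
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for each undecodable byte
replaced = raw.decode('utf-8', errors='replace')

# errors='ignore' silently drops the undecodable bytes instead
ignored = raw.decode('utf-8', errors='ignore')

print(strict_failed, repr(replaced), repr(ignored))
```

The 'replace' mode keeps a visible marker where data was lost, while 'ignore' hides the damage, so 'replace' is usually the safer choice when the goal is to inspect the file afterwards.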

You could read the CSV with an arbitrary single-byte encoding and then transcode the affected columns with this error mode (by passing a converter to read_csv or by using the Series.str.encode/decode methods), but that is quite cumbersome because you have to identify the specific set of columns.
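The per-column approach might look roughly like the sketch below. It assumes latin-1 as the single-byte encoding and assumes that column 'c' is the only one that needs transcoding; both choices are illustrative, and the temporary file path is only there to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Recreate the sample file from the question in a temporary directory
data = bytearray(b'a,b,c\n1,2,qwe\n10,-20,asdf')
data[10] = 0xff
data[11] = 0xff
path = os.path.join(tempfile.mkdtemp(), 'foo.csv')
with open(path, 'wb') as fid:
    fid.write(data)

# latin-1 maps every byte to a code point, so the C engine never fails to decode
df = pd.read_csv(path, encoding='latin-1', engine='c')

# Assumption: 'c' is the only text column that may contain corrupted bytes.
# Encode back to the original bytes, then decode as UTF-8 with replacement.
df['c'] = df['c'].str.encode('latin-1').str.decode('utf-8', errors='replace')

print(df)
```

Because latin-1 round-trips every byte value losslessly, valid UTF-8 text in the column is recovered intact and only the undecodable bytes become U+FFFD. The drawback is the one noted above: each text column has to be listed explicitly.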

For a global effect, since read_csv does not (yet) support an errors parameter, you can pre-open the file with Python's built-in open, which does support it:

with open('foo.csv', encoding='utf-8', errors='replace') as fid:
    df = pd.read_csv(fid)
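Later pandas releases (1.3.0 onward, to the best of my knowledge) added an encoding_errors argument to read_csv, which should make the pre-open step unnecessary. The sketch below assumes such a version is installed and compares that argument with the pre-opened file approach; the temporary path and variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Recreate the sample file from the question in a temporary directory
data = bytearray(b'a,b,c\n1,2,qwe\n10,-20,asdf')
data[10] = 0xff
data[11] = 0xff
path = os.path.join(tempfile.mkdtemp(), 'foo.csv')
with open(path, 'wb') as fid:
    fid.write(data)

# Assumes pandas >= 1.3: let the C engine replace undecodable bytes itself
df = pd.read_csv(path, engine='c', encoding='utf-8',
                 encoding_errors='replace')

# For comparison: the pre-opened file approach from the answer
with open(path, encoding='utf-8', errors='replace') as fid:
    df_preopened = pd.read_csv(fid)

print(df)
print(df_preopened)
```

On older pandas versions the encoding_errors keyword is not accepted, so the pre-opened file remains the fallback there.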
