Python pandas load csv ANSI Format as UTF-8


Problem description

I want to load a CSV file with pandas in a Jupyter Notebook. The file contains characters such as ä, ö, ü, and ß.

When I open the CSV file in Notepad++, which reports its format as ANSI, this is an example row that causes trouble:

Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand

The correct UTF-8 outcome for Empf„nger should be: Empfänger

When I load the CSV data with pandas in Python 3.6 on Windows using the following code:

df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')

I get this error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte

Here 'xy' stands for the position of the character that triggers the error.

When I load the CSV file with the ANSI encoding instead, it works, but the umlauts are displayed incorrectly.

Example code:

df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')

Empfänger is represented as: Empf„nger
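
The symptom above can be reproduced without the file. The sketch below rests on an assumption the question does not confirm: that the file was written in an OEM code page such as cp850 and is then read as Windows-1252 (the usual meaning of "ANSI" on a Western European Windows system). The sample string is a shortened stand-in for the header row.

```python
# Assumption: the file bytes were produced with cp850, not with the ANSI code page.
original = "Empfänger;EmpfängerStraße"

# Bytes as they would sit on disk under that assumption.
raw = original.encode("cp850")

# Reading those bytes as cp1252 gives the garbled text shown in the question.
misread = raw.decode("cp1252")
print(misread)

# Reading them with the codec they were written in restores the umlauts.
fixed = raw.decode("cp850")
print(fixed)
```

If this assumption holds, the file does not need to be converted; it only needs to be read with the codec it was written in.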

Note: I have tried converting the file to UTF-8 in Notepad++ and loading it afterwards with pandas, but I still get the same error.

I have searched online for a solution, but the suggested fixes, such as "change the format in Notepad++ to UTF-8", "use encoding='UTF-8'", 'latin1' (which gives me the same result as the ANSI format), or

import chardet

with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())

df_a = pd.read_csv('afile.csv',sep=';',encoding=result['encoding'])

did not work for me.

encoding='cp1252'

throws the following exception:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>
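
Both error messages can be traced to individual bytes. The sketch below is a minimal illustration on made-up sample bytes (the helper `try_decode` and the sample are hypothetical, not taken from the file): 0xe1 followed by a plain ASCII letter is not valid UTF-8, 0x81 has no character assigned in cp1252, and both bytes are ordinary letters in cp850.

```python
# Hypothetical sample: "Straße;Tür" as it would be stored in cp850.
sample = b"Stra\xe1e;T\x81r"


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, or return the decoder's reason if it fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error at byte {exc.start}: {exc.reason}"


results = {enc: try_decode(sample, enc) for enc in ("utf-8", "cp1252", "cp850")}
for enc, outcome in results.items():
    print(f"{enc:7} -> {outcome}")
```

A codec that raises on one byte and garbles another is a hint that the file was written in some other code page, which is why checking several candidates on a short sample is often quicker than guessing.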

I also tried replacing the strings afterwards with the x.replace() method, but the character ü disappears completely once the data is loaded into a pandas DataFrame.

Recommended answer

If you don't know your file's encoding, I think the fastest approach is to open the file in a text editor such as Notepad++ and check how it is encoded.

Then go to the Python documentation and look up the correct codec to use.

In your case, ANSI, the codec is 'mbcs', so your code will look like this:

df_a = pd.read_csv('file.csv',sep=';',encoding='mbcs')
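
Two caveats, offered as a hedged sketch rather than a definitive fix. First, 'mbcs' exists only on Windows and maps to the system's ANSI code page, so on a Western European system it may behave like cp1252 and still show „ instead of ä. Second, the sample row in the question looks consistent with an OEM code page such as cp850, though that is an assumption. The helper below (`pick_encoding` is a hypothetical name, and the header bytes are a stand-in for the first line of the file) tries a list of candidate codecs and keeps the first one that yields a word you know should appear.

```python
from typing import Iterable, Optional


def pick_encoding(raw: bytes, candidates: Iterable[str], must_contain: str) -> Optional[str]:
    """Return the first codec that decodes raw cleanly and yields must_contain."""
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers codecs missing on this platform, e.g. 'mbcs' off Windows.
            continue
        if must_contain in text:
            return enc
    return None


# Stand-in for the first line of the file, read in binary mode.
header = "Empfänger;EmpfängerStraße".encode("cp850")

encoding = pick_encoding(header, ["utf-8", "mbcs", "cp1252", "cp850"], "Empfänger")
print(encoding)
```

With a real file, the header bytes would come from `open('file.csv', 'rb').readline()`, and the resulting name can be passed on as `pd.read_csv('file.csv', sep=';', encoding=encoding)`.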
