Python pandas: load CSV in ANSI format as UTF-8
Problem description
I want to load a CSV file with pandas in Jupyter Notebook; the file contains characters such as ä, ö, ü, ß.
When I open the CSV file in Notepad++, here is one example row, shown in ANSI format, that causes trouble:
Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand
The correct UTF-8 outcome for Empf„nger should be: Empfänger
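One way to see where such a garbled row could come from is to reproduce it. The sketch below assumes, without confirmation from the question, that the file was written in the DOS/OEM code page cp850 and is being displayed as Windows-1252 (what Notepad++ calls ANSI on a Western European system). The sample string is made up from the column names in the row above.

```python
# Assumption: bytes on disk are cp850 (DOS/OEM) but are viewed as cp1252 ("ANSI").
original = "Empfänger;EmpfängerStraße"

raw = original.encode("cp850")        # bytes as they would sit on disk
seen_as_ansi = raw.decode("cp1252")   # what an ANSI viewer would display

print(seen_as_ansi)
```

If the printed text matches the row shown by Notepad++, the file is probably not ANSI at all, and the mismatch lies between the file's real code page and the one used to read it.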
Now, when I load the CSV data with pandas in Python 3.6 on Windows using the following code:
df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')
I get the error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte
Position 'xy' is the position of the character that causes the error message.
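The same error can be reproduced without the file. This is a minimal sketch using a hypothetical byte string in which ß is stored as the single byte 0xE1, as the error message suggests; the sample text is not taken from the real file.

```python
# Hypothetical sample: "Straße 5" with ß stored as the single byte 0xE1.
raw = b"Stra\xe1e 5"

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # In UTF-8, 0xE1 announces a three-byte sequence, but the next byte
    # here is plain ASCII, so the decoder gives up at that position.
    failure = (err.start, err.reason)

print(failure)
```

A failure of this kind usually means the file is in a single-byte code page rather than UTF-8, so the remaining question is which code page.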
When I load the CSV file with the ANSI encoding it works, but the umlauts are displayed incorrectly.
Example code:
df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')
Empfänger is represented as: Empf„nger
Note: I have tried converting the file to UTF-8 in Notepad++ and loading it afterwards with pandas, but I still get the same error.
I have searched online for a solution, but the suggestions I found, such as "change the format in Notepad++ to UTF-8", "use encoding='UTF-8'", 'latin1' (which gives me the same result as the ANSI format), or
import chardet
import pandas as pd
# Guess the encoding from the first line of the file only
with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())
df_a = pd.read_csv('afile.csv',sep=';',encoding=result['encoding'])
did not work for me.
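When automatic detection fails, one alternative is to try a short list of candidate codecs and keep those that produce only the characters expected in the data. The sketch below uses only the standard library. The helper name `plausible_codecs`, the candidate list, and the sample bytes are all hypothetical, and the sample assumes the file is really cp850.

```python
# Characters the question says the file should contain.
EXPECTED = set("äöüßÄÖÜ")

def plausible_codecs(raw, candidates=("utf-8", "cp1252", "latin1", "cp850")):
    """Return the candidate codecs whose decoded non-ASCII characters are all expected."""
    hits = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes at all
        non_ascii = {ch for ch in text if ord(ch) > 127}
        if non_ascii and non_ascii <= EXPECTED:
            hits.append(name)
    return hits

# Hypothetical sample bytes, assuming the real encoding is cp850.
sample = "Empfänger;Straße;Grün".encode("cp850")
print(plausible_codecs(sample))
```

Related code pages such as cp437 map these particular bytes the same way, so a check like this narrows the choice rather than proving it.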
encoding='cp1252'
throws the following exception:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>
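The byte 0x81 named in this exception is a useful clue, because different single-byte code pages treat it very differently. The sketch below compares three of them. Treating cp850 as the file's real encoding is an assumption, not something the question confirms.

```python
byte = b"\x81"

# cp1252 leaves 0x81 unassigned, so decoding fails.
try:
    byte.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# latin1 maps every byte, but 0x81 becomes an invisible control character.
as_latin1 = byte.decode("latin1")

# cp850 (assumed here to be the real encoding) maps 0x81 to ü.
as_cp850 = byte.decode("cp850")

print(cp1252_ok, repr(as_latin1), as_cp850)
```

The invisible control character may also be why ü seems to vanish rather than show up as a wrong letter when the file is read with a permissive codec.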
I also tried to replace the strings afterwards with the x.replace() method, but the character ü disappears completely once the data is loaded into a pandas DataFrame.
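Instead of replacing characters one by one, text that was decoded with the wrong codec can often be repaired by reversing that decode and redoing it with the right one. This works losslessly when the wrong codec was latin1, since latin1 maps every byte to exactly one code point. The helper name `fix_mojibake` is hypothetical, and cp850 as the correct encoding is an assumption.

```python
def fix_mojibake(text, wrong="latin1", right="cp850"):
    """Undo a decode done with `wrong` and redo it with `right`."""
    return text.encode(wrong).decode(right)

# What latin1 would produce from the cp850 bytes of "Empfänger;Grün".
loaded = "Empf\x84nger;Gr\x81n"
print(fix_mojibake(loaded))
```

On a DataFrame that was loaded with encoding='latin1', the same function could be applied to a string column with `Series.map`. Reading the file with the correct encoding in the first place is simpler.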
Recommended answer
If you don't know your file's encoding, I think the fastest approach is to open the file in a text editor such as Notepad++ and check how it is encoded.
Then go to the Python documentation and look up the correct codec to use.
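A codec name found in the documentation can be checked against the running interpreter before it is passed to pandas. The sketch below uses the standard `codecs.lookup` function. The helper name `codec_exists` and the list of names tried are only illustrative.

```python
import codecs

def codec_exists(name):
    """Return True if this Python installation knows the codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

for name in ("cp1252", "cp850", "latin1", "no-such-codec"):
    print(name, codec_exists(name))
```

Some codecs are platform-specific, so a name that resolves on one machine may raise LookupError on another.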
In your case, ANSI, the codec is 'mbcs', so your code will look like this:
df_a = pd.read_csv('file.csv',sep=';',encoding='mbcs')
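Note that 'mbcs' exists only on Windows and decodes with the system's ANSI code page. If the bytes quoted in the question (0x84, 0xE1, 0x81) really stand for ä, ß and ü, the file is more likely in a DOS/OEM code page such as cp850. Under that assumption, the sketch below rewrites a file as UTF-8 using only the standard library. The function name `convert_to_utf8`, the temporary file names, and the sample content are hypothetical.

```python
import os
import tempfile

def convert_to_utf8(src, dst, source_encoding="cp850"):
    """Rewrite src as UTF-8, decoding it with source_encoding (an assumption)."""
    with open(src, "r", encoding=source_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)

# Demo with a throwaway file standing in for the real CSV.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "afile.csv")
dst = os.path.join(tmpdir, "afile_utf8.csv")
with open(src, "wb") as f:
    f.write("Empfänger;EmpfängerStadt\r\nMüller;Köln\r\n".encode("cp850"))

convert_to_utf8(src, dst)
with open(dst, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)
```

The converted file could then be read with the call from the question, using encoding='utf-8'. Since `pd.read_csv` accepts any Python codec name in its `encoding` parameter, passing encoding='cp850' for the original file would be the shorter route if the assumption holds.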