Pandas dataframe and character encoding when reading excel file

Views: 265 Posted: 2017/9/7 0:08:13 Tags: python excel character-encoding pandas


Question

I am reading an Excel file that contains several numerical and categorical columns. The column name_string contains characters in a foreign language. When I look at the content of the name_string column, I get the results I want, but the foreign characters (which are displayed correctly in the Excel spreadsheet) are displayed with the wrong encoding. Here is what I have:

import pandas as pd
df = pd.read_excel('MC_simulation.xlsx', 'DataSet', encoding='utf-8')
name_string = df.name_string.unique()
name_string.sort()
name_string

This produces the following:

array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)

In the last line, the correctly encoded name should be Cristina Fernández de Kirchner. Can anybody help me with this issue?

Recommended answer

Actually, the data is being parsed correctly into unicode objects, not strs. The u prefix indicates that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of each item in the sequence. So instead of seeing the printed version of the unicode, you see its repr:

In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
Out[160]: "u'Cristina Fern\\xe1ndez de Kirchner'"

In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
Cristina Fernández de Kirchner
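
The session above is Python 2. As a minimal sketch of the same distinction under Python 3 (an assumption, since the original post does not use it), the snippet below compares the container view with the per-item view. The list `names` is illustrative data taken from the question's output. In Python 3, repr() no longer escapes printable non-ASCII characters, so ascii() is used here to reproduce the escaped form.

```python
# Assumes Python 3, where every str is already Unicode.
names = ['Chuck Norris', 'Cristina Fern\xe1ndez de Kirchner']

# Displaying a container shows the repr() of each item.
container_view = repr(names)

# ascii() escapes non-ASCII characters, similar to what Python 2's repr() shows.
escaped_view = ascii(names[1])

# Printing the items themselves shows their readable str() form.
item_view = '\n'.join(names)

print(container_view)
print(escaped_view)
print(item_view)
```

To inspect the values of an array such as name_string in readable form, printing the items one by one (or joining them as above) avoids the escaped display.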

The purpose of the repr is to provide an unambiguous string representation of each object. The printed version of a unicode can be ambiguous because of invisible or unprintable characters.
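
To illustrate the ambiguity, here is a small sketch, assuming Python 3 and two made-up strings that differ only by a zero-width space (U+200B). Printed, the two values look the same in most terminals, while their reprs make the difference visible.

```python
# Assumes Python 3. The two strings are hypothetical examples.
plain = 'Abab'
hidden = 'Abab\u200b'  # same text followed by an invisible zero-width space

# The printed forms are visually indistinguishable in most terminals.
print(plain)
print(hidden)

# The repr escapes the non-printable character, so the difference shows up.
plain_repr = repr(plain)
hidden_repr = repr(hidden)
print(plain_repr)
print(hidden_repr)

# The strings are not equal even though they print alike.
are_equal = (plain == hidden)
print(are_equal)
```

This is why containers display the repr of their items: a stray invisible character in a column value would otherwise be very hard to spot.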

If you print the DataFrame or Series, however, you'll get the printed version of the unicodes:

In [157]: df = pd.DataFrame({'foo': np.array(
   .....:     [u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
   .....:      u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
   .....:      u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
   .....:      u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})

In [158]: df
Out[158]: 
                               foo
0                      4th of July
1                              911
2                             Abab
3                            Abass
4                            Abcar
5                            Abced
6                            Ceded
7                            Cedes
8                           Cedfus
9                           Ceding
10                          Cedtim
11                          Cedtol
12                          Cedxer
13              Chevrolet Corvette
14                    Chuck Norris
15  Cristina Fernández de Kirchner

[16 rows x 1 columns]
