Pandas DataFrame and character encoding when reading an Excel file

Problem description

I am reading an Excel file that has several numerical and categorical columns. The column name_string contains characters in a foreign language. When I try to see the content of the name_string column, I get the results I want, but the foreign characters (which are displayed correctly in the Excel spreadsheet) are displayed with the wrong encoding. Here is what I have:

import pandas as pd
df = pd.read_excel('MC_simulation.xlsx', 'DataSet', encoding='utf-8')
name_string = df.name_string.unique()
name_string.sort()
name_string

Producing the following:

array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)

In the last line, the correctly encoded name should be Cristina Fernández de Kirchner. Can anybody help me with this issue?

Solution

Actually, the data is being parsed correctly into unicode objects, not strs. The u prefix indicates that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of the items in the sequence. So instead of seeing the printed version of the unicode, you see the repr:

In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
Out[160]: "u'Cristina Fern\\xe1ndez de Kirchner'"

In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
Cristina Fernández de Kirchner
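
The session above is from Python 2. As a minimal sketch of the same idea in Python 3 (an assumption on my part, since the original answer predates it), the snippet below checks that a list's text form is built from the repr of each item, and uses the built-in ascii() to reproduce the escaped \xe1 form that Python 2 displayed. The variable names are illustrative only.

```python
names = ['Chuck Norris', 'Cristina Fern\xe1ndez de Kirchner']

# A list's string form is assembled from the repr() of each item,
# which is why the items appear with quotes around them.
container_text = str(names)
rebuilt = '[' + ', '.join(repr(n) for n in names) + ']'
print(container_text == rebuilt)

# ascii() escapes every non-ASCII character, much like Python 2's
# repr() of a unicode object did.
escaped = ascii(names[1])
print(escaped)

# Printing the items one by one goes through str(), so the accented
# character is shown as itself.
for n in names:
    print(n)
```

In Python 3 the repr of a str already shows printable non-ASCII characters directly, so the escaped output in the question is mostly a Python 2 display detail rather than a sign of a wrong encoding.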

The purpose of the repr is to provide an unambiguous string representation for each object. The printed version of a unicode can be ambiguous because of invisible or unprintable characters.
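
To illustrate the ambiguity, here is a small Python 3 sketch with two strings that usually look identical when printed but differ in one invisible character. The non-breaking space is my own choice of example and does not come from the question's data.

```python
regular = 'Chuck Norris'        # ordinary ASCII space
lookalike = 'Chuck\xa0Norris'   # non-breaking space (U+00A0)

# Printed, the two strings are typically indistinguishable on screen.
print(regular)
print(lookalike)

# The repr escapes the non-printable character, so the difference shows.
print(repr(regular))
print(repr(lookalike))
```

This is the kind of case where seeing the repr is helpful rather than a nuisance.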

If you print the DataFrame or Series, however, you'll get the printed version of the unicodes:

In [157]: df = pd.DataFrame({'foo':np.array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})
   .....:    .....:    .....: 
In [158]: df
Out[158]: 
                               foo
0                      4th of July
1                              911
2                             Abab
3                            Abass
4                            Abcar
5                            Abced
6                            Ceded
7                            Cedes
8                           Cedfus
9                           Ceding
10                          Cedtim
11                          Cedtol
12                          Cedxer
13              Chevrolet Corvette
14                    Chuck Norris
15  Cristina Fernández de Kirchner

[16 rows x 1 columns]
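
If the goal is only to inspect the unique names, one option is to print them individually instead of displaying the array. The sketch below assumes Python 3 with pandas installed and uses a tiny hand-made frame in place of the asker's spreadsheet; the column name name_string comes from the question, the rest is illustrative.

```python
import pandas as pd

# Illustrative stand-in for the name_string column read from the Excel file.
df = pd.DataFrame({
    'name_string': ['Chuck Norris',
                    'Cristina Fern\xe1ndez de Kirchner',
                    'Abab',
                    'Chuck Norris'],
})

# sorted() returns a plain list of the unique values.
names = sorted(df['name_string'].unique())

# print() uses str() on each item, so accented characters appear as-is.
for name in names:
    print(name)

# The DataFrame's text rendering also uses the printed form of each string.
rendered = df.to_string()
print(rendered)
```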
