Pandas dataframe and character encoding when reading excel file

Views: 366 | Published: 2016/11/19 14:45:01 | Tags: python, excel, character-encoding, pandas


Problem description

I am reading an excel file that has several numerical and categorical columns. The column name_string contains characters in a foreign language.
When I look at the content of the name_string column, I get the results I want, but the foreign characters (which are displayed correctly in the excel spreadsheet) are displayed with the wrong encoding.
Here is what I have:
import pandas as pd
df = pd.read_excel('MC_simulation.xlsx', 'DataSet', encoding='utf-8')
name_string = df.name_string.unique()
name_string.sort()
name_string
Producing the following:
array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)
In the last line, the correctly encoded name should be Cristina Fernández de Kirchner. Can anybody help me with this issue? 
Solution
Actually, the data is being parsed correctly into unicode objects, not strs. The u prefix indicates that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of each item in the sequence. So instead of seeing the printed version of the unicode, you see its repr:
In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
Out[160]: "u'Cristina Fern\\xe1ndez de Kirchner'"

In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
Cristina Fernández de Kirchner
The purpose of the repr is to provide an unambiguous string representation of each object. The printed version of a unicode can be ambiguous because of invisible or unprintable characters.
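
The session above is Python 2. As a minimal sketch of the same distinction, assuming Python 3 (where repr no longer escapes printable non-ASCII characters), the built-in ascii() can stand in for the Python 2 repr. The variable names here are illustrative only and are not part of the original answer:

```python
# Sketch for Python 3; the original session above is Python 2.
# The u prefix is optional in Python 3, where every str is Unicode.
name = u'Cristina Fern\xe1ndez de Kirchner'

# ascii() escapes non-ASCII characters, much like the Python 2 repr did.
escaped = ascii(name)

# str() gives the text itself, which is what print() displays.
readable = str(name)

print(escaped)
print(readable)
```

Both values describe the same string; only the way it is displayed differs, so nothing needs to be re-encoded.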

If you print the DataFrame or Series, however, you'll get the printed version of the unicodes:
In [157]: df = pd.DataFrame({'foo': np.array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
   .....:        u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
   .....:        u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
   .....:        u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})
In [158]: df
Out[158]: 
                               foo
0                      4th of July
1                              911
2                             Abab
3                            Abass
4                            Abcar
5                            Abced
6                            Ceded
7                            Cedes
8                           Cedfus
9                           Ceding
10                          Cedtim
11                          Cedtol
12                          Cedxer
13              Chevrolet Corvette
14                    Chuck Norris
15  Cristina Fernández de Kirchner

[16 rows x 1 columns]
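
If the goal is only to read the unique names on screen, two options follow from the explanation above: print the items one at a time, or wrap the array in a Series before displaying it. This is a rough sketch rather than part of the original answer; the two-element array is a shortened stand-in for the question's data, and the names `names` and `series_text` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the array returned by df.name_string.unique().
names = np.array([u'Chuck Norris', u'Cristina Fern\xe1ndez de Kirchner'],
                 dtype=object)

# Printing each item uses its printed form, not its repr.
for name in names:
    print(name)

# A Series renders its values in printed form as well.
series_text = pd.Series(names).to_string()
print(series_text)
```

In both cases the underlying data is untouched; only the display changes.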


                        