Using read_excel with converters for reading an Excel file into a Pandas DataFrame results in a numeric column of object type


Problem description

I am reading the United Nations Energy Indicators Excel file using the code snippet here:

import pandas as pd

def convert_energy(energy):
    if isinstance(energy, float):
        return energy*1000000
    else:
        return energy

def energy_df():
    # skipfooter was named skip_footer in older pandas releases
    return pd.read_excel("Energy Indicators.xls", skiprows=17, skipfooter=38,
                         usecols=[2,3,4,5], na_values=['...'],
                         names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'],
                         converters={1: convert_energy}).set_index('Country')

This results in the Energy Supply column having the object type instead of float. Why is that?

energy = energy_df()
print(energy.dtypes)

Energy Supply                object
Energy Supply per Capita    float64
% Renewable                 float64

Recommended answer

Drop the converters parameter for a moment:

c = ['Energy Supply', 'Energy Supply per Capita', '% Renewable']
df = pd.read_excel("Energy Indicators.xls", 
                   skiprows=17, 
                   skipfooter=38,  # named skip_footer in older pandas releases
                   usecols=[2,3,4,5], 
                   na_values=['...'], 
                   names=c,
                   index_col=[0])

df.index.name = 'Country'

df.head()    
                Energy Supply  Energy Supply per Capita  % Renewable
Country                                                             
Afghanistan             321.0                      10.0    78.669280
Albania                 102.0                      35.0   100.000000
Algeria                1959.0                      51.0     0.551010
American Samoa            NaN                       NaN     0.641026
Andorra                   9.0                     121.0    88.695650

df.dtypes

Energy Supply               float64
Energy Supply per Capita    float64
% Renewable                 float64
dtype: object

Your data loads just fine without a converter. There is a trick to understanding why this happens.

By default, pandas reads in the column and tries to "interpret" your data, inferring a suitable dtype. By specifying your own converter, you override that conversion for the column, so the inference does not happen.

pandas passes integer and string values to convert_energy, so isinstance(energy, float) never evaluates to True. Instead, the else branch runs and the values are returned as is, so the resulting column is a mixture of strings and integers. If you put a print(type(energy)) inside your function, this becomes obvious.
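The behaviour described above can be reproduced without the Excel file. The sketch below is only an illustration: `raw_cells` is a hypothetical stand-in for the values the converter might receive (integers for numeric cells, the string '...' for missing ones), not the actual contents of the spreadsheet.

```python
import pandas as pd

def convert_energy(energy):
    # Same converter as in the question: only floats are scaled.
    if isinstance(energy, float):
        return energy * 1000000
    else:
        return energy

# Hypothetical cell values, assumed for illustration only.
raw_cells = [321, 102, 1959, "...", 9]

converted = [convert_energy(cell) for cell in raw_cells]

# None of the inputs is a float, so every value comes back unchanged.
cell_types = [type(cell).__name__ for cell in converted]
print(cell_types)

# A column holding both ints and strings cannot take a numeric dtype.
column = pd.Series(converted, name="Energy Supply")
print(column.dtype)
```

Because no input is a float, the converter is effectively a no-op here, and the mixed integers and strings force the column to object.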

Since the column holds a mixture of types, the resulting dtype is object. If you do not use a converter, however, pandas will attempt to interpret your data and will successfully parse it as numeric.
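If a column has already ended up as object, one possible way to recover a numeric dtype after loading is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN. This is a sketch on made-up values (the `mixed` Series is an assumption, not data from the file), and it is an alternative to the approach in the answer, not part of it.

```python
import pandas as pd

# Hypothetical mixed column: ints plus the '...' placeholder string.
mixed = pd.Series([321, 102, "...", 9], dtype=object, name="Energy Supply")
print(mixed.dtype)

# Coerce to numeric; anything that cannot be parsed becomes NaN.
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.dtype)
print(numeric.isna().tolist())
```

The coerced column is float64 because NaN cannot be stored in a plain integer column.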

So a simple

df['Energy Supply'] *= 1000000

will be more than enough.
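As a quick check of the scaling step, the sketch below rebuilds a small frame from the five rows shown in the df.head() output above (only the Energy Supply column is included) and applies the same in-place multiplication.

```python
import pandas as pd

# Values copied from the df.head() output shown earlier.
countries = ["Afghanistan", "Albania", "Algeria", "American Samoa", "Andorra"]
df = pd.DataFrame(
    {"Energy Supply": [321.0, 102.0, 1959.0, float("nan"), 9.0]},
    index=pd.Index(countries, name="Country"),
)

# Vectorised scaling: applied to the whole column at once, NaN stays NaN.
df["Energy Supply"] *= 1000000

print(df["Energy Supply"].dtype)
print(df)
```

The column stays float64, and the missing value for American Samoa is left as NaN rather than raising an error.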
