Convert pandas column (containing floats and NaN values) from float64 to nullable int8


Problem description

I have a large dataframe that looks somewhat like this:

    a   b   c
0   2.2 6.0 0.0
1   3.3 7.0 NaN
2   4.4 NaN 3.0
3   5.5 9.0 NaN

Columns b and c contain float values that are either positive natural numbers or NaN. However, they are stored as float64, which is a problem: without going into further detail, this dataframe is the input of a pipeline that requires these columns to be integers, so I want to store them as such. The output should look like this:

    a   b   c
0   2.2 6   0
1   3.3 7   NaN
2   4.4 NaN 3
3   5.5 9   NaN

I read in the pandas documentation that nullable integers are supported only through pandas' own extension dtypes such as "Int8" (note: this is different from np.int8), so naturally, I attempted this:

df = df.astype({'b':pd.Int8Dtype(), 'c':pd.Int8Dtype()})

This works when I run it in my Jupyter notebook, but when I integrate it within a larger function, I get this error:

TypeError: cannot safely cast non-equivalent float64 to int8

I understand why I get the error: x == int(x) is False for NaN values, so the program considers this conversion unsafe, even though all values are either NaN or natural numbers. So next, I tried:

df = df.astype({'b':pd.Int8Dtype(), 'c':pd.Int8Dtype()}, errors='ignore')

I figured that this would get rid of the 'unsafe conversion' problem, since I am 100% sure all float64 values are natural numbers. However, when I use this line, all of my numbers are still stored as floats! Infuriating!

Does anyone have a workaround for this?

Recommended answer

I ran into exactly the same issue, which led me to this page. I do not have a genuinely good solution and am looking for one myself, but I did find a workaround. Before going into that, I would like to respond to the comment posted on the original question: being able to assign NA or even None values to a series of such 'simple' types as int8 is the whole point of making these dtype conversions. It is possible to perform the typical operations such as isna() (and so on) on series of these dtypes (see pd.IntXDtype(), where 'X' stands for the number of bits). The advantage I am after with these dtypes is the memory footprint, e.g.:

In[56]: test_df = pd.Series(np.zeros(1_000_000), dtype=np.float64)

In[57]: test_df.memory_usage()
Out[57]: 8000128

In[58]: test_df = pd.Series(np.zeros(1_000_000), dtype=pd.Int8Dtype())

In[59]: test_df.memory_usage()
Out[59]: 2000128

In[60]: test_df.iloc[:500_000] = None

In[61]: test_df.memory_usage()
Out[61]: 2000128

In[62]: test_df.isna().sum()
Out[62]: 500000

So you get the best of both worlds.

Now for the workaround:

In[33]: my_df
Out[33]: 
     a    s      d
0    0 -500 -1.000
1    1 -499 -0.998
2    2 -498 -0.996
3    3 -497 -0.994
4    4 -496 -0.992

In[34]: my_df.dtypes
Out[34]: 
a      int64
s      int64
d    float64
dtype: object

In[35]: df_converted_to_int_first = my_df.astype(
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': np.float16,
   ...:     },
   ...: )

In[36]: df_converted_to_int_first
Out[36]: 
     a    s         d
0    0 -500 -1.000000
1    1 -499 -0.998047
2    2 -498 -0.996094
3    3 -497 -0.994141
4    4 -496 -0.992188

In[37]: df_converted_to_int_first.dtypes
Out[37]: 
a       int8
s      int16
d    float16
dtype: object

In[38]: df_converted_to_special_int_after = df_converted_to_int_first.astype(
   ...:     dtype={
   ...:         'a': pd.Int8Dtype(),
   ...:         's': pd.Int16Dtype(),
   ...:     }
   ...: )

In[39]: df_converted_to_special_int_after.dtypes
Out[39]: 
a       Int8
s      Int16
d    float16
dtype: object

In[40]: df_converted_to_special_int_after.loc[3, 'a'] = None

In[41]: df_converted_to_special_int_after
Out[41]: 
       a     s         d
0      0  -500 -1.000000
1      1  -499 -0.998047
2      2  -498 -0.996094
3   <NA>  -497 -0.994141
4      4  -496 -0.992188

This is still not an acceptable solution in my opinion, but as mentioned above, it constitutes a workaround, which is what the original question asked for.
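For comparison, here is a minimal sketch of the direct cast on the frame from the question. It assumes every non-missing value in b and c is a whole number within the int8 range (-128 to 127). In that case pandas appears to accept the float64 to Int8 cast without an intermediate NumPy integer step, and NaN becomes <NA>. The frame is rebuilt by hand from the sample in the question.

```python
import numpy as np
import pandas as pd

# The frame from the question: b and c hold whole numbers or NaN, stored as float64.
df = pd.DataFrame({
    "a": [2.2, 3.3, 4.4, 5.5],
    "b": [6.0, 7.0, np.nan, 9.0],
    "c": [0.0, np.nan, 3.0, np.nan],
})

# Cast only the two columns that should become nullable integers.
converted = df.astype({"b": pd.Int8Dtype(), "c": pd.Int8Dtype()})

print(converted.dtypes)
print(converted)
```

If this cast fails on the real data, the likely cause is a value that is not a whole number or does not fit in int8, rather than the NaN values themselves.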

EDIT: a test that was missing, from np.float64 to pd.Int8Dtype():

In[67]: my_df.astype(
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': np.int16,
   ...:     },
   ...: ).astype(    
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': pd.Int8Dtype(),
   ...:     },
   ...: ).dtypes

Out[67]: 
a     int8
s    int16
d     Int8
dtype: object
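Note that going through np.int16 first truncates any fractional part (the d column above loses its decimals), so it can hide bad data. As far as I can tell, the "cannot safely cast non-equivalent float64 to int8" error is raised when some non-missing value does not survive the round trip to int8, for example 2.5 or 300.0. A sketch of a check that reports such values before casting is below. The helper name `to_nullable_int8` is hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd


def to_nullable_int8(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast a float Series to Int8, or explain why it cannot."""
    present = series.dropna()
    limits = np.iinfo(np.int8)

    # Values with a fractional part cannot be represented as integers.
    not_whole = present[present != np.floor(present)]
    # Values outside [-128, 127] would overflow int8 (this also catches inf).
    out_of_range = present[(present < limits.min) | (present > limits.max)]

    if len(not_whole) or len(out_of_range):
        raise ValueError(
            "cannot cast to Int8: "
            f"non-whole values at index {list(not_whole.index)}, "
            f"out-of-range values at index {list(out_of_range.index)}"
        )
    return series.astype(pd.Int8Dtype())


clean = to_nullable_int8(pd.Series([6.0, 7.0, np.nan, 9.0]))
print(clean.dtype)
```

Calling the helper on a column containing 2.5 or 300.0 raises a ValueError that lists the offending index labels, which makes it easier to see what differs between the notebook data and the data inside the larger function.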
