pandas cut(): how to convert NaNs? Or how to convert the output to non-categorical?


Question

I am using pandas.cut() on dataframe columns that contain NaNs. I need to run groupby on the output of pandas.cut(), so I need to convert the NaNs to something else (in the output, not in the input data); otherwise groupby will stupidly and infuriatingly ignore them.

I understand that cut() now outputs categorical data, but I cannot find a way to add a category to the output. I have tried add_categories(), which runs without warnings or errors but doesn't work: the category is not added and, indeed, fillna fails for this very reason. A minimal example is below.

Any ideas?

Or is there maybe an easy way to convert this categorical object to a non-categorical one? I have tried np.asarray(), but with no luck: it becomes an array containing Interval objects.

import pandas as pd
import numpy as np

x = [np.nan, 4, 6]
intervals = [-np.inf, 4, np.inf]
out_nolabels = pd.cut(x, intervals)
out_labels = pd.cut(x, intervals, labels=['<=4', '>4'])
# The return values are discarded here, so no category is actually added
out_nolabels.add_categories(['missing'])
out_labels.add_categories(['missing'])

print(out_labels)
print(out_nolabels)

# fillna raises here, because 'missing' is not one of the categories
out_labels = out_labels.fillna('missing')
out_nolabels = out_nolabels.fillna('missing')

PS: This is yet another question on how pandas is the worst tool for handling missing data. It's like a group of people got together and thought: how can we make life harder for those who are stupid enough to analyse data with Python and pandas? I know, let's remove NaNs from groupby, without even a warning!

Recommended answer

As the documentation says, out-of-bounds data are treated as NA in the resulting Categorical object, so you can't use fillna with an arbitrary constant on categorical data, because the value you are filling with is not one of its categories:

Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Categorical object.

You can't use x.fillna('missing') because 'missing' is not one of the categories of x, but you can use x.fillna('>4') because '>4' is a category.
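A minimal sketch of that behaviour, reusing the bins and labels from the question. The exception type raised for an unknown fill value has differed between pandas versions, so catching both TypeError and ValueError is an assumption made here for illustration, not documented behaviour.

```python
import numpy as np
import pandas as pd

x = pd.cut([np.nan, 4, 6], [-np.inf, 4, np.inf], labels=['<=4', '>4'])

# Filling with a value that is already a category works.
filled_ok = x.fillna('>4')

# Filling with a value outside the categories is rejected.
try:
    x.fillna('missing')
    rejected = False
except (TypeError, ValueError):
    rejected = True
```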

We can use np.where here to get around that:

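# df is the same dataframe used in the groupby example further down
df = pd.DataFrame({'id': [1, 1, 1, 4, np.nan, np.nan], 'value': [4, 5, 6, 7, 8, 1]})
intervals = [-np.inf, 4, np.inf]
x = pd.cut(df['id'], intervals, labels=['<=4', '>4'])

np.where(x.isnull(), 'missing', x)
array(['<=4', '<=4', '<=4', '<=4', 'missing', 'missing'], dtype=object)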
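If a plain, non-categorical result is what is wanted, another option is to cast the cut output to object dtype before filling. This is a sketch rather than part of the original answer; the column name and bins simply follow the example in this answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 4, np.nan, np.nan]})
intervals = [-np.inf, 4, np.inf]

binned = pd.cut(df['id'], intervals, labels=['<=4', '>4'])
# Casting to object drops the categorical dtype, so any fill value is allowed.
plain = binned.astype(object).fillna('missing')
```

The result is an ordinary object-dtype Series, so it can be used as a groupby key without any category bookkeeping.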

Or add the category first with add_categories:

x = pd.cut(df['id'],intervals, labels=['<=4','>4']).values.add_categories('missing')
x.fillna('missing')

[<=4, <=4, <=4, <=4, missing, missing]
Categories (3, object): [<=4 < >4 < missing]
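When the cut output stays in a Series, the same idea can be written with the .cat accessor, and the filled column can then be used as a groupby key. A sketch, assuming the data look like the dataframe used in this answer; observed=True is passed only to keep the unused '>4' category out of the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 4, np.nan, np.nan],
                   'value': [4, 5, 6, 7, 8, 1]})
intervals = [-np.inf, 4, np.inf]

binned = pd.cut(df['id'], intervals, labels=['<=4', '>4'])
# add_categories returns a new object, so the result must be assigned.
binned = binned.cat.add_categories('missing').fillna('missing')

# Group on the filled bins; the former NaN rows now form the 'missing' group.
means = df.groupby(binned, observed=True)['value'].mean()
```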

If you want to group the NaNs as well, one way of doing it is to cast the grouping key to str. For example, if you have a dataframe:

df = pd.DataFrame({'id':[1,1,1,4,np.nan,np.nan],'value':[4,5,6,7,8,1]})

df.groupby(df.id.astype(str)).mean()

Output:


     id  value
id             
1.0  1.0    5.0
4.0  4.0    7.0
nan  NaN    4.5
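On pandas 1.1 or later, groupby also accepts dropna=False, which keeps NaN keys as a group of their own without casting. This is an addition to the original answer, sketched on the same dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 4, np.nan, np.nan],
                   'value': [4, 5, 6, 7, 8, 1]})

# dropna=False keeps the rows whose key is NaN as a separate group.
means = df.groupby('id', dropna=False)['value'].mean()
```

Unlike the str cast, this leaves the group keys as floats, with NaN as the last key.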
