Pandas Categorical data type not behaving as expected

Question

I have the pandas (version 0.15.2) DataFrame below. After creating it, I want to turn the code column into an ordered variable of type Categorical, as shown.

import numpy as np  # needed for np.random.randn below
import pandas as pd

df = pd.DataFrame({'id': range(1, 9),
                   'code': ['one', 'one', 'two', 'three',
                            'two', 'three', 'one', 'two'],
                   'amount': np.random.randn(8)},
                  columns=['id', 'code', 'amount'])

df.code = df.code.astype('category')
>> 0      one
>> 1      one
>> 2      two
>> 3    three
>> 4      two
>> 5    three
>> 6      one
>> 7      two
>> Name: code, dtype: category
>> Categories (3, object): [one < three < two]

So this works, but only partially: I cannot impose the order. Everything below is demonstrated on the documentation web page, yet each call raises an error for me:

df.code = df.code.astype('category', categories=['one','two','three'], ordered=True)
>> error: astype() got an unexpected keyword argument 'categories'

Even these fail:

df.code.ordered
>> error: 'Series' object has no attribute 'ordered'
df.code.categories
>> error: 'Series' object has no attribute 'categories'

1) This is annoying. I cannot even get the categories (levels) of my Categorical variable. Am I doing something wrong, or is the web documentation out of date or inconsistent?

2) Also, do you know whether the Categorical type has a notion of distance, i.e. does pandas know that, based on the ordering above, one is closer to two than to three? I plan to use this for (dis)similarity calculations.

Recommended answer

Here's a short example with an ordered categorical variable and (to me) a surprising result from using rank() as a sort of distance measure:

df = pd.DataFrame({ 'code':['one','two','three','one'], 'num':[1,2,3,1] }) 
df.code = df.code.astype('category', categories=['one','two','three'], ordered=True)

    code  num
0    one    1
1    two    2
2  three    3
3    one    1

df.sort('code')

    code  num
0    one    1
3    one    1
1    two    2
2  three    3
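
For readers on a more recent pandas release, where astype() no longer accepts the categories and ordered keywords and DataFrame.sort() is gone, the same steps can be sketched roughly as follows. This is a minimal sketch that assumes a pandas version providing CategoricalDtype and sort_values(); the names code_dtype and sorted_df are only illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'code': ['one', 'two', 'three', 'one'],
                   'num': [1, 2, 3, 1]})

# Declare the category order explicitly instead of passing keywords to astype().
code_dtype = CategoricalDtype(categories=['one', 'two', 'three'], ordered=True)
df['code'] = df['code'].astype(code_dtype)

# sort_values() takes the place of the old DataFrame.sort() and
# follows the declared category order rather than alphabetical order.
sorted_df = df.sort_values('code')
print(sorted_df)
```
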

So sort() works as expected, in the order specified. But rank() doesn't do what I would have guessed: it ranks lexicographically and ignores the ordering of the categorical variable.

 df.sort('code').rank()

   code  num
0   1.5  1.5
3   1.5  1.5
1   4.0  3.0
2   3.0  4.0

All of which is perhaps a longer way of asking: maybe you just want an integer type? You could make up some kind of distance function here after sorting, but ultimately that is going to be a lot more work than using a standard int or float (and possibly problematic, given how rank() handles an ordered categorical).
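
One way to get at such an integer representation without giving up the categorical is to read each value's position in the category list from the .cat.codes attribute. The sketch below assumes a recent pandas release and treats neighbouring levels as equally spaced, which is an assumption and not something pandas itself defines. The names levels, position and dist_to_first are illustrative only.

```python
import pandas as pd

levels = ['one', 'two', 'three']
code = pd.Series(pd.Categorical(['one', 'two', 'three', 'one'],
                                categories=levels, ordered=True))

# .cat.codes holds each value's integer position in the category list
# (missing values would get the code -1).
position = code.cat.codes.astype('int64')

# Hypothetical distance: absolute difference in position from the first row.
dist_to_first = (position - position.iloc[0]).abs()
print(dist_to_first.tolist())
```

Under that assumption, 'one' comes out one step from 'two' and two steps from 'three'.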

Edit to add: part of the above may not work in pandas 0.15.2, but I believe you can still do this to specify the order:

df['code'].cat.categories = ['one','two','three']
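
A caveat on the assignment above: as far as I can tell from the pandas documentation, assigning to .cat.categories renames the existing categories position by position, so it can change which label each row carries, and recent releases no longer seem to allow the assignment at all. A hedged sketch of reordering without touching the values, assuming a pandas version that has Series.cat.reorder_categories(), might look like this. It also shows that categories and ordered are reached through the .cat accessor and not directly on the Series.

```python
import pandas as pd

s = pd.Series(['one', 'one', 'two', 'three']).astype('category')

# Categories inferred from strings come out sorted: one, three, two.
print(list(s.cat.categories))

# reorder_categories() changes the category order and leaves the values alone.
s = s.cat.reorder_categories(['one', 'two', 'three'], ordered=True)

# Inspect the result through the .cat accessor.
print(list(s.cat.categories), s.cat.ordered)
print(s.tolist())
```
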

As I understand it, in 0.15.2 ordered will be True by default (but False in version 0.16.0), and the order will be lexicographical rather than as specified in the constructor. I'm not sure, though, and I am working in 0.16.0, so you'll have to observe how your version behaves. Remember that Categorical is still fairly new...
