How to specify a variable in pandas as ordinal/categorical?


Question

I am trying to run some machine learning algorithms on a dataset using scikit-learn. My dataset has some features that are like categories. For example, one feature, A, takes the values 1, 2, 3 to specify the quality of something (1: Upper, 2: Second, 3: Third class), so it is an ordinal variable.

Similarly, I re-coded a variable City, which has three values ('London', 'Zurich', 'New York'), into 1, 2, 3, but with no specific preference among the values. So this is a nominal categorical variable.

How do I tell the algorithm to treat these as categorical, ordinal, etc. in pandas? In R, a categorical variable is specified with factor(a) and hence is not treated as a continuous value. Is there anything like that in pandas/Python?

Recommended answer

... years later (and because I think a good explanation of these issues is needed, not only for this question but also to remind myself in the future)

In general, one translates categorical variables into dummy variables (or uses one of a host of other methods) when they are nominal, i.e. they have no sense of a > b > c. In the OP's original question, this would only be done for the cities: London, Zurich, New York.

For this type of issue, pandas provides by far the easiest transformation: pandas.get_dummies. So:

import numpy
import pandas

# create a sample of the OP's unique values
series = pandas.Series(
    numpy.random.randint(low=0, high=3, size=100))
mapper = {0: 'New York', 1: 'London', 2: 'Zurich'}
nomvar = series.replace(mapper)

# now let's use pandas.get_dummies
# dtype=int requests 0/1 columns (newer pandas defaults to booleans)
print(pandas.get_dummies(nomvar, dtype=int))

Out[57]:
    London  New York  Zurich
0        0         0       1
1        0         1       0
2        0         1       0
3        1         0       0
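In practice the nominal column usually sits inside a DataFrame next to other features. The following is a minimal sketch of that case, assuming a hypothetical frame whose column names ("City" and "A") are borrowed from the question; the `columns` and `dtype` arguments are standard `pandas.get_dummies` parameters.

```python
import pandas

# Hypothetical frame; the column names follow the question.
df = pandas.DataFrame({
    "City": ["London", "Zurich", "New York", "London"],
    "A": [1, 3, 2, 1],
})

# One-hot encode only the nominal column; "A" passes through unchanged.
# dtype=int requests 0/1 columns (newer pandas defaults to booleans).
encoded = pandas.get_dummies(df, columns=["City"], dtype=int)
print(encoded)
```

Each row ends up with exactly one of the City_* columns set to 1, so no artificial ordering between cities is introduced.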

Ordinal encoding of categorical variables

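For ordinal variables, however, the user must be cautious with pandas.factorize. The reason is that the engineer wants the mapping to preserve the relationship a > b > c.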

So if I want to take a set of categorical variables where large > medium > small and preserve that ordering, I need to make sure that pandas.factorize preserves the relationship.

# leveraging the variables already created above
mapper = {0: 'small', 1: 'medium', 2: 'large'}
ordvar = series.replace(mapper)

print(pandas.factorize(ordvar))

Out[58]:
(array([0, 1, 1, 2, 1,...  0, 0]),
Index(['large', 'small', 'medium'], dtype='object'))
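Because the output above comes from random data, here is a small deterministic sketch of the same behaviour. The five-element `sizes` series is made up for illustration. By default the codes follow the order of first appearance, and `sort=True` only switches to alphabetical order, so neither gives small < medium < large.

```python
import pandas

sizes = pandas.Series(["large", "small", "medium", "small", "large"])

# Default: codes follow the order of first appearance.
codes, uniques = pandas.factorize(sizes)
print(codes)          # [0 1 2 1 0]
print(list(uniques))  # ['large', 'small', 'medium']

# sort=True orders the labels alphabetically, which is still not
# small < medium < large.
sorted_codes, sorted_uniques = pandas.factorize(sizes, sort=True)
print(sorted_codes)          # [0 2 1 2 0]
print(list(sorted_uniques))  # ['large', 'medium', 'small']
```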

In fact, pandas.factorize has lost the relationship that must be preserved to keep the variable ordinal: the codes follow the order in which the labels first appear, not their rank. In an instance like this, I use my own mapping to ensure that the ordinal attributes are preserved.

preserved_mapper = {'large': 2, 'medium': 1, 'small': 0}
print(ordvar.replace(preserved_mapper))

Out[78]:
0     2
1     0
...
99    2
dtype: int64
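An alternative to a hand-written dict, and the closest pandas counterpart to R's factor that the question asks about, is an ordered categorical dtype. This is a different technique from the mapping above, sketched here with a made-up `sizes` series; `pandas.CategoricalDtype` and the `.cat.codes` accessor are part of the pandas API.

```python
import pandas

sizes = pandas.Series(["large", "small", "medium", "small", "large"])

# Declare the order explicitly, lowest to highest.
size_dtype = pandas.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True)
ordered = sizes.astype(size_dtype)

# Integer codes follow the declared order: small=0, medium=1, large=2.
print(ordered.cat.codes.tolist())  # [2, 0, 1, 0, 2]

# Comparisons respect the ordering as well.
print((ordered > "small").tolist())  # [True, False, True, False, True]
```

The codes can be passed on as integers, while the series itself keeps the readable labels.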

In fact, creating your own dict to map the values not only preserves the desired ordinal relationship but also keeps the contents and mappings of your prediction algorithm organized: you lose no ordinal information in the process, and you have a stored record of what each mapping for each variable is.
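One way to keep such records is a single registry of mappings, one per ordinal column. The sketch below is only an illustration: the name `ORDINAL_MAPPINGS`, the example frame, and the label spellings for A are assumptions, and `Series.map` is used in place of `replace` so that unknown labels show up as NaN.

```python
import pandas

# Hypothetical registry: one explicit mapping per ordinal column.
ORDINAL_MAPPINGS = {
    "size": {"small": 0, "medium": 1, "large": 2},
    "A": {"Third": 0, "Second": 1, "Upper": 2},
}

df = pandas.DataFrame({
    "size": ["large", "small", "medium"],
    "A": ["Upper", "Third", "Second"],
})

encoded = df.copy()
for column, mapping in ORDINAL_MAPPINGS.items():
    # Series.map leaves NaN for any label missing from the mapping,
    # which makes unexpected values easy to spot.
    encoded[column] = df[column].map(mapping)

# The same registry decodes the integers back into labels.
inverse_size = {code: label
                for label, code in ORDINAL_MAPPINGS["size"].items()}
decoded = encoded["size"].map(inverse_size)

print(encoded)
print(decoded.tolist())  # ['large', 'small', 'medium']
```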

Lastly, the OP spoke about passing the information into scikit-learn classifiers, which means that ints are required. In that case, make sure you are aware of the astype(int) gotcha if you have any NaNs in your data.
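The gotcha can be reproduced in a few lines. This is a sketch on a made-up three-element series, not the linked write-up: NaN has no integer representation, so the plain cast raises, and two possible workarounds follow. Whether a given estimator accepts pandas' nullable Int64 dtype is something to check for your scikit-learn version.

```python
import numpy
import pandas

values = pandas.Series([2.0, numpy.nan, 0.0])

failed = False
try:
    values.astype(int)
except ValueError as error:
    # NaN has no integer representation, so the cast is refused.
    failed = True
    print("astype(int) failed:", error)

# Option 1: drop the incomplete rows before casting.
print(values.dropna().astype(int).tolist())  # [2, 0]

# Option 2: pandas' nullable integer dtype keeps the gap as <NA>.
nullable = values.astype("Int64")
print(nullable.isna().tolist())  # [False, True, False]
```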
