LabelEncoder order of fit for a Pandas df

Question

I am fitting a scikit-learn LabelEncoder on a column in a pandas df.

How is the order, in which the encountered strings are mapped to the integers, determined? Is it deterministic?

More importantly, can I specify this order?

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(data=["first", "second", "third", "fourth"], columns=['x'])
le = preprocessing.LabelEncoder()
le.fit(df['x'])
print(list(le.classes_))
# this prints ['first', 'fourth', 'second', 'third']
encoded = le.transform(["first", "second", "third", "fourth"])
print(encoded)
# this prints [0 2 3 1]

I would expect le.classes_ to be ["first", "second", "third", "fourth"] and then encoded to be [0 1 2 3], since this is the order in which the strings appear in the column. Can this be done?

Recommended answer

It's done in sort order. In the case of strings, it is done in alphabetic order. There's no documentation for this, but looking at the source code for LabelEncoder.transform we can see the work is mostly delegated to the function numpy.setdiff1d, with the following documentation:

Find the set difference of two arrays.

Return the sorted, unique values in ar1 that are not in ar2.

(Emphasis on "sorted" is mine.)
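
The sorted behavior is easy to check directly in NumPy. The sketch below is only an illustration using the strings from the question. It does not reproduce scikit-learn's internals, and the use of np.unique here is an assumption about how distinct labels are typically collected, not something confirmed above.

```python
import numpy as np

values = np.array(["first", "second", "third", "fourth"])

# np.unique returns the distinct values in sorted (here: alphabetical) order
unique_sorted = np.unique(values)
print(unique_sorted)  # ['first' 'fourth' 'second' 'third']

# np.setdiff1d likewise returns a sorted result
remaining = np.setdiff1d(values, np.array(["second"]))
print(remaining)  # ['first' 'fourth' 'third']
```

Either call loses the order in which the strings appeared in the input, which matches the output seen in the question.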

Note that since this is not documented, it is probably implementation-defined and may change between versions. It may be that only the version I looked at uses sort order, and other versions of scikit-learn could change this behavior (for example, by no longer using numpy.setdiff1d).
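
As for specifying the order yourself, one possible workaround is to skip LabelEncoder and let pandas assign the codes. This is a sketch under assumptions rather than part of the original answer: pd.factorize numbers values in order of first appearance, and pd.Categorical accepts an explicit category order (the `order` list below is a hypothetical choice made for illustration).

```python
import pandas as pd

df = pd.DataFrame(data=["first", "second", "third", "fourth"], columns=["x"])

# Option 1: integer codes assigned in order of first appearance
codes, uniques = pd.factorize(df["x"])
print(list(uniques))  # ['first', 'second', 'third', 'fourth']
print(codes)          # [0 1 2 3]

# Option 2: codes assigned according to an explicitly chosen order
order = ["first", "second", "third", "fourth"]  # hypothetical, user-defined
cat = pd.Categorical(df["x"], categories=order, ordered=True)
print(cat.codes)      # [0 1 2 3]
```

With the second option, values missing from `order` get the code -1, so it is worth checking for that before using the codes downstream.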
