Want to know the diff among pd.factorize, pd.get_dummies, sklearn.preprocessing.LabelEncoder and OneHotEncoder
Problem description
All four functions seem really similar to me. In some situations some of them might give the same result, some not. Any help will be greatly appreciated!
Now I know, and I assume that internally, factorize and LabelEncoder work the same way and have no big differences in terms of results. I am not sure whether they take a similar amount of time on large amounts of data.

get_dummies and OneHotEncoder will yield the same result, but OneHotEncoder can only handle numbers, while get_dummies will take all kinds of input. get_dummies will generate new column names automatically for each input column, but OneHotEncoder will not (rather, it will assign new column names 1, 2, 3, ...). So get_dummies is better in all respects.

Please correct me if I am wrong! Thank you!
These four encoders can be split in two categories:

- Encode labels into categorical variables: pandas factorize and scikit-learn LabelEncoder. The result will have 1 dimension.
- Encode categorical variables into dummy/indicator (binary) variables: pandas get_dummies and scikit-learn OneHotEncoder. The result will have n dimensions, one per distinct value of the encoded categorical variable.

The main difference between pandas and scikit-learn encoders is that scikit-learn encoders are made to be used in scikit-learn pipelines, with fit and transform methods.
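To make that difference concrete, here is a minimal sketch (the toy data and variable names are mine, not from the answer) of the fit/transform split: the mapping is learned once on training data and then reused on new data, something pandas' factorize and get_dummies do not offer directly.

```python
# Minimal sketch of the fit/transform split that scikit-learn encoders provide.
from sklearn.preprocessing import LabelEncoder

train = ['A', 'B', 'B', 'C']   # data the encoder learns the mapping from
test = ['C', 'A']              # new data, encoded with the SAME mapping

le = LabelEncoder()
le.fit(train)                  # learns A -> 0, B -> 1, C -> 2
codes = le.transform(test)     # reuses the mapping
print(codes)                   # [2 0]
```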
Encode labels into categorical variables
Pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables, for example to transform characters into numbers.
import pandas as pd
from pandas import DataFrame
from sklearn import preprocessing

# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df['Fact'] = pd.factorize(df['Col'])[0]
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])
print(df)
# Col Fact Lab
# 0 A 0 0
# 1 B 1 1
# 2 B 1 1
# 3 C 2 2
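One practical difference worth adding as a caveat (my own toy example, not from the answer above): factorize assigns codes in order of appearance, while LabelEncoder sorts the labels first, so the two can disagree on unsorted input.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = ['B', 'A', 'C']                         # note: not alphabetically sorted
codes, uniques = pd.factorize(data)            # codes by order of appearance
le_codes = LabelEncoder().fit_transform(data)  # codes by sorted label order
print(codes)     # [0 1 2]  (B seen first, so B -> 0)
print(le_codes)  # [1 0 2]  (sorted: A -> 0, B -> 1, C -> 2)
```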
Encode categorical variable into dummy/indicator (binary) variables
Pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They can be used to create binary variables. OneHotEncoder can only be used with categorical integers, while get_dummies can be used with other types of variables.
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df = pd.get_dummies(df)
print(df)
# Col_A Col_B Col_C
# 0 1.0 0.0 0.0
# 1 0.0 1.0 0.0
# 2 0.0 1.0 0.0
# 3 0.0 0.0 1.0
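get_dummies also accepts a plain Series and has a few convenient options; for instance (a sketch with my own toy data), drop_first=True keeps n-1 columns to avoid perfectly collinear dummies:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'B', 'C'])
full = pd.get_dummies(s, prefix='Col')                      # Col_A, Col_B, Col_C
reduced = pd.get_dummies(s, prefix='Col', drop_first=True)  # Col_B, Col_C only
print(list(full.columns), list(reduced.columns))
```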
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# We need to transform the characters into integers first in order to use the OneHotEncoder
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])
enc = OneHotEncoder()
df = DataFrame(enc.fit_transform(df).toarray())
print(df)
# 0 1 2
# 0 1.0 0.0 0.0
# 1 0.0 1.0 0.0
# 2 0.0 1.0 0.0
# 3 0.0 0.0 1.0
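A note beyond the answer above (this assumes scikit-learn 0.20 or newer, which is an addition of mine): OneHotEncoder can now encode strings directly, so the intermediate LabelEncoder step is no longer necessary.

```python
from pandas import DataFrame
from sklearn.preprocessing import OneHotEncoder

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
enc = OneHotEncoder()                           # no LabelEncoder step needed
out = enc.fit_transform(df[['Col']]).toarray()  # categories sorted: A, B, C
print(out)
```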