Encoding string features in pandas
Question
I have the following dataframe:
train_df
'type', 'manufacturer', 'year', 'num_doors'
sedan, bmw, 2012, 4
couple, audi, 2014, 2
and so on
and test_df in a similar format.
All the features are categorical (some strings, some ints), and I want to encode them as categorical variables.
What's a good way to handle these categorical variables in pandas/sklearn? Also, once the transformation has been applied to train_df, I want to encode test_df using the same encodings.
Recommended answer
When reading your data, specify dtype to be 'category' to make every single column categorical.
import pandas as pd
df = pd.read_csv('file.csv', dtype='category')
df
type manufacturer year num_doors
0 sedan bmw 2012 4
1 couple audi 2014 2
df.dtypes
type category
manufacturer category
year category
num_doors category
dtype: object
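For readers without a file.csv at hand, the same call can be tried on an in-memory copy of the sample rows. This is only a sketch: the CSV text is assembled from the question's example, and the extra step of pulling integer codes through the .cat.codes accessor is an illustration added here, not part of the original answer.

```python
import io

import pandas as pd

# In-memory stand-in for file.csv, built from the sample rows in the question.
csv_text = """type,manufacturer,year,num_doors
sedan,bmw,2012,4
couple,audi,2014,2
"""

# dtype='category' makes every column categorical, including year and num_doors.
df = pd.read_csv(io.StringIO(csv_text), dtype="category")

# Each categorical column carries integer codes that index into its categories.
codes = pd.DataFrame({col: df[col].cat.codes for col in df.columns})

print(df.dtypes)
print(codes)
```

The codes frame has one small integer per cell, which is often what a downstream model expects as input.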
If you want to convert only a specific subset of columns, something like this would do:
f = dict.fromkeys(['type', 'manufacturer', ...], 'category')
Pass f to dtype:
df = pd.read_csv('file.csv', dtype=f)
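The second part of the question, encoding test_df with the encodings learned from train_df, is not covered above. One possible approach, sketched here under assumptions, is to cast each test column to the categorical dtype of the matching train column. The test rows and the cat_cols list below are made up for illustration; a test value that never appeared in train (such as 'suv') becomes NaN and gets the code -1.

```python
import io

import pandas as pd

# Hypothetical data: train rows come from the question, test rows are invented.
train_csv = """type,manufacturer,year,num_doors
sedan,bmw,2012,4
couple,audi,2014,2
"""
test_csv = """type,manufacturer,year,num_doors
sedan,audi,2014,4
suv,bmw,2012,2
"""

# Only these columns are read as categorical; the rest keep their inferred dtype.
cat_cols = ["type", "manufacturer"]
f = dict.fromkeys(cat_cols, "category")

train_df = pd.read_csv(io.StringIO(train_csv), dtype=f)
test_df = pd.read_csv(io.StringIO(test_csv))

# Reuse the categories learned from train so the codes agree across both frames.
for col in cat_cols:
    test_df[col] = test_df[col].astype(train_df[col].dtype)

# Integer codes for the test frame; unseen values are encoded as -1.
test_codes = pd.DataFrame({col: test_df[col].cat.codes for col in cat_cols})
print(test_codes)
```

Casting to the train dtype keeps the mapping from category to code identical in both frames, at the cost of mapping unseen test values to missing.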