Encoding string features in pandas


Question

I have a dataframe like the following:

train_df
'type', 'manufacturer', 'year', 'num_doors'
sedan, bmw, 2012, 4
couple, audi, 2014, 2
and so on

test_df has a similar format. All the features are categorical (some strings, some ints), and I want to encode them as categorical variables.

What's a good way to handle these categorical variables in pandas/sklearn? Also, once the transformation has been applied to train_df, I want to encode test_df using the same encodings.

Answer

When reading your data, specify dtype='category' to make every column categorical.

import pandas as pd

df = pd.read_csv('file.csv', dtype='category')
df

     type manufacturer  year num_doors
0   sedan          bmw  2012         4
1  couple         audi  2014         2

df.dtypes

type            category
manufacturer    category
year            category
num_doors       category
dtype: object
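
Once the columns are categorical, the integer encoding is available through the `.cat` accessor. The following is a minimal sketch, not part of the original answer: the in-memory CSV text is a hypothetical stand-in for 'file.csv', and the variable names are illustrative.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for 'file.csv' (assumption).
csv_text = (
    "type,manufacturer,year,num_doors\n"
    "sedan,bmw,2012,4\n"
    "couple,audi,2014,2\n"
)

df = pd.read_csv(io.StringIO(csv_text), dtype="category")

# Each categorical column keeps its distinct values as its categories.
type_categories = list(df["type"].cat.categories)

# .cat.codes gives, for every row, the integer position of its value
# within that column's categories.
codes = df.apply(lambda col: col.cat.codes)
print(codes)
```

Here `codes` is an integer dataframe with the same shape as `df`, which is usually what a downstream model expects.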

If you want to convert only a specific subset of columns, something like this would do:

f = dict.fromkeys(['type', 'manufacturer', ...], 'category')

Then pass f to dtype:

df = pd.read_csv('file.csv', dtype=f)
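
The answer above does not cover the second part of the question, encoding test_df with the same encodings as train_df. One possible approach, sketched here under assumptions (the toy frames and the "suv" value are hypothetical and do not come from the original post), is to cast each test column to the categorical dtype learned from the matching train column:

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df and test_df (assumption).
train_df = pd.DataFrame({
    "type": ["sedan", "couple", "sedan"],
    "manufacturer": ["bmw", "audi", "audi"],
})
test_df = pd.DataFrame({
    "type": ["couple", "suv"],  # "suv" never appears in train_df
    "manufacturer": ["bmw", "audi"],
})

# Learn the categories from the training data.
train_df = train_df.astype("category")

# Reuse each train column's categorical dtype so the codes line up.
for col in train_df.columns:
    test_df[col] = test_df[col].astype(train_df[col].dtype)

train_codes = train_df.apply(lambda s: s.cat.codes)
test_codes = test_df.apply(lambda s: s.cat.codes)
print(test_codes)
```

Values in test_df that were never seen in train_df become missing and receive the code -1, so they should be handled explicitly before being fed to a model.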
