How to convert categorical data to numerical data?

Question

I have a feature, city, which is categorical data (i.e. strings). Instead of hardcoding the mapping with replace(), is there a smarter approach?

train['city'].unique()
Output: ['city_149', 'city_83', 'city_16', 'city_64', 'city_100', 'city_21',
       'city_114', 'city_103', 'city_97', 'city_160', 'city_65',
       'city_90', 'city_75', 'city_136', 'city_159', 'city_67', 'city_28',
       'city_10', 'city_73', 'city_76', 'city_104', 'city_27', 'city_30',
       'city_61', 'city_99', 'city_41', 'city_142', 'city_9', 'city_116',
       'city_128', 'city_74', 'city_69', 'city_1', 'city_176', 'city_40',
       'city_123', 'city_152', 'city_165', 'city_89', 'city_36', .......]

What I tried:

train.replace(['city_149', 'city_83', 'city_16', 'city_64', 'city_100', 'city_21',
           'city_114', 'city_103', 'city_97', 'city_160', 'city_65',
           'city_90', 'city_75', 'city_136', 'city_159', 'city_67', 'city_28',
           'city_10', 'city_73', 'city_76', 'city_104', 'city_27', 'city_30',
           'city_61', 'city_99', 'city_41', 'city_142', 'city_9', 'city_116',
           'city_128', 'city_74', 'city_69', 'city_1', 'city_176', 'city_40',
           'city_123', 'city_152', 'city_165', 'city_89', 'city_36', .......], [1,2,3,4,5,6,7,8,9....], inplace=True)

Is there a better way to convert this column to numerical values? There are 123 unique values, so with replace() I would have to hardcode the numbers 1, 2, 3, 4, ... 123 by hand. Please suggest a better way to do the conversion.

Solution

Try pd.factorize():

train['city'] = pd.factorize(train.city)[0]

Or categorical dtypes:

train['city'] = train['city'].astype('category').cat.codes

For example:

>>> train
       city
0  city_151
1  city_149
2  city_151
3  city_149
4  city_149
5  city_149
6  city_151
7  city_151
8  city_150
9  city_151

factorize:

train['city'] = pd.factorize(train.city)[0]

>>> train
   city
0     0
1     1
2     0
3     1
4     1
5     1
6     0
7     0
8     2
9     0
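
pd.factorize() returns a pair (codes, uniques), not just the codes, so the original labels can be recovered later. A minimal sketch, assuming the same ten-row frame as above; the names city_code, mapping and restored are only illustrative:

```python
import pandas as pd

# Same ten rows as in the example above
train = pd.DataFrame({"city": ["city_151", "city_149", "city_151", "city_149", "city_149",
                               "city_149", "city_151", "city_151", "city_150", "city_151"]})

# factorize numbers the labels in order of first appearance
codes, uniques = pd.factorize(train["city"])
train["city_code"] = codes  # illustrative column name, keeps the original column intact

# uniques[i] is the label that received code i
mapping = dict(enumerate(uniques))

# Map the codes back to the original labels
restored = uniques.take(codes)
```

Keeping uniques around is what makes the encoding reversible; if you overwrite train['city'] directly and discard it, the labels are gone.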

Or astype('category'):

train['city'] = train['city'].astype('category').cat.codes

>>> train
   city
0     2
1     0
2     2
3     0
4     0
5     0
6     2
7     2
8     1
9     2
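
With the categorical route the codes follow the sorted category order rather than the order of appearance, and any value that is not one of the categories (including missing values) gets the code -1. That also gives a way to apply the same codes to a second frame. A rough sketch, assuming a separate test frame exists; test, city_999 and city_code are hypothetical names that are not part of the question:

```python
import pandas as pd

train = pd.DataFrame({"city": ["city_151", "city_149", "city_150"]})
# Hypothetical second frame; city_999 never occurs in train
test = pd.DataFrame({"city": ["city_150", "city_999", "city_151"]})

# Fix the category list once, from the training data.
# Note: sorting strings is lexicographic, so e.g. "city_10" sorts before "city_9".
city_dtype = pd.CategoricalDtype(categories=sorted(train["city"].unique()))

train["city_code"] = train["city"].astype(city_dtype).cat.codes
test["city_code"] = test["city"].astype(city_dtype).cat.codes  # unseen labels become -1
```

Sharing one CategoricalDtype means a given city gets the same number in both frames, which calling astype('category') separately on each frame does not guarantee.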
