How to convert categorical data to numerical data?
Problem description
I have a feature, city, which is categorical data (strings). Instead of hardcoding the mapping with replace(), is there a smarter approach?
train['city'].unique()
Output: ['city_149', 'city_83', 'city_16', 'city_64', 'city_100', 'city_21',
'city_114', 'city_103', 'city_97', 'city_160', 'city_65',
'city_90', 'city_75', 'city_136', 'city_159', 'city_67', 'city_28',
'city_10', 'city_73', 'city_76', 'city_104', 'city_27', 'city_30',
'city_61', 'city_99', 'city_41', 'city_142', 'city_9', 'city_116',
'city_128', 'city_74', 'city_69', 'city_1', 'city_176', 'city_40',
'city_123', 'city_152', 'city_165', 'city_89', 'city_36', .......]
What I tried:
train.replace(['city_149', 'city_83', 'city_16', 'city_64', 'city_100', 'city_21',
'city_114', 'city_103', 'city_97', 'city_160', 'city_65',
'city_90', 'city_75', 'city_136', 'city_159', 'city_67', 'city_28',
'city_10', 'city_73', 'city_76', 'city_104', 'city_27', 'city_30',
'city_61', 'city_99', 'city_41', 'city_142', 'city_9', 'city_116',
'city_128', 'city_74', 'city_69', 'city_1', 'city_176', 'city_40',
'city_123', 'city_152', 'city_165', 'city_89', 'city_36', .......], [1,2,3,4,5,6,7,8,9....], inplace=True)
Is there a better way to convert this data to numerical values? The column has 123 unique values, so with replace() I would have to hardcode the numbers 1, 2, 3, ..., 123 myself. Please suggest a better way to convert the column to numerical values.
Solution
Try pd.factorize():
train['city'] = pd.factorize(train.city)[0]
Or use the categorical dtype:
train['city'] = train['city'].astype('category').cat.codes
For example:
>>> train
city
0 city_151
1 city_149
2 city_151
3 city_149
4 city_149
5 city_149
6 city_151
7 city_151
8 city_150
9 city_151
Using factorize:
train['city'] = pd.factorize(train.city)[0]
>>> train
city
0 0
1 1
2 0
3 1
4 1
5 1
6 0
7 0
8 2
9 0
Or using astype('category'):
train['city'] = train['city'].astype('category').cat.codes
>>> train
city
0 2
1 0
2 2
3 0
4 0
5 0
6 2
7 2
8 1
9 2
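The two approaches number the cities differently: factorize assigns codes in order of first appearance, while the categorical codes follow the sorted order of the categories. The sketch below, again assuming the ten-row sample from the example, shows where the categorical codes come from and how to look up the label behind each code. The names city_cat and category_mapping are illustrative.

```python
import pandas as pd

# Same ten-row sample as in the example above.
train = pd.DataFrame({
    "city": ["city_151", "city_149", "city_151", "city_149", "city_149",
             "city_149", "city_151", "city_151", "city_150", "city_151"]
})

# Converting to a categorical dtype sorts the distinct labels into categories.
city_cat = train["city"].astype("category")

# Illustrative lookup table from code to label, in category order.
category_mapping = dict(enumerate(city_cat.cat.categories))

# Each code is the position of the row's label within the categories.
train["city_code"] = city_cat.cat.codes

print(category_mapping)
print(train)
```

Because the labels are strings, the sort is lexicographic, so a label such as city_10 would sort before city_9. Missing values get the code -1 here as well.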