How to reformat categorical pandas variables for scikit-learn

Problem description

Given a pandas DataFrame that looks like this:

|       | c_0337 | c_0348 | c_0351 | c_0364 |
|-------|:------:|-------:|--------|--------|
| id    |        |        |        |        |
| 11193 |    a   |      f | o      | a      |
| 11382 |    a   |      k | s      | a      |
| 16531 |    b   |      p | f      | b      |
| 1896  |    a   |      f | o      | NaN    |

I am trying to convert the categorical variables to numeric ones (preferably binary true/false columns). I tried using the OneHotEncoder from scikit-learn as follows:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([c4k.ix[:,'c_0327':'c_0351'].values])  
OneHotEncoder(categorical_features='all',
   n_values='auto', sparse=True) 

That just gave me: invalid literal for long() with base 10: 'f'

I need to get the data into an array acceptable to scikit-learn, where the created columns are false for most entries (i.e. very sparse) and true only in the created column that corresponds to the letter present, with NaN mapped to 0 (false).

I suspect I'm way off here. Am I even using the right preprocessor?

I'm brand new at this, so any pointers are appreciated. The actual dataset has over 1000 such columns. I then tried using DictVectorizer as follows:

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer() 
#fill df with zeros Since we don't want NaN
c4kNZ=c4k.ix[:,'c_0327':'c_0351'].fillna(0) 
#Make the dataFrame a Dict 
c4kb=c4kNZ.to_dict() 
sdata = vec.fit_transform(c4kb) 

It gives me: float() argument must be a string or a number. I rechecked the dict and it looks OK to me, but I guess I have not formatted it correctly?
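
One likely cause, offered as an assumption since the full data is not shown: `DataFrame.to_dict()` with its default orientation returns one dict per column keyed by row label, whereas DictVectorizer expects a list with one dict per row. The sketch below uses `to_dict(orient="records")` on a small hypothetical frame named `df` (a stand-in for `c4k`, not the asker's real data). Dropping the NaN entries from each row dict, rather than filling them with 0, leaves those rows all-zero for that column.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical stand-in for the asker's c4k frame (two of its columns).
df = pd.DataFrame({
    "c_0348": ["f", "k", "p", "f"],
    "c_0364": ["a", "a", "b", np.nan],
})

# One dict per row; skip missing values so they create no feature.
rows = [
    {col: val for col, val in record.items() if pd.notna(val)}
    for record in df.to_dict(orient="records")
]

vec = DictVectorizer()            # returns a scipy sparse matrix by default
sdata = vec.fit_transform(rows)   # one column per "column=value" pair

print(sorted(vec.vocabulary_))    # names of the generated features
print(sdata.toarray())            # dense view, only sensible for small data
```

String values become one indicator feature per `column=value` pair, so the last row here has no `c_0364` feature set at all.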

Recommended answer

Is this what you are looking for?
It uses get_dummies to convert categorical columns into sparse dummy columns that indicate the presence of each value:

In [11]: import pandas as pd

In [12]: df = pd.DataFrame({'c_0337':list('aaba'), 'c_0348':list('fkpf')})

In [13]: df
Out[13]:
  c_0337 c_0348
0      a      f
1      a      k
2      b      p
3      a      f

In [14]: pd.get_dummies(df)
Out[14]:
   c_0337_a  c_0337_b  c_0348_f  c_0348_k  c_0348_p
0         1         0         1         0         0
1         1         0         0         1         0
2         0         1         0         0         1
3         1         0         1         0         0
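
The question also asks that NaN map to 0 and that the result be an array scikit-learn accepts. The following is a minimal sketch, assuming a reasonably recent pandas release; the frame is rebuilt from the four sample rows in the question, and the names `dummies` and `X` are illustrative only. By default get_dummies ignores NaN (`dummy_na=False`), so a row with a missing value gets 0 in every dummy column derived from that column.

```python
import numpy as np
import pandas as pd

# Rows copied from the question's sample table (id used as the index).
df = pd.DataFrame(
    {
        "c_0337": ["a", "a", "b", "a"],
        "c_0348": ["f", "k", "p", "f"],
        "c_0351": ["o", "s", "f", "o"],
        "c_0364": ["a", "a", "b", np.nan],
    },
    index=pd.Index([11193, 11382, 16531, 1896], name="id"),
)

# dummy_na defaults to False, so NaN produces no column of its own
# and the affected row is all zeros for the c_0364_* columns.
dummies = pd.get_dummies(df, dtype=int)

# Plain 2-D integer array that can be passed to scikit-learn estimators.
X = dummies.to_numpy()

print(dummies.columns.tolist())
print(X.shape)
```

With over 1000 source columns the dense result may become large; get_dummies also takes a `sparse=True` argument, which may be worth trying in that case.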