Convert array of string (category) to array of int from a pandas dataframe

Question

I am trying to do something very similar to a previous question, but I get an error. I have a pandas dataframe containing features and a label, and I need to do some conversion before sending the features and the label variable into a machine learning object:

import pandas
import milk
from scikits.statsmodels.tools import categorical

Then I have:

trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'

labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets

The console output first shows:

features
[[ 0.38846334  0.97681855]
[ 3.8318634   0.5724734 ]
[ 0.67710876  1.01816444]
[ 1.12024943  0.91508699]
[ 7.51749674  1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]

Then I get the following error:

Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'

Is it possible to convert the categorical variable 'type' within the dataframe to an int type? 'type' can take the values 'single', 'touching', 'nuclei', and 'dusts', and I need to convert them to int values such as 0, 1, 2, 3.

Recommended answer

If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):

In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])

In [2]: s
Out[2]: 
0    single
1    touching
2    nuclei
3    dusts
4    touching
5    single
6    nuclei
Name: None, Length: 7

In [4]: Factor(s)
Out[4]: 
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]

The factor has the attributes labels and levels:

In [7]: f = Factor(s)

In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)

In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
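The session above uses an early pandas release. In more recent pandas versions the Factor class is, as far as I know, no longer available and pd.Categorical takes its place, with codes and categories playing the roles of labels and levels. A minimal sketch under that assumption (the variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
               'touching', 'single', 'nuclei'])

# Categories are inferred from the data and sorted alphabetically,
# which matches the level order shown in the session above.
cat = pd.Categorical(s)

codes = cat.codes          # integer code for each element (like f.labels)
levels = cat.categories    # the distinct values (like f.levels)

print(codes.tolist())
print(list(levels))
```

Because the categories are sorted, the codes follow alphabetical order rather than the order in which the values first appear.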

This is intended for 1D vectors, so I am not sure whether it can be applied to your problem directly, but have a look.
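One way this might be applied to the dataframe in the question is to pull the 'type' column out as a 1D Series and fix the category order explicitly, so that single, touching, nuclei, dusts map to 0, 1, 2, 3. This is only a sketch with a recent pandas: the small frame below is a hypothetical stand-in for trainedData (values rounded from the printed output), and pd.Categorical and to_numpy are used in place of the older Factor and as_matrix calls.

```python
import pandas as pd

# Hypothetical stand-in for the asker's trainedData frame.
trained = pd.DataFrame({
    'ratio': [0.39, 3.83, 0.68, 1.12, 7.52],
    'area': [0.98, 0.57, 1.02, 0.92, 1.00],
    'type': ['single', 'touching', 'single', 'single', 'nuclei'],
})

# Explicit order: the position in this list becomes the integer code.
order = ['single', 'touching', 'nuclei', 'dusts']

# Select the column as a 1D Series (not a one-column 2D array).
# Values missing from `order` would get the code -1.
targets = pd.Categorical(trained['type'], categories=order).codes

# Feature matrix as a plain 2D NumPy array.
features = trained[['ratio', 'area']].to_numpy()

print(targets.tolist())
print(features.shape)
```

Passing the categories explicitly keeps the mapping stable even when a subset of the data happens not to contain every label.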

BTW, I recommend that you ask these questions on the statsmodels and/or scikit-learn mailing lists, since most of us are not frequent SO users.
