Convert array of string (category) to array of int from a pandas dataframe
Question
I am trying to do something very similar to a previous question, but I get an error. I have a pandas dataframe containing features and a label, and I need to do some conversion before sending the features and the label variable into a machine learning object:
import pandas
import milk
from scikits.statsmodels.tools import categorical
Then I have:
trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'
labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets
The console output first shows:
features
[[ 0.38846334 0.97681855]
[ 3.8318634 0.5724734 ]
[ 0.67710876 1.01816444]
[ 1.12024943 0.91508699]
[ 7.51749674 1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]
Then I get the following error:
Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'
Is it possible to convert the categorical variable 'type' within the dataframe into an int type? 'type' can take the values 'single', 'touching', 'nuclei', 'dusts', and I need to convert them to int values such as 0, 1, 2, 3.
Recommended answer
If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):
In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])
In [2]: s
Out[2]:
0 single
1 touching
2 nuclei
3 dusts
4 touching
5 single
6 nuclei
Name: None, Length: 7
In [4]: Factor(s)
Out[4]:
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]
The factor has the attributes labels and levels:
In [7]: f = Factor(s)
In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)
In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
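Note that Factor belongs to very early pandas releases and is not available in current ones, where pd.Categorical plays the same role. The following is a minimal sketch of the same session under that assumption; the attribute names codes and categories are the modern counterparts of labels and levels, not part of the original answer.

```python
import pandas as pd

s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
               'touching', 'single', 'nuclei'])

# pd.Categorical infers the categories from the data and sorts them.
cat = pd.Categorical(s)

codes = cat.codes            # one integer code per element (counterpart of f.labels)
categories = cat.categories  # sorted unique values (counterpart of f.levels)

print(codes.tolist())    # [2, 3, 1, 0, 3, 2, 1]
print(list(categories))  # ['dusts', 'nuclei', 'single', 'touching']
```

The integer codes match the f.labels output above because the categories are sorted alphabetically in both cases.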
This is intended for 1D vectors, so I am not sure whether it can be applied directly to your problem, but have a look.
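As a rough illustration of how this might apply to the question, the sketch below selects the 'type' column as a 1D Series and encodes it with an explicit category order, so that 'single' maps to 0, 'touching' to 1, 'nuclei' to 2 and 'dusts' to 3. It assumes a recent pandas version; the dataframe df is a hypothetical stand-in for trainedData, with made-up rows loosely following the printed output in the question.

```python
import pandas as pd

# Hypothetical stand-in for the asker's trainedData (values are illustrative only).
df = pd.DataFrame({
    'ratio': [0.39, 3.83, 0.68, 1.12, 7.52],
    'area': [0.98, 0.57, 1.02, 0.92, 1.00],
    'type': ['single', 'touching', 'single', 'single', 'nuclei'],
})

# Fix the category order explicitly so the integer codes follow the
# mapping requested in the question instead of alphabetical order.
order = ['single', 'touching', 'nuclei', 'dusts']

# df['type'] is a 1D Series, unlike the (n, 1) array built in the question.
targets = pd.Categorical(df['type'], categories=order).codes

# Plain 2D float array of the two feature columns.
features = df[['ratio', 'area']].to_numpy()

print(targets.tolist())  # [0, 1, 0, 0, 2]
print(features.shape)    # (5, 2)
```

Any value not listed in order would receive the code -1, which is worth checking for before training.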
BTW, I recommend that you ask these questions on the statsmodels and/or scikit-learn mailing lists, since most of us are not frequent SO users.