How to generate pandas DataFrame column of Categorical from string column?
Question
I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column, it seems to get converted right back to a Series of str:
>>> train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])
>>> type(pd.Categorical.from_array(train['LocationNormalized']))
<class 'pandas.core.categorical.Categorical'>
# however it got converted back to...
>>> type(train['LocationNFactor'][2])
<type 'str'>
>>> train['LocationNFactor'][2]
'Hampshire'
I'm guessing this is because Categorical doesn't map to any numpy dtype. So do I have to convert it to some int type, and thus lose the factor labels<->levels association? What's the most elegant workaround to store the levels<->labels association and retain the ability to convert back? (Just store it as a dict like here, and manually convert when needed?) I think Categorical is still not a first-class datatype for DataFrame, unlike in R.
(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3 - the latest macports versions of everything).
Recommended answer
The only workaround for pandas pre-0.15 I found is as follows:
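- the column must be converted to a Categorical, but numpy immediately coerces the levels to int, losing the factor information
- so store the factor in a global variable outside the DataFrame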
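train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized']) # keep the factor outside the DataFrame; default level order: alphabetical
train['LocationNFactor'] = train_LocationNFactor.labels # insert the integer codes into the DataFrame (.labels was renamed .codes in pandas 0.15)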
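The same idea (integer codes inside the DataFrame, level labels kept outside it) can also be sketched with `pd.factorize`, which is a swapped-in alternative to `Categorical.from_array`, not the call used above. The sample `train` data is a hypothetical stand-in for the question's DataFrame, and `.to_numpy()` assumes a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` DataFrame.
train = pd.DataFrame(
    {'LocationNormalized': ['London', 'Hampshire', 'Hampshire', 'Surrey']}
)

# sort=True orders the levels alphabetically; each code indexes into `levels`.
codes, levels = pd.factorize(train['LocationNormalized'], sort=True)

# Only the integer codes are stored in the DataFrame.
train['LocationNFactor'] = codes

# Convert back by looking each code up in the levels kept outside the DataFrame.
restored = levels.take(train['LocationNFactor'].to_numpy())
```

Because `levels` lives outside the DataFrame, it has to be carried along by hand wherever the codes need to be decoded.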
[Update: pandas 0.15+ added decent support for Categorical]
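As a minimal sketch of what that support looks like, assuming pandas 0.15 or later: `astype('category')` stores the column with a categorical dtype, and the `.cat` accessor exposes both the levels and the integer codes. The sample `train` data is again a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` DataFrame.
train = pd.DataFrame(
    {'LocationNormalized': ['London', 'Hampshire', 'Hampshire', 'Surrey']}
)

# The column keeps its categorical dtype inside the DataFrame.
train['LocationNFactor'] = train['LocationNormalized'].astype('category')

dtype_name = str(train['LocationNFactor'].dtype)
categories = list(train['LocationNFactor'].cat.categories)  # the levels
codes = list(train['LocationNFactor'].cat.codes)            # integer codes per row
```

Individual elements such as `train['LocationNFactor'][2]` still come back as plain strings, so the column's `dtype`, not the type of one element, is the thing to check.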