Vectorizing a Pandas dataframe for Scikit-Learn
Say I have a dataframe in Pandas like the following:
> my_dataframe
col1 col2
A foo
B bar
C something
A foo
A bar
B foo
where rows represent instances and columns represent input features (the target label is not shown, but this would be for a classification task), i.e. I am trying to build X out of my_dataframe.
How can I vectorize this efficiently using e.g. DictVectorizer?
Do I need to convert each and every entry in my DataFrame to a dictionary first? (that's the way it is done in the example in the link above). Is there a more efficient way to do this?
First, I don't see which parts of your sample array are features and which are observations.
Second, DictVectorizer holds no data; it is only a transformation utility that also stores metadata. After fitting, it stores the feature names and the feature-to-column mapping. It returns a feature matrix (a SciPy sparse matrix by default, or a NumPy array with sparse=False) that is used for further computations. The matrix has shape number of observations x number of features, and each value is the feature's value for that observation. So if you know your observations and features, you can create this array any other way you like.
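As one illustration of building the feature matrix "any other way", here is a minimal sketch that skips dicts entirely and one-hot encodes the columns with pandas.get_dummies. This is a swapped-in alternative, not something the answer itself uses; the sample values and the "=" separator (chosen to mimic DictVectorizer-style names such as col1=A) are assumptions.

```python
import pandas as pd

# Sample data assumed for illustration: two categorical feature columns.
df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One-hot encode every categorical column; one output column per (feature, value) pair.
dummies = pd.get_dummies(df, prefix_sep="=")

# Rows are observations, columns are features.
X = dummies.to_numpy(dtype=float)
feature_names = list(dummies.columns)

print(feature_names)
print(X.shape)
```

This stays inside pandas, but unlike a fitted DictVectorizer it does not remember the mapping, so new data would have to be aligned to the same columns by hand.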
If you want sklearn to do it for you, you don't have to reconstruct the dicts manually; it can be done by applying to_dict to the transposed dataframe:
>>> df
col1 col2
0 A foo
1 B bar
2 C foo
3 A bar
4 A foo
5 B bar
>>> list(df.T.to_dict().values())
[{'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}, {'col1': 'C', 'col2': 'foo'}, {'col1': 'A', 'col2': 'bar'}, {'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}]
Since pandas 0.13.0 (Jan 3, 2014), the to_dict() method accepts a 'records' option, so now you can simply use this method without additional manipulation:
>>> import pandas
>>> df = pandas.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'], 'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar']})
>>> df
col1 col2
0 A foo
1 B bar
2 C foo
3 A bar
4 A foo
5 B bar
>>> df.to_dict('records')
[{'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}, {'col1': 'C', 'col2': 'foo'}, {'col1': 'A', 'col2': 'bar'}, {'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}]
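To connect this back to the original question, the list of records can be passed straight to DictVectorizer. The following is a minimal sketch, assuming the same sample dataframe as above; sparse=False is chosen only so the result prints as a plain NumPy array, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Same sample data as in the answer.
df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One dict per row: [{'col1': 'A', 'col2': 'foo'}, ...]
records = df.to_dict("records")

# sparse=False returns a dense NumPy array instead of the default SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

# String values are one-hot encoded into "column=value" features.
print(vectorizer.feature_names_)
print(X.shape)
```

For a large dataframe with many distinct values, leaving sparse at its default is likely the better choice, since most entries of a one-hot matrix are zero.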