Matrix completion in Python
Question
Say that I have a matrix:
> import numpy as np
> a = np.random.random((5,5))
> a
array([[ 0.28164485, 0.76200749, 0.59324211, 0.15201506, 0.74084168],
[ 0.83572213, 0.63735993, 0.28039542, 0.19191284, 0.48419414],
[ 0.99967476, 0.8029097 , 0.53140614, 0.24026153, 0.94805153],
[ 0.92478 , 0.43488547, 0.76320656, 0.39969956, 0.46490674],
[ 0.83315135, 0.94781119, 0.80455425, 0.46291229, 0.70498372]])
And that I punch some holes in it with np.nan, e.g.:
> a[(1,4,0,3),(2,4,2,0)] = np.nan
> a
array([[ 0.80327707, 0.87722234, nan, 0.94463778, 0.78089194],
[ 0.90584284, 0.18348667, nan, 0.82401826, 0.42947815],
[ 0.05913957, 0.15512961, 0.08328608, 0.97636309, 0.84573433],
[ nan, 0.30120861, 0.46829231, 0.52358888, 0.89510461],
[ 0.19877877, 0.99423591, 0.17236892, 0.88059185, nan ]])
I would like to fill in the nan entries using information from the rest of the entries of the matrix. An example would be using the average value of the column where each nan entry occurs.
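A minimal sketch of that column-mean idea, using only NumPy. The helper name fill_with_column_means and the small 3x3 matrix are made up for illustration; they are not part of any library.

```python
import numpy as np

def fill_with_column_means(matrix):
    """Return a copy of `matrix` with each NaN replaced by its column's mean.

    Hypothetical helper for illustration. A column that is entirely NaN
    has no mean to borrow, so its entries stay NaN.
    """
    filled = np.array(matrix, dtype=float)   # np.array copies, so the input is untouched
    col_means = np.nanmean(filled, axis=0)   # per-column mean that ignores NaN entries
    rows, cols = np.where(np.isnan(filled))  # coordinates of the holes
    filled[rows, cols] = col_means[cols]     # each hole gets the mean of its own column
    return filled

a = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0]])
filled = fill_with_column_means(a)
print(filled)
```

The observed entries are left as they are; only the positions that held nan change.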
More generally, are there any libraries in Python for matrix completion? (e.g. something along the lines of Candes & Recht's convex optimization method).
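To make the low-rank idea concrete, here is a rough NumPy sketch of an iterative scheme that soft-thresholds singular values. It is related in spirit to nuclear-norm approaches but is not Candes & Recht's exact convex program. The function name soft_impute, the shrinkage value, and the stopping rule are all assumptions chosen for illustration.

```python
import numpy as np

def soft_impute(matrix, shrinkage=0.1, n_iter=200, tol=1e-6):
    """Fill NaNs by repeated SVD with soft-thresholded singular values.

    Illustrative sketch only: `shrinkage`, `n_iter` and `tol` are
    hypothetical defaults, not tuned or recommended values.
    """
    matrix = np.asarray(matrix, dtype=float)
    observed = ~np.isnan(matrix)
    filled = np.where(observed, matrix, 0.0)   # start with zeros in the holes
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)     # shrink singular values toward zero (favors low rank)
        low_rank = (u * s) @ vt                # rebuild the shrunken approximation
        updated = np.where(observed, matrix, low_rank)  # observed entries stay fixed
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:                       # stop once the holes barely move
            break
    return filled

# Toy rank-1 matrix with the same hole pattern as in the question.
rng = np.random.default_rng(0)
truth = np.outer(rng.random(5), rng.random(5))
holed = truth.copy()
holed[(1, 4, 0, 3), (2, 4, 2, 0)] = np.nan
completed = soft_impute(holed)
print(completed)
```

How close the filled values come to the true ones depends on the shrinkage, the rank of the underlying matrix, and how many entries are missing.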
This problem appears often in machine learning, for example when working with missing features in classification/regression or in collaborative filtering (e.g. see the Netflix Problem on Wikipedia).
Recommended answer
If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class:
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(strategy="mean")
>>> a = np.random.random((5,5))
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.77473361, 0.62987193, nan, 0.11367791, 0.17633671],
[ 0.68555944, 0.54680378, nan, 0.64186838, 0.15563309],
[ 0.37784422, 0.59678177, 0.08103329, 0.60760487, 0.65288022],
[ nan, 0.54097945, 0.30680838, 0.82303869, 0.22784574],
[ 0.21223024, 0.06426663, 0.34254093, 0.22115931, nan]])
>>> a = imp.fit_transform(a)
>>> a
array([[ 0.77473361, 0.62987193, 0.24346087, 0.11367791, 0.17633671],
[ 0.68555944, 0.54680378, 0.24346087, 0.64186838, 0.15563309],
[ 0.37784422, 0.59678177, 0.08103329, 0.60760487, 0.65288022],
[ 0.51259188, 0.54097945, 0.30680838, 0.82303869, 0.22784574],
[ 0.21223024, 0.06426663, 0.34254093, 0.22115931, 0.30317394]])
After this, you can use imp.transform to apply the same transformation to other data, using the column means that imp learned from a. Imputers tie into scikit-learn Pipeline objects, so you can use them in classification or regression pipelines.
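As a hedged sketch of both points: later scikit-learn releases replaced sklearn.preprocessing.Imputer with sklearn.impute.SimpleImputer, so the example below assumes such a release. The toy arrays and the choice of LinearRegression as the downstream estimator are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy training data with holes (illustrative values only).
X_train = np.array([[1.0, 2.0],
                    [np.nan, 4.0],
                    [3.0, np.nan],
                    [5.0, 8.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])

# Learn the column means from the training data...
imp = SimpleImputer(strategy="mean")
imp.fit(X_train)

# ...then reuse those same means to fill holes in other data.
X_new = np.array([[np.nan, 1.0],
                  [2.0, np.nan]])
X_new_filled = imp.transform(X_new)
print(X_new_filled)

# The imputer can also be a pipeline step ahead of an estimator.
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_new)
print(predictions)
```

The holes in X_new are filled with the means computed from X_train, not from X_new itself, which is what keeps the transformation consistent between training and prediction.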
If you want to wait for a stable release, then 0.14 should be out next week.
Full disclosure: I'm a scikit-learn core developer.