Hierarchical clustering on sparse observation matrix
Question
I'm trying to perform hierarchical clustering on a large sparse observation matrix. The matrix represents movie ratings for a number of users. My goal is to cluster similar users based on their movie preferences. However, I need a dendrogram rather than a single partition. To get one, I tried to use SciPy:
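import numpy as np
from scipy.cluster import hierarchy
from scipy.sparse import dok_matrix

# rows are items, columns are users
R = dok_matrix((nrows, ncols), dtype=np.float32)
for user in ratings:
    for item in ratings[user]:
        R[item, user] = ratings[user][item]
# toarray() materializes the full dense users x items matrix
Z = hierarchy.linkage(R.transpose().toarray(), method='ward')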
This works for small datasets. However, I (obviously) run into memory problems when scaling up. Is there any way I can feed a sparse matrix to the algorithm?
Recommended answer
In scipy/cluster/hierarchy.py, linkage processes the y argument as:
y = _convert_to_double(np.asarray(y, order='c'))

if y.ndim == 1:
    distance.is_valid_y(y, throw=True, name='y')
    [y] = _copy_arrays_if_base_present([y])
elif y.ndim == 2:
    if method in _EUCLIDEAN_METHODS and metric != 'euclidean':
        raise ValueError("Method '{0}' requires the distance metric "
                         "to be Euclidean".format(method))
    y = distance.pdist(y, metric)
else:
    raise ValueError("`y` must be 1 or 2 dimensional.")
When I apply asarray to a dok matrix, I get a 0d object array. It just wraps the dictionary in an array.
In [905]: M=sparse.dok_matrix([[1,0,0,2,3],[0,0,0,0,1]])
In [906]: M
Out[906]:
<2x5 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in Dictionary Of Keys format>
In [908]: m = np.asarray(M)
In [909]: m
Out[909]:
array(<2x5 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in Dictionary Of Keys format>, dtype=object)
In [910]: m.shape
Out[910]: ()
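Since the session above shows that asarray does not densify a sparse matrix, any conversion has to be explicit. The following is only a minimal sketch: as_dense_observations is a hypothetical helper name, not a SciPy function, and it simply makes the toarray() call from the question visible at the call site.

```python
import numpy as np
from scipy import sparse


def as_dense_observations(y):
    """Hypothetical guard (not part of SciPy): return an ndarray,
    densifying sparse input explicitly instead of relying on np.asarray."""
    if sparse.issparse(y):
        return y.toarray()      # allocates the full dense matrix
    return np.asarray(y)


M = sparse.dok_matrix([[1, 0, 0, 2, 3], [0, 0, 0, 0, 1]])
dense = as_dense_observations(M)   # 2d ndarray that linkage can consume
```

This does nothing for memory use; it only avoids handing linkage an object it cannot interpret.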
As the code above shows, linkage accepts either a 1d condensed distance matrix (the form pdist returns) or a 2d array of observation vectors, which it passes to pdist itself.
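Because the 1d branch takes precomputed distances, one possible workaround is to compute the Euclidean distances from the sparse matrix directly and hand linkage the condensed vector. The sketch below is an assumption-laden illustration, not something from the SciPy source: condensed_euclidean is a hypothetical helper, the ratings are made up, and rows are taken to be users. It uses the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so only the n x n product is ever dense.

```python
import numpy as np
from scipy import sparse
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist, squareform


def condensed_euclidean(X):
    """Condensed Euclidean distances between the rows of sparse X.

    Hypothetical helper, not part of SciPy. The n x m observation
    matrix stays sparse; only n x n intermediates are dense.
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    gram = (X @ X.T).toarray()           # dense n x n matrix of dot products
    sq_norms = np.diag(gram)             # squared norm of each row
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    np.maximum(d2, 0.0, out=d2)          # clip tiny negatives from rounding
    np.fill_diagonal(d2, 0.0)
    return squareform(np.sqrt(d2), checks=False)


# Made-up ratings: {user: {movie: rating}}, 4 users and 5 movies.
ratings = {0: {0: 5, 3: 1}, 1: {0: 4, 3: 2}, 2: {1: 3, 2: 5}, 3: {2: 4, 4: 1}}
R = sparse.dok_matrix((4, 5), dtype=np.float32)
for user, items in ratings.items():
    for movie, value in items.items():
        R[user, movie] = value

condensed = condensed_euclidean(R)
Z = hierarchy.linkage(condensed, method='ward')

# Reference result via the dense route used in the question.
Z_dense = hierarchy.linkage(R.toarray(), method='ward')
```

With 1d input, linkage does not check the metric, so ward is only meaningful here because the distances really are Euclidean. The condensed vector still has n(n-1)/2 entries, so this helps when the number of movies is the bottleneck, not the number of users.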
Looking further in linkage, I deduce that ward uses nn_chain, which is in the compiled scipy/cluster/_hierarchy.cpython-35m-i386-linux-gnu.so file. That puts the working part of the method even further out of reach of the casual Python programmer.