Python equivalent of daisy() in the cluster package of R

Problem description

I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

if(!require("cluster")) { install.packages("cluster");  require("cluster") }
data(flower)
as.matrix(daisy(flower, metric = "gower"))

This uses the gower metric to deal with the nominal variables. Is there a Python equivalent of the daisy() function in R?

Or maybe any other module function that allows using the Gower metric or something similar to calculate the (dis)similarity matrix for a dataset with mixed (nominal, numeric) attributes?

Recommended answer

I believe you are looking for scipy.spatial.distance.pdist.

If you implement a function that computes the Gower distance between a single pair of observations, you can pass that function to pdist as its metric argument. pdist applies it to every pair and returns the pairwise distances in condensed (vector) form, which scipy.spatial.distance.squareform expands into a square matrix. The Gower distance does not appear to be one of pdist's built-in metrics.
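As a minimal sketch of that idea (not the daisy() algorithm itself), the helper below builds a Gower-style distance and hands it to pdist. The name make_gower, the toy array, and the convention that categorical columns are pre-encoded as numeric codes are all assumptions made for illustration. Ordinal variables and missing values, which daisy() handles, are not treated here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def make_gower(is_categorical, ranges):
    """Build a Gower-style distance for pdist (hypothetical helper).

    is_categorical: one boolean per column.
    ranges: max - min per column, used to scale the numeric columns.
    """
    is_cat = np.asarray(is_categorical, dtype=bool)
    rng = np.asarray(ranges, dtype=float)
    rng = np.where(rng == 0, 1.0, rng)  # avoid dividing by zero on constant columns

    def gower(u, v):
        # Categorical columns contribute 0 on a match and 1 on a mismatch.
        cat_part = (u[is_cat] != v[is_cat]).astype(float)
        # Numeric columns contribute the range-scaled absolute difference.
        num_part = np.abs(u[~is_cat] - v[~is_cat]) / rng[~is_cat]
        # Average the per-column contributions.
        return (cat_part.sum() + num_part.sum()) / u.size

    return gower


# Toy data (assumed): column 0 is a category code, columns 1-2 are numeric.
X = np.array([[0, 1.0, 10.0], [1, 3.0, 10.0], [0, 5.0, 30.0]])
is_categorical = [True, False, False]
ranges = X.max(axis=0) - X.min(axis=0)

condensed = pdist(X, metric=make_gower(is_categorical, ranges))
D = squareform(condensed)  # square, symmetric dissimilarity matrix
```

Here D plays the role of as.matrix(daisy(...)) in the R snippet, under the simplifications stated above.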

Likewise, if a single observation has mixed attributes, you can define your own function that, say, applies the Euclidean distance to the subset of numerical attributes and a Gower distance to the subset of categorical attributes, then adds the two. Any other implementation works too, as long as it captures what it means, for your application, to compute the distance between two isolated observations.
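One way this could look when the categorical values are strings (which pdist cannot take directly, since it works on numeric arrays) is a plain pairwise loop. The function names, the record layout, and the choice to use the fraction of mismatching categories as the categorical part are assumptions for this sketch, not something prescribed by SciPy or by daisy().

```python
import math
from functools import partial
from itertools import combinations

import numpy as np


def mixed_distance(a, b, numeric_keys, categorical_keys):
    """Euclidean distance on numeric fields plus mismatch fraction on categorical fields."""
    num = math.sqrt(sum((a[k] - b[k]) ** 2 for k in numeric_keys))
    mismatches = sum(a[k] != b[k] for k in categorical_keys)
    cat = mismatches / len(categorical_keys) if categorical_keys else 0.0
    return num + cat


def pairwise_matrix(records, dist):
    """Apply dist to every pair of records and fill a symmetric matrix."""
    n = len(records)
    D = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        D[i, j] = D[j, i] = dist(records[i], records[j])
    return D


# Toy records (assumed field names).
records = [
    {"height": 1.0, "width": 2.0, "color": "red"},
    {"height": 4.0, "width": 6.0, "color": "blue"},
    {"height": 1.0, "width": 2.0, "color": "blue"},
]
dist = partial(mixed_distance,
               numeric_keys=["height", "width"],
               categorical_keys=["color"])
D_mixed = pairwise_matrix(records, dist)
```

Because the unscaled Euclidean part can dominate the sum, scaling the numeric attributes first is probably worth considering in practice.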

For clustering in Python, you would usually work with scikits.learn (now scikit-learn). This question and answer page discusses exactly this problem of using a custom distance measure (in your case Gower) with scikits, which does not appear to be possible.

You could use one of the metrics provided by pdist along with the implementation at that linked answer page, or you could implement a function for the Gower similarity and use that. But if you want the out-of-the-box clustering tools from scikits, it does not appear to be directly possible.
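As a side sketch that stays within SciPy rather than scikits: scipy.cluster.hierarchy.linkage accepts the condensed output of pdist directly, so distances from a built-in metric (or from a custom callable) can feed hierarchical clustering. The toy data, the "cityblock" metric, and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy data (assumed): two well-separated groups of two points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Any pdist metric works here, including a user-defined callable.
condensed_dist = pdist(points, metric="cityblock")

# Average-linkage hierarchical clustering on the condensed distances.
Z = linkage(condensed_dist, method="average")

# Cut the tree into at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```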
