NaN/inf values in scikit-learn manifold learning functions


Problem description

I have a manifold learning / non-linear dimensionality reduction problem where I know the distances between objects up to some threshold; beyond that I only know that the distance is "far". Also, in some cases some of the distances may be missing. I am trying to use sklearn.manifold to find a 1-d representation. A natural encoding would be to represent "far" distances as inf and missing distances as nan.

However, it seems that scikit-learn currently does not support nan and inf values in distance matrices given to the manifold learning functions in sklearn.manifold, since I get ValueError: Array contains NaN or infinity.

Is there a conceptual reason for this? Some methods seem especially suitable for inf, e.g. non-metric MDS. I also know that some implementations of these methods in other languages are able to handle missing/inf values.

Instead of using inf, I have considered setting the "far" values to a very large number, but I am not sure how this would affect the results.

Update:

I dug into the code of sklearn.manifold.MDS._smacof_single() and found a piece of code and a comment saying that "similarities with 0 are considered as missing values". Is this an undocumented way to specify missing values? Does this work with all manifold functions?

Recommended answer

Short answer: As you mentioned, non-metric MDS is capable of working with incomplete dissimilarity matrices. You are right: values set to zero are interpreted as missing values when using MDS(metric=False). This won't work for other manifold learning procedures that are not based on non-metric MDS, but similar (undocumented) mechanisms may be available.
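The zero-as-missing behaviour can be sketched as follows. This is not from the original answer but a minimal illustration assuming a recent scikit-learn; the toy dissimilarity matrix D is made up, and the zero entries at (0, 4)/(4, 0) stand for a missing dissimilarity, not an actual distance of zero:

```python
import numpy as np
from sklearn.manifold import MDS

# Toy 5x5 symmetric dissimilarity matrix (illustrative values).
# Off-diagonal zeros are treated as missing values by non-metric MDS.
D = np.array([
    [0.0, 1.0, 2.0, 3.0, 0.0],
    [1.0, 0.0, 1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0, 1.0, 2.0],
    [3.0, 2.0, 1.0, 0.0, 1.0],
    [0.0, 3.0, 2.0, 1.0, 0.0],
])

# Non-metric MDS on the precomputed (incomplete) dissimilarity matrix,
# embedding into one dimension as in the question.
mds = MDS(n_components=1, metric=False, dissimilarity="precomputed",
          random_state=0)
embedding = mds.fit_transform(D)
print(embedding.shape)  # (5, 1)
```

With metric=True the same zeros would instead be taken literally as zero distances, so this trick is specific to the non-metric SMACOF path.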

On your question: replacing inf by high values will certainly shape your low-dimensional representation. Whether this is valid is rather a conceptual question that can only be answered knowing the origin of the inf values. If the inf entries mean something like "these data are really distant from each other", replacement by high values can make sense (as in your case). If they rather represent missing knowledge about the dissimilarity, I would not recommend replacing them that way. If there is no other solution (like non-metric MDS or matrix completion), then in such cases I would rather recommend replacing them by the median of the measurable distances (check out imputation).
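The median-imputation idea above can be sketched in a few lines. This is an illustrative snippet, not part of the original answer; the 3x3 matrix is invented, and the inf entries play the role of the "far"/unknown dissimilarities:

```python
import numpy as np

# Toy distance matrix: inf marks a "far"/unknown dissimilarity.
D = np.array([
    [0.0,    1.0, np.inf],
    [1.0,    0.0, 2.0],
    [np.inf, 2.0, 0.0],
])

# Median of the finite, off-diagonal (measured) distances.
measured = D[np.isfinite(D) & (D > 0)]
median = np.median(measured)

# Replace every non-finite entry by that median before passing the
# matrix on to a manifold learner.
D_imputed = np.where(np.isfinite(D), D, median)
print(D_imputed)
```

Here the measured distances are {1, 1, 2, 2}, so both inf entries become 1.5; the resulting matrix is fully finite and can be fed to any of the sklearn.manifold estimators that accept precomputed dissimilarities.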

Check out my answer to a similar question from 2017.

