LabelPropagation - How to avoid division by zero?
Question
When using LabelPropagation, I often run into this warning (imho it should be an error because it completely fails the propagation):
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:279: RuntimeWarning: invalid value encountered in true_divide self.label_distributions_ /= normalizer
So after a few tries with the RBF kernel, I discovered that the parameter gamma has an influence.
EDIT:
The problem comes from these lines:
if self._variant == 'propagation':
normalizer = np.sum(
self.label_distributions_, axis=1)[:, np.newaxis]
self.label_distributions_ /= normalizer
I don't get how label_distributions_ can be all zeros, especially when its definition is:
self.label_distributions_ = safe_sparse_dot(
graph_matrix, self.label_distributions_)
Gamma has an influence on the graph_matrix (because graph_matrix is the result of _build_graph(), which calls the kernel function). OK. But still, something's wrong.
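To make the 0/0 mechanism concrete, here is a toy sketch (not sklearn's exact internals), reusing the setup from the old post below: two points with dist(i, j) = 10 and gamma = 120, so the cross weight underflows to exactly 0.0 and the unlabeled point's row of the propagated distribution stays all-zero.

```python
import numpy as np

# Toy sketch of the failure: the off-diagonal weight exp(-120 * 10)
# underflows to 0.0, so the graph matrix degenerates to the identity.
D = np.array([[0.0, 10.0],
              [10.0, 0.0]])                # pairwise distances
graph_matrix = np.exp(-120.0 * D)          # [[1., 0.], [0., 1.]]

# One labeled point (class 0) and one unlabeled point, whose row starts all-zero.
label_distributions = np.array([[1.0],
                                [0.0]])
label_distributions = graph_matrix @ label_distributions
normalizer = label_distributions.sum(axis=1, keepdims=True)
print(normalizer.ravel())                  # [1. 0.]  <- zero for the unlabeled point
with np.errstate(invalid='ignore'):
    print((label_distributions / normalizer).ravel())  # [ 1. nan]
```

The unlabeled point receives zero weight from every neighbor, so its normalizer is 0 and the division produces NaN — exactly the `invalid value encountered in true_divide` warning above.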
OLD POST (before edit)
I remind you how graph weights are computed for the propagation: W = exp(-gamma * D), where D is the pairwise distance matrix between all points of the dataset.
The problem is: np.exp(x) returns 0.0 if x is very negative (it underflows).
Let's imagine we have two points i and j such that dist(i, j) = 10.
>>> np.exp(np.asarray(-10*40, dtype=float)) # gamma = 40 => OKAY
1.9151695967140057e-174
>>> np.exp(np.asarray(-10*120, dtype=float)) # gamma = 120 => NOT OKAY
0.0
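For reference, the float64 underflow cutoff sits near -745; np.exp of anything below that is exactly zero:

```python
import numpy as np

# The smallest positive double is about 5e-324, so exp(x) survives only
# while x stays above roughly log(5e-324) ~ -745.
print(np.exp(-745.0))  # ~5e-324, a subnormal but still nonzero
print(np.exp(-746.0))  # 0.0
```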
In practice, I'm not setting gamma manually but I'm using the method described in this paper (section 2.4).
So, how would one avoid this division by zero to get a proper propagation?
The only way I can think of is to normalize the dataset in every dimension, but we would lose some geometric/topological properties of the dataset (a 2x10 rectangle becoming a 1x1 square, for example).
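For what it's worth, that per-dimension normalization can be sketched with plain min-max scaling (sklearn's MinMaxScaler does the same thing), which shows the distortion mentioned above:

```python
import numpy as np

# Corners of a 2x10 rectangle; scaling each dimension independently to [0, 1]
# maps them onto a 1x1 square, destroying the original aspect ratio.
X = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [0.0, 10.0],
              [2.0, 10.0]])
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_norm)  # corners of the unit square
```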
Reproducible example:
In this example, it's worse: even with gamma = 20 it fails.
In [11]: from sklearn.semi_supervised.label_propagation import LabelPropagation
In [12]: import numpy as np
In [13]: X = np.array([[0, 0], [0, 10]])
In [14]: Y = [0, -1]
In [15]: LabelPropagation(kernel='rbf', tol=0.01, gamma=20).fit(X, Y)
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:279: RuntimeWarning: invalid value encountered in true_divide
self.label_distributions_ /= normalizer
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:290: ConvergenceWarning: max_iter=1000 was reached without convergence.
category=ConvergenceWarning
Out[15]:
LabelPropagation(alpha=None, gamma=20, kernel='rbf', max_iter=1000, n_jobs=1,
n_neighbors=7, tol=0.01)
In [16]: LabelPropagation(kernel='rbf', tol=0.01, gamma=2).fit(X, Y)
Out[16]:
LabelPropagation(alpha=None, gamma=2, kernel='rbf', max_iter=1000, n_jobs=1,
n_neighbors=7, tol=0.01)
Answer

Basically you're doing a softmax function, right?

The general way to prevent softmax from over/underflowing is (from here):
# Instead of this . . .
def softmax(x, axis = 0):
return np.exp(x) / np.sum(np.exp(x), axis = axis, keepdims = True)
# Do this
def softmax(x, axis = 0):
e_x = np.exp(x - np.max(x, axis = axis, keepdims = True))
return e_x / e_x.sum(axis, keepdims = True)
This bounds e_x between 0 and 1, and assures one value of e_x will always be 1 (namely the element np.argmax(x)). This prevents overflow and underflow (when np.exp(x.max()) is either bigger or smaller than float64 can handle).
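For example, the shifted version stays finite on inputs where the naive one would compute 0/0:

```python
import numpy as np

def softmax(x, axis=0):
    # Stable variant: subtracting the max makes the largest exponent exactly 0.
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

x = np.array([-12000.0, -12010.0])  # naive np.exp(x) is [0., 0.] -> 0/0 -> NaN
print(softmax(x))                   # ~[9.9995e-01 4.5398e-05], sums to 1
```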
In this case, as you can't change the algorithm, I would take the input D and use D_ = D - D.min(), as this should be numerically equivalent to the above: the largest exponent of -gamma * D is -gamma * D.min() (you're just flipping the sign), so subtracting D.min() plays the role of subtracting the max. Then do your algorithm with regard to D_.
EDIT:

As recommended by @PaulBrodersen below, you can build a "safe" RBF kernel based on the sklearn implementation here:
import numpy as np
import sklearn.metrics.pairwise

def rbf_kernel_safe(X, Y=None, gamma=None):
    X, Y = sklearn.metrics.pairwise.check_pairwise_arrays(X, Y)
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    K = sklearn.metrics.pairwise.euclidean_distances(X, Y, squared=True)
    K *= -gamma
    K -= K.max()  # shift so the largest exponent is 0, as in the softmax trick
    np.exp(K, K)  # exponentiate K in-place
    return K
And then use it in your propagation:

LabelPropagation(kernel=rbf_kernel_safe, tol=0.01, gamma=20).fit(X, Y)

Unfortunately I only have v0.18, which doesn't accept user-defined kernel functions for LabelPropagation, so I can't test it.
EDIT2:

Checking your source for why you have such large gamma values makes me wonder if you are using gamma = D.min()/3, which would be incorrect. The definition is sigma = D.min()/3, while the definition of sigma in w is

w = exp(-d**2 / sigma**2)  # Equation (1)

which would make the correct gamma value 1/sigma**2, or 9/D.min()**2.