马氏距离的多元离群值去除 [英] Multivariate Outlier Removal With Mahalanobis Distance
问题描述
我有此异常值的数据.我如何找到马哈拉诺比斯人的分歧并使用它来消除异常值.
首先让我提出一些一般准则:
- 实际上,如果您具有很多功能并且样本较少,那么Mahalanobis算法往往会产生误导性的结果(您可以自己尝试),因此,您拥有的功能越多,应提供的样本就越多.
- 协方差矩阵必须为
解决方案是
其中:
-
x
是观测,用于查找其距离; -
m
是观测值的平均值; -
S
是首先,我们需要创建每个样本的 功能 的协方差矩阵,这就是为什么将参数
rowvar
设置为<covariance_matrix = np.cov(数据,rowvar = False)#此处的数据类似于上表# 在这张图中
接下来,我们找到协方差矩阵的 逆 :
inv_covariance_matrix = np.linalg.inv(covariance_matrix)
但是,在继续操作之前,如上所述,我们应该检查矩阵及其逆数是否为 对称 和 正定.我们为此
请注意,我重复每一行只是为了利用矩阵减法,如下所示.
接下来,我们找到
x-m
(即微分),但是由于我们已经有了矢量化的vars_mean
,所以我们要做的就是:diff =数据-vars_mean#这里我们减去特征的均值每个示例的每个功能中的#
最后,应用如下公式:
md = []对于范围内的 i(len(diff)):md.append(np.sqrt(diff [i] .dot(inv_covariance_matrix).dot(diff [i]))))
请注意以下几点:
- 协方差矩阵的逆维数为:
number_of_features x number_of_features
-
diff
矩阵的维类似于原始数据矩阵:number_of_examples x number_of_features
- 因此,每个
diff [i]
(即行)为1 x number_of_features
. - 因此,根据矩阵乘法规则,从
diff [i] .dot(inv_covariance_matrix)
生成的矩阵将为1 x number_of_features
;当我们再次乘以diff [i]
时;numpy
自动将后者视为列矩阵(即number_of_features x 1
);所以最终结果将成为一个单一的值(即不需要转置).
为了检测异常值,我们应该指定一个
阈值
;我们通过将马氏距离结果的均值乘以极端度k
来做到这一点;其中k = 2.0 * std
表示极值,而3.0 * std
表示极值;而这是根据全部放在一起
将numpy导入为npdef create_data(示例数= 50,要素数5,上限边界数10,outliers_fraction = 0.1,extremes = False):'''这种测试方法(即生成2D数据数组)'''数据= []数量级= 4,如果极端则为3对于 i 在范围内(示例):if (examples - i) <= round((float(examples) * outliers_fraction)):data.append(np.random.poisson(上限**幅度,特征).tolist())别的:data.append(np.random.poisson(upper_bound,features).tolist())返回np.array(data)def MahalanobisDist(数据,详细= False):covariance_matrix = np.cov(数据,rowvar = False)如果is_pos_def(covariance_matrix):inv_covariance_matrix = np.linalg.inv(covariance_matrix)如果is_pos_def(inv_covariance_matrix):vars_mean = []对于范围内的i(data.shape [0]):vars_mean.append(list(data.mean(axis = 0)))差异=数据-vars_meanmd = []对于我在范围内(len(diff)):md.append(np.sqrt(diff [i] .dot(inv_covariance_matrix).dot(diff [i]))))如果冗长:print(协方差矩阵:\ n {} \ n" .format(covariance_matrix))print(协方差矩阵的逆:\ n {} \ n" .format(inv_covariance_matrix))print(变量均值向量:\ n {} \ n" .format(vars_mean))print(变量-变量均值向量:\ n {} \ n" .format(diff))print("Mahalanobis距离:\ n {} \ n" .format(md))返回md别的:print(错误:协方差矩阵的逆不是正定的!")别的:print(错误:协方差矩阵不是正定的!")def MD_detectOutliers(数据,极端=假,详细=假):MD = MahalanobisDist(数据,详细)#一种指定阈值的流行方法#m = np.mean(MD)#t =3.* m,如果极端则为2.* m#outliers = []#for我在范围内(len(MD)):#如果MD [i]>t:#outliers.append(i)#离群值的索引#return np.array(离群值)#或根据68–95–99.7规则标准= np.std(MD)k = 3. * std,如果极端,则为2.* stdm = np.mean(MD)up_t = m + klow_t = m-k离群值= []对于范围内的我(len(MD)):如果(MD [i]> = up_t)或(MD [i]< = low_t):outliers.append(i)#离群值的索引返回np.array(异常值)def is_pos_def(A):如果np.allclose(A,A.T):尝试:np.linalg.cholsky(A)返回True除了np.linalg.LinAlgError:返回False别的:返回False数据= create_data(15,3,10,0.1)打印(数据:\n {}\n".格式(数据))outliers_indices = MD_detectOutliers(data,verbose = True)print(离群值索引:{} \ n" .format(outliers_indices))打印(离群值:")对于ii中的outliers_indices:打印(数据[ii])
结果
数据:[[12 7 9][9 16 7][14 11 10][14 5 5][12 8 7][8 8 10][9 14 8][12 12 10][18 10 6][6 12 11][4 12 15][5 13 10][8 9 8][106 116 97][90 116 114]]协方差矩阵:[[980.17142857 1143.62857143 1035.6][1143.62857143 1385.11428571 1263.12857143][1035.6 1263.12857143 1170.74285714]]协方差矩阵的逆:[[0.03021777 -0.03563241 0.0117146][-0.03563241 0.08684092 -0.06217448][0.0117146 -0.06217448 0.05757261]]变量均值向量:[[21.8,24.6,21.8],[21.8,24.6,21.8],[21.8,24.6,21.8],[21.8,24.6,21.8],[21.8,24.6,21.8],[21.8,24.6,21.8],[21.8、24.6、21.8],[21.8、24.6、21.8],[21.8、24.6、21.8],[21.8、24.6、21.8],[21.8、24.6、21.8],[21.8、24.6、21.8],[21.8,24.6、21.8],[21.8、24.6、21.8],[21.8、24.6、21.8]]变量-变量均值向量:[[-9.8 -17.6 -12.8][-12.8 -8.6 -14.8][-7.8 -13.6 -11.8][-7.8 -19.6 -16.8][-9.8 -16.6 -14.8][-13.8 -16.6 -11.8][-12.8 -10.6 -13.8][-9.8 -12.6 -11.8][-3.8 -14.6 -15.8][-15.8 -12.6 -10.8][-17.8 -12.6 -6.8][-16.8 -11.6 -11.8][-13.8 -15.6 -13.8][84.2 91.4 75.2][68.2 91.4 92.2]]马氏距离:[1.3669401667524865、2.1796331318432967、0.7470525416547134、1.6364973119931507、0.8351423113609481、0.9128858131134882、1.397144258271586、0.35603382066414996、1.4449501739129382、0.9668775289588046、1.490503433100514、1.4021488309805878、0.4500345257064412异常值指数:[13 14]离群值:[106 116 97][90 116 114]
I have this data which have outlier . How can i find Mahalanobis disantance and use it to remove outlier.
解决方案Let me first put some general guidelines:
- Practically speaking, if you have a lot of features and lesser samples, Mahalanobis algorithm tends to give misleading results (you can try it yourself), so the more features you have, the more samples you should provide.
- The covariance matrix must be Symmetric and Positive Definite to make the algorithm works, so you should check before proceeding.
As it's already mentioned, Euclidean Metric fails to find the correct distance because it tries to get ordinary straight-line distance. Thus, if we have multi-dimensional space of variables, two points may look to have the same distance from the Mean, yet one of them is far away from the data cloud (i.e. it's an outlier).
The solution is Mahalanobis Distance which makes something similar to the feature scaling via taking the Eigenvectors of the variables instead of the original axis.
It applies the following formula:
where:
x
is the observation to find its distance;m
is the mean of the observations;S
is the Covariance Matrix.
Refresher:
The Covariance represents the direction of the relationship between two variables (i.e. positive, negative or zero), so it shows the strength of how one variable is related to the changes of the others.
Implementation
Consider this 6x3 dataset, in which each row represents a sample, and each column represents a feature of the given sample:
First, we need to create a Covariance Matrix of the features of each sample, and that's why we set the parameter
rowvar
toFalse
in the numpy.cov function, so each column now represents a variable:covariance_matrix = np.cov(data, rowvar=False) # data here looks similar to the above table # in the picture
Next, we find the Inverse of the Covariance Matrix:
inv_covariance_matrix = np.linalg.inv(covariance_matrix)
But before proceeding, we should check, as mentioned above, if the matrix and its inverse are Symmetric and Positive Definite. We use for this Cholesky Decomposition Algorithm, which, fortunately, is already implemented in numpy.linalg.cholesky:
def is_pos_def(A): if np.allclose(A, A.T): try: np.linalg.cholesky(A) return True except np.linalg.LinAlgError: return False else: return False
Then, we find the mean
m
of the variables on each feature (shall I say dimension) and save them in an array like this:vars_mean = [] for i in range(data.shape[0]): vars_mean.append(list(data.mean(axis=0))) # axis=0 means each column in the 2D array
Note that I repeated each row just to avail of matrix subtraction as will be shown next.
Next, we find
x - m
(i.e. the differential), but since we already have the vectorizedvars_mean
, all we need to do is:diff = data - vars_mean # here we subtract the mean of feature # from each feature of each example
Finally, apply the formula like this:
md = [] for i in range(len(diff)): md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i])))
Note the followings:
- The dimension of the inverse of the covariance matrix is:
number_of_features x number_of_features
- The dimension of the
diff
matrix is similar to the original data matrix:number_of_examples x number_of_features
- Thus, each
diff[i]
(i.e. row) is1 x number_of_features
. - So according to the Matrix Multiplication rule, the resulted matrix from
diff[i].dot(inv_covariance_matrix)
will be1 x number_of_features
; and when we multiply again bydiff[i]
;numpy
automatically considers the latter as a column matrix (i.e.number_of_features x 1
); so the final result will become a single value (i.e. no need for transpose).
In order to detect outliers, we should specify a
threshold
; we do so by multiplying the Mean of the Mahalanobis Distance Results by the Extremeness Degreek
; wherek = 2.0 * std
for extreme values, and3.0 * std
for the very extreme values; and that's according to the 68–95–99.7 rule (image for illustration from the same link):
Putting All Together
import numpy as np def create_data(examples=50, features=5, upper_bound=10, outliers_fraction=0.1, extreme=False): ''' This method for testing (i.e. to generate a 2D array of data) ''' data = [] magnitude = 4 if extreme else 3 for i in range(examples): if (examples - i) <= round((float(examples) * outliers_fraction)): data.append(np.random.poisson(upper_bound ** magnitude, features).tolist()) else: data.append(np.random.poisson(upper_bound, features).tolist()) return np.array(data) def MahalanobisDist(data, verbose=False): covariance_matrix = np.cov(data, rowvar=False) if is_pos_def(covariance_matrix): inv_covariance_matrix = np.linalg.inv(covariance_matrix) if is_pos_def(inv_covariance_matrix): vars_mean = [] for i in range(data.shape[0]): vars_mean.append(list(data.mean(axis=0))) diff = data - vars_mean md = [] for i in range(len(diff)): md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i]))) if verbose: print("Covariance Matrix:\n {}\n".format(covariance_matrix)) print("Inverse of Covariance Matrix:\n {}\n".format(inv_covariance_matrix)) print("Variables Mean Vector:\n {}\n".format(vars_mean)) print("Variables - Variables Mean Vector:\n {}\n".format(diff)) print("Mahalanobis Distance:\n {}\n".format(md)) return md else: print("Error: Inverse of Covariance Matrix is not positive definite!") else: print("Error: Covariance Matrix is not positive definite!") def MD_detectOutliers(data, extreme=False, verbose=False): MD = MahalanobisDist(data, verbose) # one popular way to specify the threshold #m = np.mean(MD) #t = 3. * m if extreme else 2. * m #outliers = [] #for i in range(len(MD)): # if MD[i] > t: # outliers.append(i) # index of the outlier #return np.array(outliers) # or according to the 68–95–99.7 rule std = np.std(MD) k = 3. * std if extreme else 2. * std m = np.mean(MD) up_t = m + k low_t = m - k outliers = [] for i in range(len(MD)): if (MD[i] >= up_t) or (MD[i] <= low_t): outliers.append(i) # index of the outlier return np.array(outliers) def is_pos_def(A): if np.allclose(A, A.T): try: np.linalg.cholesky(A) return True except np.linalg.LinAlgError: return False else: return False data = create_data(15, 3, 10, 0.1) print("data:\n {}\n".format(data)) outliers_indices = MD_detectOutliers(data, verbose=True) print("Outliers Indices: {}\n".format(outliers_indices)) print("Outliers:") for ii in outliers_indices: print(data[ii])
Result
data: [[ 12 7 9] [ 9 16 7] [ 14 11 10] [ 14 5 5] [ 12 8 7] [ 8 8 10] [ 9 14 8] [ 12 12 10] [ 18 10 6] [ 6 12 11] [ 4 12 15] [ 5 13 10] [ 8 9 8] [106 116 97] [ 90 116 114]] Covariance Matrix: [[ 980.17142857 1143.62857143 1035.6 ] [1143.62857143 1385.11428571 1263.12857143] [1035.6 1263.12857143 1170.74285714]] Inverse of Covariance Matrix: [[ 0.03021777 -0.03563241 0.0117146 ] [-0.03563241 0.08684092 -0.06217448] [ 0.0117146 -0.06217448 0.05757261]] Variables Mean Vector: [[21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8]] Variables - Variables Mean Vector: [[ -9.8 -17.6 -12.8] [-12.8 -8.6 -14.8] [ -7.8 -13.6 -11.8] [ -7.8 -19.6 -16.8] [ -9.8 -16.6 -14.8] [-13.8 -16.6 -11.8] [-12.8 -10.6 -13.8] [ -9.8 -12.6 -11.8] [ -3.8 -14.6 -15.8] [-15.8 -12.6 -10.8] [-17.8 -12.6 -6.8] [-16.8 -11.6 -11.8] [-13.8 -15.6 -13.8] [ 84.2 91.4 75.2] [ 68.2 91.4 92.2]] Mahalanobis Distance: [1.3669401667524865, 2.1796331318432967, 0.7470525416547134, 1.6364973119931507, 0.8351423113609481, 0.9128858131134882, 1.397144258271586, 0.35603382066414996, 1.4449501739129382, 0.9668775289588046, 1.490503433100514, 1.4021488309805878, 0.4500345257064412, 3.239353067840299, 3.260149280200771] Outliers Indices: [13 14] Outliers: [106 116 97] [ 90 116 114]
这篇关于马氏距离的多元离群值去除的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
- 协方差矩阵的逆维数为:
-