Multivariate Outlier Removal With Mahalanobis Distance


Question


I have this data, which has outliers. How can I find the Mahalanobis distance and use it to remove the outliers?

Solution

Let me first state some general guidelines:

1. Practically speaking, if you have many features and few samples, the Mahalanobis algorithm tends to give misleading results (you can try it yourself), so the more features you have, the more samples you should provide.
2. The covariance matrix must be Symmetric and Positive Definite for the algorithm to work, so you should check this before proceeding.


As already mentioned, the Euclidean metric fails to find the correct distance because it measures ordinary straight-line distance. Thus, in a multi-dimensional space of variables, two points may appear to be equally far from the mean, yet one of them lies far away from the data cloud (i.e. it's an outlier).


The solution is the Mahalanobis Distance, which does something similar to feature scaling by using the eigenvectors of the variables instead of the original axes.

It applies the following formula:

$$D_M(x) = \sqrt{(x - m)^T \, S^{-1} \, (x - m)}$$

where:

      • x is the observation to find its distance;
      • m is the mean of the observations;
      • S is the Covariance Matrix.
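
As a quick numerical sanity check of the formula, here is a minimal sketch (the 2x2 covariance matrix and the two test points are made up for illustration and are not part of the original answer) showing that two points at the same Euclidean distance from the mean can have very different Mahalanobis distances:

import numpy as np

# made-up covariance matrix of two strongly correlated variables
S = np.array([[1.0, 0.9],
              [0.9, 1.0]])
S_inv = np.linalg.inv(S)

m = np.array([0.0, 0.0])       # mean of the observations
a = np.array([1.0,  1.0])      # lies along the data cloud
b = np.array([1.0, -1.0])      # lies against the correlation

# same Euclidean distance from the mean...
print(np.linalg.norm(a - m), np.linalg.norm(b - m))    # both ~1.414

# ...but very different Mahalanobis distances
print(np.sqrt((a - m) @ S_inv @ (a - m)))              # ~1.03
print(np.sqrt((b - m) @ S_inv @ (b - m)))              # ~4.47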

      Refresher:

The covariance represents the direction of the relationship between two variables (i.e. positive, negative, or zero), while its magnitude indicates how strongly one variable is related to changes in the other.
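
For example, a tiny sketch with np.cov (the toy vectors are made up) shows how the sign reflects the direction of the relationship:

import numpy as np

x     = np.array([1, 2, 3, 4, 5])
y_pos = np.array([2, 4, 6, 8, 10])   # moves with x     -> positive covariance
y_neg = np.array([10, 8, 6, 4, 2])   # moves against x  -> negative covariance

print(np.cov(x, y_pos)[0, 1])   #  5.0
print(np.cov(x, y_neg)[0, 1])   # -5.0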


      Implementation

      Consider this 6x3 dataset, in which each row represents a sample, and each column represents a feature of the given sample:
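
The table itself was posted as an image; as a stand-in so the snippets below have something concrete to run on, here is a small array assembled from rows of the 15x3 data generated in the full example further down (the last row is an obvious outlier):

import numpy as np

data = np.array([[ 12,   7,   9],
                 [  9,  16,   7],
                 [ 14,  11,  10],
                 [ 14,   5,   5],
                 [ 12,   8,   7],
                 [106, 116,  97]])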

      First, we need to create a Covariance Matrix of the features of each sample, and that's why we set the parameter rowvar to False in the numpy.cov function, so each column now represents a variable:

covariance_matrix = np.cov(data, rowvar=False)
# data here is the samples-by-features array shown above
      

      Next, we find the Inverse of the Covariance Matrix:

      inv_covariance_matrix = np.linalg.inv(covariance_matrix)
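
Note: if the covariance matrix is singular (e.g. linearly dependent features, or fewer samples than features), np.linalg.inv will raise an error. A common fallback, sketched here as an assumption rather than part of the original answer, is the Moore-Penrose pseudo-inverse:

try:
    inv_covariance_matrix = np.linalg.inv(covariance_matrix)
except np.linalg.LinAlgError:
    # pseudo-inverse as a fallback for singular covariance matrices
    inv_covariance_matrix = np.linalg.pinv(covariance_matrix)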
      

But before proceeding, we should check, as mentioned above, whether the matrix and its inverse are Symmetric and Positive Definite. We use the Cholesky decomposition for this, which, fortunately, is already implemented in numpy.linalg.cholesky:

      def is_pos_def(A):
          if np.allclose(A, A.T):
              try:
                  np.linalg.cholesky(A)
                  return True
              except np.linalg.LinAlgError:
                  return False
          else:
              return False 
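
A short usage sketch (not from the original answer) showing how this check would gate the rest of the computation:

covariance_matrix = np.cov(data, rowvar=False)
if not is_pos_def(covariance_matrix):
    raise ValueError("covariance matrix is not symmetric positive definite")

inv_covariance_matrix = np.linalg.inv(covariance_matrix)
if not is_pos_def(inv_covariance_matrix):
    raise ValueError("inverse covariance matrix is not symmetric positive definite")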
      

Then, we find the mean m of the variables along each feature (or, shall I say, dimension) and save it in an array like this:

      vars_mean = []
      for i in range(data.shape[0]):
          vars_mean.append(list(data.mean(axis=0)))  
          # axis=0 means each column in the 2D array
      

Note that I repeated the mean for each row just to take advantage of matrix subtraction, as shown next.
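
Strictly speaking, the repetition isn't required: numpy broadcasting subtracts a 1-D mean vector from every row automatically, so an equivalent shorter variant would be:

vars_mean = data.mean(axis=0)   # shape: (number_of_features,)
diff = data - vars_mean         # numpy broadcasts the mean across all rows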

Next, we find x - m (i.e. the difference), but since we already have the vectorized vars_mean, all we need to do is:

      diff = data - vars_mean
      # here we subtract the mean of feature
      # from each feature of each example
      

      Finally, apply the formula like this:

      md = []
      for i in range(len(diff)):
          md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i]))) 
      

Note the following:

      • The dimension of the inverse of the covariance matrix is: number_of_features x number_of_features
      • The dimension of the diff matrix is similar to the original data matrix: number_of_examples x number_of_features
      • Thus, each diff[i] (i.e. row) is 1 x number_of_features.
• So, according to the matrix multiplication rule, the resulting matrix from diff[i].dot(inv_covariance_matrix) will be 1 x number_of_features; and when we multiply it again by diff[i], numpy automatically treats the latter as a column matrix (i.e. number_of_features x 1), so the final result is a single value (i.e. no transpose needed); see the vectorized sketch below.
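
Given these shapes, the whole loop can also be collapsed into a single expression; here is an equivalent sketch using np.einsum (same variable names as above):

# row-wise computation of sqrt(diff[i] . S^-1 . diff[i]) for all rows at once
md = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_covariance_matrix, diff))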

In order to detect outliers, we should specify a threshold. One popular way is to multiply the mean of the Mahalanobis distances by an extremeness degree (2.0, or 3.0 for very extreme values). Here we instead flag any point whose distance lies more than k standard deviations away from the mean, with k = 2.0 for extreme values and 3.0 for very extreme values, following the 68–95–99.7 rule.


Putting It All Together

      import numpy as np
      
      
      def create_data(examples=50, features=5, upper_bound=10, outliers_fraction=0.1, extreme=False):
          '''
          This method for testing (i.e. to generate a 2D array of data)
          '''
          data = []
          magnitude = 4 if extreme else 3
          for i in range(examples):
              if (examples - i) <= round((float(examples) * outliers_fraction)):
                  data.append(np.random.poisson(upper_bound ** magnitude, features).tolist())
              else:
                  data.append(np.random.poisson(upper_bound, features).tolist())
          return np.array(data)
      
      
      def MahalanobisDist(data, verbose=False):
          covariance_matrix = np.cov(data, rowvar=False)
          if is_pos_def(covariance_matrix):
              inv_covariance_matrix = np.linalg.inv(covariance_matrix)
              if is_pos_def(inv_covariance_matrix):
                  vars_mean = []
                  for i in range(data.shape[0]):
                      vars_mean.append(list(data.mean(axis=0)))
                  diff = data - vars_mean
                  md = []
                  for i in range(len(diff)):
                      md.append(np.sqrt(diff[i].dot(inv_covariance_matrix).dot(diff[i])))
      
                  if verbose:
                      print("Covariance Matrix:\n {}\n".format(covariance_matrix))
                      print("Inverse of Covariance Matrix:\n {}\n".format(inv_covariance_matrix))
                      print("Variables Mean Vector:\n {}\n".format(vars_mean))
                      print("Variables - Variables Mean Vector:\n {}\n".format(diff))
                      print("Mahalanobis Distance:\n {}\n".format(md))
                  return md
              else:
                  print("Error: Inverse of Covariance Matrix is not positive definite!")
          else:
              print("Error: Covariance Matrix is not positive definite!")
      
      
      def MD_detectOutliers(data, extreme=False, verbose=False):
          MD = MahalanobisDist(data, verbose)
          # one popular way to specify the threshold
          #m = np.mean(MD)
          #t = 3. * m if extreme else 2. * m
          #outliers = []
          #for i in range(len(MD)):
          #    if MD[i] > t:
          #        outliers.append(i)  # index of the outlier
          #return np.array(outliers)
      
          # or according to the 68–95–99.7 rule
          std = np.std(MD)
          k = 3. * std if extreme else 2. * std
          m = np.mean(MD)
          up_t = m + k
          low_t = m - k
          outliers = []
          for i in range(len(MD)):
              if (MD[i] >= up_t) or (MD[i] <= low_t):
                  outliers.append(i)  # index of the outlier
          return np.array(outliers)
      
      
      def is_pos_def(A):
          if np.allclose(A, A.T):
              try:
                  np.linalg.cholesky(A)
                  return True
              except np.linalg.LinAlgError:
                  return False
          else:
              return False
      
      
      data = create_data(15, 3, 10, 0.1)
      print("data:\n {}\n".format(data))
      
      outliers_indices = MD_detectOutliers(data, verbose=True)
      
      print("Outliers Indices: {}\n".format(outliers_indices))
      print("Outliers:")
      for ii in outliers_indices:
          print(data[ii])
      


      Result

      data:
       [[ 12   7   9]
       [  9  16   7]
       [ 14  11  10]
       [ 14   5   5]
       [ 12   8   7]
       [  8   8  10]
       [  9  14   8]
       [ 12  12  10]
       [ 18  10   6]
       [  6  12  11]
       [  4  12  15]
       [  5  13  10]
       [  8   9   8]
       [106 116  97]
       [ 90 116 114]]
      
      Covariance Matrix:
       [[ 980.17142857 1143.62857143 1035.6       ]
       [1143.62857143 1385.11428571 1263.12857143]
       [1035.6        1263.12857143 1170.74285714]]
      
      Inverse of Covariance Matrix:
       [[ 0.03021777 -0.03563241  0.0117146 ]
       [-0.03563241  0.08684092 -0.06217448]
       [ 0.0117146  -0.06217448  0.05757261]]
      
      Variables Mean Vector:
       [[21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8], [21.8, 24.6, 21.8]]
      
      Variables - Variables Mean Vector:
       [[ -9.8 -17.6 -12.8]
       [-12.8  -8.6 -14.8]
       [ -7.8 -13.6 -11.8]
       [ -7.8 -19.6 -16.8]
       [ -9.8 -16.6 -14.8]
       [-13.8 -16.6 -11.8]
       [-12.8 -10.6 -13.8]
       [ -9.8 -12.6 -11.8]
       [ -3.8 -14.6 -15.8]
       [-15.8 -12.6 -10.8]
       [-17.8 -12.6  -6.8]
       [-16.8 -11.6 -11.8]
       [-13.8 -15.6 -13.8]
       [ 84.2  91.4  75.2]
       [ 68.2  91.4  92.2]]
      
      Mahalanobis Distance:
       [1.3669401667524865, 2.1796331318432967, 0.7470525416547134, 1.6364973119931507, 0.8351423113609481, 0.9128858131134882, 1.397144258271586, 0.35603382066414996, 1.4449501739129382, 0.9668775289588046, 1.490503433100514, 1.4021488309805878, 0.4500345257064412, 3.239353067840299, 3.260149280200771]
      
      Outliers Indices: [13 14]
      
      Outliers:
      [106 116  97]
      [ 90 116 114]
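
Finally, since the goal is to remove the outliers rather than just detect them, the returned indices can simply be dropped from the array, for example:

clean_data = np.delete(data, outliers_indices, axis=0)
print(clean_data.shape)   # (13, 3): the two outlier rows are gone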
      
