Spark中PCA的异常检测 [英] Anomaly detection with PCA in Spark

查看:497
本文介绍了Spark中PCA的异常检测的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

文章中的内容如下:

•PCA算法基本上将数据读数从现有坐标系转换为新坐标系.

• PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.

•数据读数越接近新坐标系的中心,这些读数就越接近最佳值.

• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.

•异常分数是使用读数与所有读数平均值之间的马氏距离计算的,该距离是变换后的坐标系的中心.

• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the center of the transformed coordinate system.

有人能详细介绍一下使用PCA进行异常检测的情况(使用PCA分数和马氏距离)吗?我很困惑,因为PCA的定义是:PCA是一种统计过程,它使用正交变换将一组可能相关变量的观测值转换为一组线性不相关变量的值".当变量之间没有更多的相关性时,如何使用马氏距离?

Can anyone describe me more in detail about anomaly detection using PCA (using PCA scores and Mahalanobis distance)? I'm confused because the definition of PCA is: PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables". How to use Mahalanobis distance when there is no more correlation between the variables?

有人可以向我解释如何在Spark中执行此操作吗? pca.transform函数是否返回分数,我应在该分数中计算每次读取到中心的马氏距离?

Can anybody explain me how to do this in Spark? Does the pca.transform function returns the score where i should calculate the Mahalanobis distance for every reading to the center?

推荐答案

让我们假设您具有3维点的数据集. 每个点的坐标为(x, y, z). 这些(x, y, z)是尺寸. 由三个值e表示的点. G. (8, 7, 4).它称为输入向量.

Lets assume you have a dataset of 3-dimensional points. Each point has coordinates (x, y, z). Those (x, y, z) are dimensions. Point represented by three values e. g. (8, 7, 4). It called input vector.

在应用PCA算法时,您基本上将输入向量转换为新向量.它可以表示为将(x, y, z) => (v, w).

When you applying PCA algorithm you basically transform your input vector to new vector. It can be represented as function that turns (x, y, z) => (v, w).

示例:(8, 7, 4) => (-4, 13)

现在,您收到了一个较短的矢量(减小了尺寸的nr.),但是您的点仍然具有坐标,即(v, w).这意味着您可以使用马氏距离测量来计算两点之间的距离.实际上,与平均坐标相距较远的点是异常.

Now you received a vector, shorter one (you reduced an nr. of dimension), but your point still has coordinates, namely (v, w). This means that you can compute the distance between two points using Mahalanobis measure. Points that have a long distance from a mean coordinate are in fact anomalies.

示例解决方案:

import breeze.linalg.{DenseVector, inv}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.{Matrix, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._

object SparkApp extends App {
  val session = SparkSession.builder()
    .appName("spark-app").master("local[*]").getOrCreate()
  session.sparkContext.setLogLevel("ERROR")
  import session.implicits._

  val df = Seq(
    (1, 4, 0),
    (3, 4, 0),
    (1, 3, 0),
    (3, 3, 0),
    (67, 37, 0) //outlier
  ).toDF("x", "y", "z")
  val vectorAssembler = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector")
  val standardScalar = new StandardScaler().setInputCol("vector").setOutputCol("normalized-vector").setWithMean(true)
    .setWithStd(true)

  val pca = new PCA().setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)

  val pipeline = new Pipeline().setStages(
    Array(vectorAssembler, standardScalar, pca)
  )

  val pcaDF = pipeline.fit(df).transform(df)

  def withMahalanobois(df: DataFrame, inputCol: String): DataFrame = {
    val Row(coeff1: Matrix) = Correlation.corr(df, inputCol).head

    val invCovariance = inv(new breeze.linalg.DenseMatrix(2, 2, coeff1.toArray))

    val mahalanobois = udf[Double, Vector] { v =>
      val vB = DenseVector(v.toArray)
      vB.t * invCovariance * vB
    }

    df.withColumn("mahalanobois", mahalanobois(df(inputCol)))
  }

  val withMahalanobois: DataFrame = withMahalanobois(pcaDF, "pca-features")

  session.close()
}

这篇关于Spark中PCA的异常检测的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆