Anomaly detection with PCA in Spark


Question

The article states the following:

• The PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.

• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.

• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the center of the transformed coordinate system.
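
For reference, the Mahalanobis distance referred to above is commonly defined as (with m the mean of all readings and S their covariance matrix):

D(x) = sqrt( (x - m)^T * S^(-1) * (x - m) )

Readings with a large D(x) lie far from the center of the transformed coordinate system and are flagged as anomalies.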

Can anyone describe anomaly detection with PCA (using PCA scores and the Mahalanobis distance) in more detail? I'm confused because the definition of PCA is: "PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables." How can the Mahalanobis distance be used when there is no longer any correlation between the variables?

Can anybody explain how to do this in Spark? Does the pca.transform function return the scores from which I should calculate the Mahalanobis distance of every reading to the center?

Answer

Let's assume you have a dataset of 3-dimensional points. Each point has coordinates (x, y, z); those (x, y, z) are the dimensions. A point is represented by three values, e.g. (8, 7, 4). This is called the input vector.

When you apply the PCA algorithm you basically transform each input vector into a new vector. It can be represented as a function that turns (x, y, z) => (v, w).

Example: (8, 7, 4) => (-4, 13)
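
Here is a rough, self-contained sketch of just this projection step with the Spark ML API (the data, column names and k = 2 are made-up assumptions). Note that pca.transform returns the reduced (v, w) coordinates, not an anomaly score:

import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pca-projection-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical 3-D readings
val raw = Seq((8.0, 7.0, 4.0), (1.0, 2.0, 3.0), (2.0, 1.0, 3.0), (3.0, 2.0, 1.0)).toDF("x", "y", "z")
val assembled = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector").transform(raw)

// keep k = 2 principal components: (x, y, z) => (v, w)
val pcaModel = new PCA().setInputCol("vector").setOutputCol("pca-features").setK(2).fit(assembled)

println(pcaModel.pc)                // 3 x 2 matrix whose columns are the principal components
pcaModel.transform(assembled)
  .select("pca-features")           // each (x, y, z) row is now a 2-dimensional vector (v, w)
  .show(false)

The distance calculation on top of these coordinates is what turns them into an anomaly score, as described next.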

Now you have a shorter vector (you reduced the number of dimensions), but each point still has coordinates, namely (v, w). This means that you can compute the distance between a point and the mean coordinate using the Mahalanobis measure. Points that lie far from the mean coordinate are in fact anomalies.
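
As a minimal numeric sketch of just this scoring step (plain Breeze with made-up (v, w) values, independent of the Spark pipeline below):

import breeze.linalg.{DenseMatrix, DenseVector, inv}

// hypothetical PCA-reduced readings (v, w); the last one is far from the rest
val points = DenseMatrix(
  (-1.0,  0.5),
  (-0.8,  0.3),
  ( 1.1, -0.4),
  ( 0.9, -0.2),
  (12.0,  6.0)
)

val n = points.rows
// mean of every column: the "center" of the reduced coordinate system
val mu = DenseVector((0 until points.cols).map(j => breeze.stats.mean(points(::, j))).toArray)
// sample covariance of the reduced readings, then its inverse
val centered = DenseMatrix.tabulate(n, points.cols)((i, j) => points(i, j) - mu(j))
val covInv = inv((centered.t * centered) * (1.0 / (n - 1)))

// Mahalanobis distance of each reading from the mean; the last one stands out
val distances = (0 until n).map { i =>
  val d = points(i, ::).t - mu
  math.sqrt(d.t * covInv * d)
}
distances.foreach(println)

The example solution below follows the same idea at scale: it standardizes the readings (so the mean is already at the origin), projects them with PCA, and then computes the distance inside a Spark UDF.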

Example solution:

import breeze.linalg.{DenseVector, inv}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.{Matrix, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._

object SparkApp extends App {
  val session = SparkSession.builder()
    .appName("spark-app").master("local[*]").getOrCreate()
  session.sparkContext.setLogLevel("ERROR")
  import session.implicits._

  // Toy dataset: four "normal" readings plus one obvious outlier.
  val df = Seq(
    (1, 4, 0),
    (3, 4, 0),
    (1, 3, 0),
    (3, 3, 0),
    (67, 37, 0) // outlier
  ).toDF("x", "y", "z")

  // Assemble (x, y, z) into a single feature vector.
  val vectorAssembler = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector")

  // Center and scale the features. With setWithMean(true) the data becomes zero-mean,
  // so the center of the transformed coordinate system ends up at the origin.
  val standardScalar = new StandardScaler().setInputCol("vector").setOutputCol("normalized-vector").setWithMean(true)
    .setWithStd(true)

  // Project the 3-D readings onto the first two principal components: (x, y, z) => (v, w).
  val pca = new PCA().setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)

  val pipeline = new Pipeline().setStages(
    Array(vectorAssembler, standardScalar, pca)
  )

  val pcaDF = pipeline.fit(df).transform(df)

  // Adds a column with the squared Mahalanobis distance of each reading from the origin.
  // Because the readings were centered before PCA, the origin is the mean of all readings.
  def withMahalanobois(df: DataFrame, inputCol: String): DataFrame = {
    // Pearson correlation matrix of the PCA features, used here as the covariance estimate.
    val Row(coeff1: Matrix) = Correlation.corr(df, inputCol).head

    // Invert it with Breeze (2 x 2 because k = 2 components were kept).
    val invCovariance = inv(new breeze.linalg.DenseMatrix(2, 2, coeff1.toArray))

    val mahalanobois = udf[Double, Vector] { v =>
      val vB = DenseVector(v.toArray)
      // v^T * S^-1 * v: large values mean the reading is far from the center, i.e. an anomaly.
      vB.t * invCovariance * vB
    }

    df.withColumn("mahalanobois", mahalanobois(df(inputCol)))
  }

  // The outlier (67, 37, 0) gets by far the largest "mahalanobois" value.
  val scoredDF: DataFrame = withMahalanobois(pcaDF, "pca-features")

  session.close()
}

