Calculating standard error of estimate, Wald Chi-Square statistic, p-value with logistic regression in Spark


Question

I was trying to build a logistic regression model on some sample data.

The output we can get from the model is the set of weights for the features used to build it.

I could not find a Spark API for the standard error of estimate, Wald Chi-Square statistic, p-value, etc.

I am pasting my code below as an example:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}


    val sc = new SparkContext(new SparkConf().setAppName("SparkTest").setMaster("local[*]"))

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val data: RDD[String] = sc.textFile("C:/Users/user/Documents/spark-1.5.1-bin-hadoop2.4/data/mllib/credit_approval_2_attr.csv")


    val parsedData = data.map { line =>
      // First column is the label; the remaining columns are the features
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    //Splitting the data
    val splits: Array[RDD[LabeledPoint]] = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
    val training: RDD[LabeledPoint] = splits(0).cache()
    val test: RDD[LabeledPoint] = splits(1)



    // Run training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)
    // Clear the prediction threshold so the model will return probabilities
    model.clearThreshold
    print(model.weights)

The model weights output is:

[-0.03335987643613915,0.025215092730373874,0.22617842810253946,0.29415985532104943,-0.0025559467210279694,4.5242237280512646E-4]

It is just an array of weights.

I was, however, able to calculate precision, recall, accuracy, sensitivity and other model diagnostics.

Is there a way I can calculate the standard error of estimate, the Wald Chi-Square statistic, and p-values in Spark?

I am concerned since these are standard output in R or SAS.

Does this have something to do with the optimization method we are using in Spark?

Here we use L-BFGS or SGD.

Maybe I am not aware of the evaluation methodology.

Any suggestions would be greatly appreciated.

Answer

The following method will provide the details of a chi-squared test:

Statistics.chiSqTest(data)

Input data:

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult

// Labeled observations: each feature column is tested for independence against the label
val obs: RDD[LabeledPoint] = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
  LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5))
))

val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)

It returns an array containing a ChiSqTestResult for every feature against the label.
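For instance, each result in the array can be inspected by printing it; this is a minimal sketch reusing the featureTestResults array from above:

featureTestResults.zipWithIndex.foreach { case (result, col) =>
  // Each ChiSqTestResult prints its full test summary when converted to a string
  println(s"Feature column ${col + 1}:")
  println(result)
}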

The test summary includes the p-value, degrees of freedom, test statistic, the method used, and the null hypothesis.
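If the coefficient standard errors and p-values themselves are needed (rather than a per-feature chi-squared test), newer Spark versions expose them through spark.ml. The following is a minimal sketch, assuming Spark 2.0+ and that a DataFrame trainingDF with "label" and "features" (org.apache.spark.ml.linalg.Vector) columns has been built from the same data: GeneralizedLinearRegression with a binomial family fits a logistic regression, and its training summary carries the standard errors, t-values, and p-values.

import org.apache.spark.ml.regression.GeneralizedLinearRegression

// trainingDF is an assumed DataFrame with "label" and "features" columns
val glr = new GeneralizedLinearRegression()
  .setFamily("binomial")   // binomial family + logit link = logistic regression
  .setLink("logit")

val glrModel = glr.fit(trainingDF)
val glrSummary = glrModel.summary

// Standard errors, t-values and p-values of the estimated coefficients
// (the intercept's entry is included when an intercept is fit)
println(glrSummary.coefficientStandardErrors.mkString(", "))
println(glrSummary.tValues.mkString(", "))
println(glrSummary.pValues.mkString(", "))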

