SparkException: Chi-square test expect factors
Question
I have a dataset containing 42 features and 1 label. I want to apply the chi-square selector from the Spark ML library as a feature-selection step before running a decision tree for anomaly detection, but I hit this error when applying the chi-square selector:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 45, localhost, executor driver): org.apache.spark.SparkException: Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 11.
Here is my source code:
from pyspark.ml.feature import ChiSqSelector
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",outputCol="features2", labelCol="label")
result = selector.fit(dfa1).transform(dfa1)
result.show()
Answer
As you can see in the error message, your features column contains more than 10000 distinct values in the vector, which suggests those values are continuous rather than categorical. ChiSqSelector can only handle up to 10,000 categories, and you cannot increase this limit:
/**
* Max number of categories when indexing labels and features
*/
private[spark] val maxCategories: Int = 10000
In this case you can use VectorIndexer with the .setMaxCategories() parameter (set below 10k) to prepare your data. You can try other methods to prepare the data, but ChiSqSelector will keep failing as long as any column in the vector has more than 10k distinct values.