Dealing with unbalanced datasets in Spark MLlib


Question

I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems using Spark's MLlib.

I'm using MLLib's Random Forest implementation and already tried the simplest approach of randomly undersampling the larger class but it didn't work as well as I expected.
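The undersampling step mentioned here can be sketched in plain Scala (the labels, the 80/20 split, and the assumption that label 1 is the majority class are illustrative; on a Spark DataFrame the same idea would typically be expressed with `sample` or `sampleBy`):

```scala
import scala.util.Random

// Randomly undersample the majority class so both classes have equal counts.
// `data` is (feature, label); label 1 is assumed to be the majority here.
def undersample(data: Seq[(Double, Int)], seed: Long = 42L): Seq[(Double, Int)] = {
  val (majority, minority) = data.partition(_._2 == 1)
  val rng = new Random(seed)
  rng.shuffle(majority).take(minority.size) ++ minority
}

val data = Seq.tabulate(100)(i => (i.toDouble, if (i < 80) 1 else 0)) // 80/20 split
val balanced = undersample(data) // 20 positives and 20 negatives
```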

I would appreciate any feedback regarding your experience with similar issues.

Thanks

Answer

Class weight with Spark ML

As of this writing, class weighting for the Random Forest algorithm is still under development (see here).

But if you're willing to try other classifiers, this functionality has already been added to Logistic Regression.

Consider a case where we have 80% positives (label == 1) in the dataset, so theoretically we want to "under-sample" the positive class. The logistic loss objective function should treat the negative class (label == 0) with higher weight.
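With those numbers, the weighting works out as in this small worked sketch (the concrete counts are illustrative; the variable names mirror the `balanceDataset` function shown in this answer):

```scala
val datasetSize = 1000L
val numNegatives = 200L // 80% positives (label == 1), 20% negatives (label == 0)

val balancingRatio = (datasetSize - numNegatives).toDouble / datasetSize // 0.8

val negativeWeight = balancingRatio       // 0.8 -- minority class, larger weight
val positiveWeight = 1.0 - balancingRatio // 0.2 -- majority class, smaller weight

// The total weight mass contributed by each class ends up equal:
// 200 * 0.8 and 800 * 0.2 are both ~160
val negMass = numNegatives * negativeWeight
val posMass = (datasetSize - numNegatives) * positiveWeight
```

This is why the logistic loss stops being dominated by the majority class: each class contributes the same total weight to the objective.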

Here is an example in Scala of generating this weight; we add a new column to the dataframe holding a weight for each record in the dataset:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def balanceDataset(dataset: DataFrame): DataFrame = {
  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val numNegatives = dataset.filter(dataset("label") === 0).count
  val datasetSize = dataset.count
  val balancingRatio = (datasetSize - numNegatives).toDouble / datasetSize

  val calculateWeights = udf { d: Double =>
    if (d == 0.0) {
      // Minority (negative) class gets the larger weight
      balancingRatio
    } else {
      // Majority (positive) class gets the smaller weight
      1.0 - balancingRatio
    }
  }

  dataset.withColumn("classWeightCol", calculateWeights(dataset("label")))
}

Then, we create a classifier as follows:

new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")

For more details, see here: https://issues.apache.org/jira/browse/SPARK-9610

A different issue you should check is whether your features have "predictive power" for the label you're trying to predict. If precision is still low after under-sampling, that may have nothing to do with your dataset being imbalanced by nature.
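One quick sanity check (a hypothetical sketch, not part of the original answer): with an 80/20 split, a trivial classifier that always predicts the majority class already reaches 80% accuracy, so any reported accuracy should be compared against that baseline before blaming imbalance:

```scala
// Accuracy of always predicting the majority class -- the bar a real model must beat.
def majorityBaselineAccuracy(labels: Seq[Int]): Double = {
  val majority = labels.groupBy(identity).maxBy(_._2.size)._1
  labels.count(_ == majority).toDouble / labels.size
}

val labels = Seq.fill(80)(1) ++ Seq.fill(20)(0)
val baseline = majorityBaselineAccuracy(labels) // 0.8
```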

I would do an exploratory data analysis first. If the classifier does no better than a random choice, there is a risk that there simply is no connection between the features and the class.

  • Perform correlation analysis for every feature with the label.
  • Generating class-specific histograms for features (i.e. plotting histograms of the data for each class, for a given feature, on the same axis) can also be a good way to show whether a feature discriminates well between the two classes.
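The correlation check in the first bullet can be sketched as a plain Pearson correlation between a feature column and the 0/1 label (in Spark itself you would typically use `DataFrame.stat.corr`; this pure-Scala helper and the sample values are only illustrative):

```scala
// Pearson correlation between a numeric feature and the (0/1) label.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  val n = xs.size
  val mx = xs.sum / n
  val my = ys.sum / n
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx = math.sqrt(xs.map(x => math.pow(x - mx, 2)).sum)
  val sy = math.sqrt(ys.map(y => math.pow(y - my, 2)).sum)
  cov / (sx * sy)
}

// A feature that tracks the label closely correlates near 1.0;
// an uninformative feature correlates near 0.
val labels  = Seq(0.0, 0.0, 1.0, 1.0, 1.0, 1.0)
val feature = Seq(0.1, 0.2, 0.9, 1.0, 0.8, 0.95)
val r = pearson(feature, labels)
```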

Overfitting - a low error on your training set and a high error on your test set might be an indication that you overfit using an overly flexible feature set.

Bias variance - Check whether your classifier suffers from a high bias or high variance problem.

    • Training error vs. validation error - graph the validation error and training set error, as a function of training examples (do incremental learning)
      • If the lines seem to converge to the same value and are close at the end, then your classifier has high bias. In that case, adding more data won't help. Change the classifier for one that has higher variance, or simply lower the regularization parameter of your current one.
      • If, on the other hand, the lines are quite far apart, and you have a low training set error but a high validation error, then your classifier has too high variance. In this case getting more data is very likely to help. If the variance is still too high after getting more data, you can increase the regularization parameter.
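The learning-curve procedure above can be sketched generically (a hypothetical scaffold; `fit` and `error` stand in for whatever trainer and metric you actually use, and the majority-class "model" in the demo is only there to make the sketch self-contained):

```scala
// Train on growing prefixes of the data and record (size, trainError, validationError).
// `fit` and `error` are placeholders for a real trainer and metric.
def learningCurve[M](
    train: Seq[(Double, Int)],
    validation: Seq[(Double, Int)],
    sizes: Seq[Int],
    fit: Seq[(Double, Int)] => M,
    error: (M, Seq[(Double, Int)]) => Double): Seq[(Int, Double, Double)] =
  sizes.map { n =>
    val subset = train.take(n)
    val model = fit(subset)
    (n, error(model, subset), error(model, validation))
  }

// Tiny demo with a majority-class "model": its two error curves meet quickly
// and stay flat, which is the high-bias pattern described above.
val fitMajority = (d: Seq[(Double, Int)]) =>
  d.groupBy(_._2).maxBy(_._2.size)._1
val errorRate = (m: Int, d: Seq[(Double, Int)]) =>
  d.count(_._2 != m).toDouble / d.size

val data  = Seq.tabulate(100)(i => (i.toDouble, if (i % 5 == 0) 0 else 1))
val valid = Seq.tabulate(50)(i => (i.toDouble, if (i % 5 == 0) 0 else 1))
val curve = learningCurve(data, valid, Seq(10, 50, 100), fitMajority, errorRate)
```

Plotting the second and third elements of each tuple against the first gives the two lines the bullets above reason about.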
