Dealing with unbalanced datasets in Spark MLlib


Problem description

I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems using Spark's MLlib.

I'm using MLlib's Random Forest implementation and already tried the simplest approach of randomly undersampling the larger class, but it didn't work as well as I expected.

I would appreciate any feedback regarding your experience with similar issues.

Thanks,

Recommended answer

Class weight with Spark ML

As of this very moment, the class weighting for the Random Forest algorithm is still under development (see here).

But if you're willing to try other classifiers, this functionality has already been added to Logistic Regression.

Consider a case where we have 80% positives (label == 1) in the dataset, so theoretically we want to "under-sample" the positive class. The logistic loss objective function should treat the negative class (label == 0) with higher weight.

Here is an example in Scala of generating this weight; we add a new column to the dataframe holding a weight for each record in the dataset:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def balanceDataset(dataset: DataFrame): DataFrame = {

  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val numNegatives = dataset.filter(dataset("label") === 0).count
  val datasetSize = dataset.count
  // Fraction of positive (label == 1) records in the dataset
  val balancingRatio = (datasetSize - numNegatives).toDouble / datasetSize

  // The minority class (label == 0 here) gets the larger weight,
  // the majority class gets the smaller one
  val calculateWeights = udf { d: Double =>
    if (d == 0.0) {
      1 * balancingRatio
    } else {
      1 * (1.0 - balancingRatio)
    }
  }

  val weightedDataset = dataset.withColumn("classWeightCol", calculateWeights(dataset("label")))
  weightedDataset
}

Then, we create a classifier as follows:

new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")
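
Putting the two pieces together, a minimal usage sketch could look like the following (here trainingData is an assumed DataFrame that already has "features" and "label" columns):

import org.apache.spark.ml.classification.LogisticRegression

// "trainingData" is an assumed DataFrame with "features" and "label" columns;
// balanceDataset is the helper defined above
val weightedData = balanceDataset(trainingData)

val lr = new LogisticRegression()
  .setWeightCol("classWeightCol")
  .setLabelCol("label")
  .setFeaturesCol("features")

// With the weight column set, mistakes on the minority class cost more in the logistic loss
val model = lr.fit(weightedData)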

For more details, see here: https://issues.apache.org/jira/browse/SPARK-9610

A different issue you should check is whether your features have "predictive power" for the label you're trying to predict. If precision is still low after under-sampling, that may have nothing to do with your dataset being imbalanced by nature.

I would do an exploratory data analysis first: if the classifier doesn't do better than a random choice, there is a risk that there simply is no connection between the features and the class.

  • Perform a correlation analysis of every feature with the label (a minimal Scala sketch follows this list).
  • Generating class-specific histograms for features (i.e. plotting histograms of the data of each class, for a given feature, on the same axis) can also be a good way to show whether a feature discriminates well between the two classes.
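
For the correlation analysis, a minimal sketch (assuming the raw numeric feature columns are still present in the DataFrame and their names are collected in a hypothetical featureCols list) could be:

import org.apache.spark.sql.DataFrame

// Pearson correlation of each numeric feature column with the label;
// "featureCols" is an assumed list of raw numeric column names
def labelCorrelations(dataset: DataFrame, featureCols: Seq[String]): Seq[(String, Double)] =
  featureCols.map(c => c -> dataset.stat.corr(c, "label"))

Correlations close to zero only rule out a linear relationship, so treat this as a coarse first check rather than a definitive one.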

Overfitting - a low error on your training set and a high error on your test set might be an indication that you overfit by using an overly flexible feature set.
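
As a quick check, a hold-out comparison along these lines (a sketch, assuming a DataFrame named data with "features" and "label" columns) makes this visible:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// "data" is an assumed DataFrame with "features" and "label" columns
val Array(trainDf, testDf) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val model = new LogisticRegression().fit(trainDf)
val evaluator = new BinaryClassificationEvaluator()  // areaUnderROC by default

val trainAuc = evaluator.evaluate(model.transform(trainDf))
val testAuc = evaluator.evaluate(model.transform(testDf))
// A training AUC far above the test AUC points to overfitting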

Bias vs. variance - check whether your classifier suffers from a high-bias or a high-variance problem.

  • Training error vs. validation error - plot the validation error and the training-set error as functions of the number of training examples (incremental learning); a minimal Scala sketch follows this list.
    • If the two curves seem to converge to the same value and are close at the end, your classifier has high bias. In that case adding more data won't help; change to a classifier with higher variance, or simply lower the regularization parameter of the current one.
    • If, on the other hand, the curves are far apart and you have a low training-set error but a high validation error, your classifier has too high a variance. In that case getting more data will very likely help; if the variance is still too high after getting more data, you can increase the regularization parameter.
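
A minimal Scala sketch of such a learning curve (the classifier, the metric and the column names are assumptions you would adapt to your own pipeline):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

// Train on growing fractions of the training set and compare training error with
// validation error; both DataFrames are assumed to have "features" and "label" columns
def learningCurve(trainDf: DataFrame, validationDf: DataFrame): Seq[(Double, Double, Double)] = {
  val evaluator = new MulticlassClassificationEvaluator()
    .setLabelCol("label")
    .setMetricName("accuracy")

  Seq(0.1, 0.25, 0.5, 0.75, 1.0).map { fraction =>
    val subset = trainDf.sample(withReplacement = false, fraction, seed = 42)
    val model = new LogisticRegression().fit(subset)
    val trainError = 1.0 - evaluator.evaluate(model.transform(subset))
    val validationError = 1.0 - evaluator.evaluate(model.transform(validationDf))
    (fraction, trainError, validationError) // plot these two error curves against fraction
  }
}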
