PySpark: Calculate grouped-by AUC


Problem description

  • Spark version: 1.6.0

I tried computing the AUC (area under the ROC curve) grouped by the field id. Given the following data:

# Within each key-value pair:
# key is "id"
# value is a list of (score, label)
data = sc.parallelize(
         [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
          ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])
         ])

The BinaryClassificationMetrics class can calculate the AUC given an RDD of (score, label) pairs.
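
This works fine for a single, ungrouped RDD of (score, label) pairs (a minimal sketch, assuming a running SparkContext sc):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# One flat RDD of (score, label) pairs -- no grouping by id here
score_and_labels = sc.parallelize([(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])
metrics = BinaryClassificationMetrics(score_and_labels)
print(metrics.areaUnderROC)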

I want to compute the AUC by key (i.e. id1, id2). But how do I "map" a class over an RDD by key?

I tried to wrap BinaryClassificationMetrics in a function:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

def auc(scoreAndLabels):
    return BinaryClassificationMetrics(scoreAndLabels).areaUnderROC

And then map the wrapper function over the values:

data.groupByKey()\
    .mapValues(auc)

But inside mapValues() the list of (score, label) pairs is in fact a ResultIterable, while BinaryClassificationMetrics expects an RDD.
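
The mismatch is easy to check (a quick sketch, assuming the data RDD defined above):

grouped = data.groupByKey()
key, values = grouped.first()
print(type(values))  # <class 'pyspark.resultiterable.ResultIterable'>, not an RDD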

Is there any way to convert the ResultIterable to an RDD so that the auc function can be applied? Or any other workaround for computing grouped-by AUC (without importing third-party modules like scikit-learn)?

Recommended answer

Instead of using BinaryClassificationMetrics, you can use sklearn.metrics.auc and map it over each RDD element's value; that gives you an AUC value per key:

from sklearn.metrics import auc

data = sc.parallelize([
         ('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
         ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])

# zip(*x[1]) transposes the list of (score, label) pairs into a tuple of
# scores and a tuple of labels, which are passed to sklearn's auc as the
# x and y coordinates of the curve to integrate (trapezoidal rule).
result_aucs = data.map(lambda x: (x[0] + '_auc', auc(*zip(*x[1]))))
result_aucs.collect()


Out [1]: [('id1_auc', 0.15000000000000002), ('id2_auc', 0.15000000000000002)]
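
If you would rather keep the original keys instead of appending '_auc', the same idea works with mapValues (a small variation, not part of the original answer):

result_aucs = data.mapValues(lambda pairs: auc(*zip(*pairs)))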

