PySpark: Calculate grouped-by AUC
Problem description
- Spark version: 1.6.0
I tried computing the AUC (area under the ROC curve) grouped by the field id. Given the following data:
# Within each key-value pair:
#   key   is "id"
#   value is a list of (score, label) tuples
data = sc.parallelize(
    [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
     ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])
The BinaryClassificationMetrics class can calculate the AUC given a list of (score, label) pairs.
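For a single, ungrouped RDD its usage is roughly as follows (a minimal sketch using the sample values above):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# An RDD of (score, label) pairs for the whole dataset, not grouped by id
pairs = sc.parallelize([(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])
print(BinaryClassificationMetrics(pairs).areaUnderROC)  # should print 0.25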
I want to compute the AUC by key (i.e. id1, id2), but how do I "map" a class over an RDD by key?
I tried to wrap BinaryClassificationMetrics in a function:
def auc(scoreAndLabels):
    return BinaryClassificationMetrics(scoreAndLabels).areaUnderROC
And then mapped the wrapper function onto each value:
data.groupByKey()\
    .mapValues(auc)
But inside mapValues() the list of (score, label) pairs is actually a ResultIterable, while BinaryClassificationMetrics expects an RDD.
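A quick sketch to confirm the mismatch (prints the type of each grouped value):

grouped = data.groupByKey()
# Each value handed to mapValues is a pyspark.resultiterable.ResultIterable,
# a plain local iterable rather than an RDD:
print(grouped.mapValues(type).collect())
# e.g. [('id1', <class 'pyspark.resultiterable.ResultIterable'>), ...]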
Is there any way to convert the ResultIterable to an RDD so that the auc function can be applied? Or is there any other workaround for computing the grouped-by AUC (without importing third-party modules such as scikit-learn)?
Recommended answer
Instead of using BinaryClassificationMetrics, you can use sklearn.metrics.auc, map it over each RDD element's value, and get an AUC value per key:
from sklearn.metrics import auc
data = sc.parallelize([
    ('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
    ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])
# For each (id, pairs) element, unzip pairs into (scores, labels) and feed them to auc()
result_aucs = data.map(lambda x: (x[0] + '_auc', auc(*zip(*x[1]))))
result_aucs.collect()
Out [1]: [('id1_auc', 0.15000000000000002), ('id2_auc', 0.15000000000000002)]
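One caveat: sklearn.metrics.auc(x, y) simply applies the trapezoidal rule to whatever points it is given, so the snippet above integrates the labels over the scores. The 0.15 it returns is not the area under the ROC curve that BinaryClassificationMetrics.areaUnderROC reports; sklearn.metrics.roc_auc_score(labels, scores) is the matching call. Since the question also asks to avoid third-party modules, below is a minimal pure-Python sketch of the rank-based (Mann-Whitney) ROC AUC applied per key; the roc_auc helper is a hypothetical name, not an existing API:

def roc_auc(score_label_pairs):
    # ROC AUC as the probability that a randomly chosen positive
    # is scored above a randomly chosen negative (ties count 0.5).
    pos = [s for s, l in score_label_pairs if l == 1.0]
    neg = [s for s, l in score_label_pairs if l == 0.0]
    if not pos or not neg:
        return float('nan')  # undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

data.mapValues(roc_auc).collect()
# [('id1', 0.25), ('id2', 0.25)], matching roc_auc_score(labels, scores)

The nested loop is quadratic in the group size; for large groups a sort-and-rank formulation gives the same value in O(n log n), but the idea is identical.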