PySpark: Calculate grouped-by AUC
Problem Description
- Spark version: 1.6.0
I tried computing the AUC (area under the ROC curve) grouped by the field id. Given the following data:
# Within each key-value pair:
#   key is "id"
#   value is a list of (score, label)
data = sc.parallelize(
    [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
     ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])]
)
The BinaryClassificationMetrics class can calculate the AUC given a list of (score, label) pairs.
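For reference, here is a minimal usage sketch of the class on a whole (ungrouped) dataset, assuming a live SparkContext sc: it is constructed from an RDD of (score, label) pairs, and areaUnderROC is a property.

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# An RDD of (score, label) pairs for the whole dataset
score_and_labels = sc.parallelize([(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])
metrics = BinaryClassificationMetrics(score_and_labels)
print(metrics.areaUnderROC)  # 0.25 for this data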
I want to compute the AUC by key (i.e. id1, id2). But how can I "map" a class onto an RDD by key?
I tried to wrap BinaryClassificationMetrics in a function:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

def auc(scoreAndLabels):
    return BinaryClassificationMetrics(scoreAndLabels).areaUnderROC
And then map the wrapper function onto each value:
data.groupByKey()\
.mapValues(auc)
But the list of (score, label) pairs is in fact of type ResultIterable inside mapValues(), while BinaryClassificationMetrics expects an RDD.
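The underlying reason is that groupByKey() delivers each key's values as a local iterable on one executor, and Spark does not allow constructing or using an RDD inside an executor task. The type can be inspected directly (a small sketch, reusing the data RDD defined above):

grouped = data.groupByKey()
print(type(grouped.first()[1]))
# <class 'pyspark.resultiterable.ResultIterable'>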
Is there any way to convert the ResultIterable to an RDD so that the auc function can be applied? Or any other workaround for computing the grouped-by AUC (without importing third-party modules such as scikit-learn)?
Answer
Instead of using BinaryClassificationMetrics, you can use sklearn.metrics.auc and map over each RDD element's value; you'll get an AUC value per key:
from sklearn.metrics import auc

data = sc.parallelize([
    ('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
    ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])

result_aucs = data.map(lambda x: (x[0] + '_auc', auc(*zip(*x[1]))))
result_aucs.collect()
Out [1]: [('id1_auc', 0.15000000000000002), ('id2_auc', 0.15000000000000002)]
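One caveat: sklearn.metrics.auc computes the area under an arbitrary curve by the trapezoidal rule over the given x/y points, so auc(scores, labels) above is the area under the label-vs-score curve rather than the ROC AUC. If the ROC AUC itself is wanted, sklearn.metrics.roc_auc_score takes the labels first and the scores second; a sketch of the same grouped computation:

from sklearn.metrics import roc_auc_score

# Unzip each value into labels and scores; roc_auc_score(y_true, y_score)
# computes the area under the ROC curve itself
roc_aucs = data.mapValues(
    lambda pairs: roc_auc_score([l for _, l in pairs], [s for s, _ in pairs]))
roc_aucs.collect()
# Expected (order may vary): [('id1', 0.25), ('id2', 0.25)]

And since the question asked for a workaround without third-party modules: the grouped values are plain Python iterables inside mapValues(), so the ROC AUC can also be computed per key with the rank-statistic definition (the probability that a randomly chosen positive scores above a randomly chosen negative, counting ties as 1/2). A minimal pure-Python sketch, with roc_auc as an illustrative name:

def roc_auc(score_and_labels):
    pairs = list(score_and_labels)               # materialize the ResultIterable
    pos = [s for s, l in pairs if l == 1.0]      # scores of positive examples
    neg = [s for s, l in pairs if l == 0.0]      # scores of negative examples
    if not pos or not neg:
        return None                              # AUC undefined with one class only
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

data.groupByKey().mapValues(roc_auc).collect()
# Expected (order may vary): [('id1', 0.25), ('id2', 0.25)]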