How to build a lift chart (a.k.a gains chart) in Python?

Question

I just created a model using scikit-learn that estimates the probability that a client will respond to an offer. Now I'm trying to evaluate the model, and for that I want to plot a lift chart. I understand the concept of lift, but I'm struggling to understand how to actually implement it in Python.

Solution

Lift/cumulative gains charts aren't a good way to evaluate a model (as they cannot be used for comparison between models); they are instead a means of evaluating the results where your resources are finite. That is either because there's a cost to acting on each result (in a marketing scenario), or because you want to ignore a certain number of guaranteed voters and only act on those who are on the fence. Where your model is very good and has high classification accuracy for all results, you won't get much lift from ordering your results by confidence.

import numpy as np
import pandas as pd
import sklearn.metrics

def calc_cumulative_gains(df: pd.DataFrame, actual_col: str, predicted_col: str, probability_col: str):

    # Sort a copy by model confidence so the caller's frame is left untouched
    ranked = df.sort_values(by=probability_col, ascending=False)

    # Keep only the cases the model predicted as positive
    subset = ranked[ranked[predicted_col] == True]

    rows = []
    # Split row positions (rather than the frame itself) into 10 near-equal groups
    for idx in np.array_split(np.arange(len(subset)), 10):
        group = subset.iloc[idx]
        # normalize=False returns the number of correct predictions, not the fraction
        score = sklearn.metrics.accuracy_score(group[actual_col].tolist(),
                                               group[predicted_col].tolist(),
                                               normalize=False)

        rows.append({'NumCases': len(group), 'NumCorrectPredictions': score})

    lift = pd.DataFrame(rows)
    total_correct = lift['NumCorrectPredictions'].sum()

    # Cumulative Gains Calculation
    lift['RunningCorrect'] = lift['NumCorrectPredictions'].cumsum()
    lift['PercentCorrect'] = 100 * lift['RunningCorrect'] / total_correct
    lift['CumulativeCorrectBestCase'] = lift['NumCases'].cumsum()
    # Best case: every case in each group is correct, capped at 100%
    lift['PercentCorrectBestCase'] = (
        100 * lift['CumulativeCorrectBestCase'] / total_correct).clip(upper=100)
    lift['AvgCase'] = total_correct / len(lift)
    lift['CumulativeAvgCase'] = lift['AvgCase'].cumsum()
    lift['PercentAvgCase'] = 100 * lift['CumulativeAvgCase'] / total_correct

    # Lift Chart
    lift['NormalisedPercentAvg'] = 1
    lift['NormalisedPercentWithModel'] = lift['PercentCorrect'] / lift['PercentAvgCase']

    return lift
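
Since the question starts from a scikit-learn model, here is a minimal sketch of how the input frame for the function above might be assembled. The synthetic data, the choice of `LogisticRegression`, and the column names `actual`, `predicted` and `probability` are all assumptions made for illustration; they do not come from the original answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real response data (roughly 20% responders).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per held-out customer; the column names are illustrative only.
scored = pd.DataFrame({
    'actual': y_test.astype(bool),
    'predicted': model.predict(X_test).astype(bool),
    # Column 1 of predict_proba is the probability of the positive class.
    'probability': model.predict_proba(X_test)[:, 1],
})
print(scored.head())
```

A frame like this could then be passed as `calc_cumulative_gains(scored, 'actual', 'predicted', 'probability')`, provided the model predicts enough positives to fill ten groups.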

To plot the cumulative gains chart, you can use the code below.

    import matplotlib.pyplot as plt
    def plot_cumulative_gains(lift: pd.DataFrame):
        fig, ax = plt.subplots()

        # Each row is one tenth of the ranked cases, so row i covers the top (i+1)*10 percent
        x = [(i + 1) * 10 for i in range(len(lift))]

        ax.plot(x, lift['PercentCorrect'], 'r-', label='Percent Correct Predictions')
        ax.plot(x, lift['PercentCorrectBestCase'], 'g-', label='Best Case (for current model)')
        ax.plot(x, lift['PercentAvgCase'], 'b-', label='Average Case (for current model)')
        ax.set_xlabel('Total Population (%)')
        ax.set_ylabel('Number of Respondents (%)')

        ax.set_xlim([10, 100])
        ax.set_ylim([10, 100])
        # Put ticks at the plotted percentages instead of relabelling automatic ticks
        ax.set_xticks(x)

        ax.legend()
        plt.show()

And to visualise lift:

    def plot_lift_chart(lift: pd.DataFrame):
        plt.figure()
        # A constant baseline of 1 represents acting without the model
        plt.plot(lift['NormalisedPercentAvg'], 'r-', label='Normalised \'response rate\' with no model')
        plt.plot(lift['NormalisedPercentWithModel'], 'g-', label='Normalised \'response rate\' using the model')
        plt.legend()
        plt.show()

Result looks like:

I found these websites useful for reference:

Edit:

I found the MS link somewhat misleading in its descriptions, but the Paul Te Braak link was very informative. To answer the comment:

@Tanguy, for the cumulative gains chart above, all the calculations are based on the accuracy of that specific model. As the Paul Te Braak link notes, how can my model's prediction accuracy reach 100% (the red line in the chart)? The best case scenario (the green line) is how quickly we could reach the same accuracy that the red line only achieves over the course of the whole population (i.e. our optimum cumulative gains scenario). The blue line is what we get if we just randomly pick the classification for each sample in the population. So the cumulative gains and lift charts are purely for understanding how that model (and that model only) will give me more impact in a scenario where I'm not going to interact with the entire population.
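
For comparison, here is a minimal sketch of the more common textbook form of a gains table, which ranks the whole population by score and tracks the share of all actual positives captured so far. It is not the calculation used in the function above (which works from the accuracy of predicted positives only); the helper name `gains_table` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def gains_table(y_true, y_score, n_bins=10):
    """Cumulative gains and lift per bin of the score-ranked population."""
    # Rank every case from highest to lowest score.
    order = np.argsort(-np.asarray(y_score, dtype=float), kind='stable')
    y_sorted = np.asarray(y_true, dtype=float)[order]
    bins = np.array_split(y_sorted, n_bins)
    table = pd.DataFrame({
        'n_cases': [len(b) for b in bins],
        'n_positives': [b.sum() for b in bins],
    })
    # Share of the population contacted so far (the random-selection baseline).
    table['pct_population'] = 100 * table['n_cases'].cumsum() / table['n_cases'].sum()
    # Share of all positives captured so far when following the model's ranking.
    table['pct_gain'] = 100 * table['n_positives'].cumsum() / table['n_positives'].sum()
    # Lift: how many times better than random selection at the same depth.
    table['lift'] = table['pct_gain'] / table['pct_population']
    return table

# Toy data: scores that are informative but noisy.
rng = np.random.default_rng(0)
y = rng.random(1000) < 0.2
score = 0.5 * y + rng.random(1000)
print(gains_table(y, score).round(2))
```

In this form the baseline is the diagonal (`pct_gain` equal to `pct_population`), and the lift always ends at 1 once the whole population is included.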

One scenario where I have used the cumulative gains chart is fraud cases, where I want to know how many applications in the top X percent we can essentially ignore or prioritise (because I know that the model predicts them as well as it can). In that case, for the 'average model' I instead selected the classification from the real unordered dataset (to show how existing applications were being processed, and how, using the model, we could instead prioritise types of application).
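
The "top X percent" question can be sketched as a single number: the share of all positives that fall within the highest-scoring fraction of cases. The helper name `capture_rate_at_top` and the labels and scores below are made up for illustration and are not from the original answer.

```python
import numpy as np

def capture_rate_at_top(y_true, y_score, top_fraction):
    """Share of all positives found in the highest-scoring `top_fraction` of cases."""
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind='stable')
    # Number of cases reviewed; always look at at least one.
    n_top = max(1, int(round(top_fraction * len(y_true))))
    return y_true[order][:n_top].sum() / y_true.sum()

# Ten applications, three of them fraudulent (made-up labels and scores).
is_fraud = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
fraud_score = [0.9, 0.1, 0.3, 0.8, 0.2, 0.6, 0.05, 0.4, 0.15, 0.25]
print(capture_rate_at_top(is_fraud, fraud_score, 0.2))
```

Here the two highest-scoring applications are both fraudulent, so reviewing the top 20% finds two of the three fraud cases.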

So, for comparing models, just stick with ROC/AUC, and once you're happy with the selected model, use the cumulative gains/lift chart to see how it responds to the data.
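
As a rough sketch of the model-comparison step, the snippet below scores two candidate classifiers on the same held-out split with `sklearn.metrics.roc_auc_score`. The synthetic data and the two candidate models are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Hypothetical candidates; any classifier exposing predict_proba would do.
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(max_depth=3, random_state=0),
}

auc_by_model = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # AUC is computed from the positive-class probability, not the hard label.
    auc_by_model[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

best_name = max(auc_by_model, key=auc_by_model.get)
print(auc_by_model, best_name)
```

The model with the highest AUC would then be the one to examine with the gains and lift charts.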
