How to explain high AUC-ROC with mediocre precision and recall in unbalanced data?

Question

I have some machine learning results that I am trying to make sense of. The task is to predict/label "Irish" vs. "non-Irish". Python 2.7's output:

1= ir
0= non-ir
Class count:
0    4090942
1     940852
Name: ethnicity_scan, dtype: int64
Accuracy: 0.874921350119
Classification report:
             precision    recall  f1-score   support

          0       0.89      0.96      0.93   2045610
          1       0.74      0.51      0.60    470287

avg / total       0.87      0.87      0.87   2515897

Confusion matrix:
[[1961422   84188]
 [ 230497  239790]]
AUC-ir= 0.901238104773

As you can see, precision and recall are mediocre, yet the AUC-ROC is much higher (~0.90). I am trying to figure out why, and I suspect the data imbalance (about 1:5) is responsible. Based on the confusion matrix, with Irish as the target (+), I calculated TPR=0.51 and FPR=0.04. If I treat non-Irish as (+) instead, then TPR=0.96 and FPR=0.49. So how can I get an AUC of 0.9 when the TPR is only 0.5 at FPR=0.04?
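The rates quoted above can be re-derived directly from the confusion matrix. The following is a minimal sketch, assuming the usual scikit-learn layout (rows are true classes, columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]] with Irish as the positive class). The helper name `rates` is made up for this illustration.

```python
# Confusion matrix from the output above, assumed layout [[TN, FP], [FN, TP]]
# with Irish (label 1) as the positive class.
tn, fp = 1961422, 84188
fn, tp = 230497, 239790


def rates(tp, fp, fn, tn):
    """Return (TPR, FPR, precision) for one choice of positive class."""
    tpr = tp / float(tp + fn)        # recall: share of actual positives found
    fpr = fp / float(fp + tn)        # share of actual negatives flagged
    precision = tp / float(tp + fp)  # share of flagged rows that are positive
    return tpr, fpr, precision


irish = rates(tp, fp, fn, tn)
# Swapping the roles makes non-Irish (label 0) the positive class.
non_irish = rates(tn, fn, fp, tp)

print("Irish as +:     TPR=%.2f FPR=%.2f precision=%.2f" % irish)
print("non-Irish as +: TPR=%.2f FPR=%.2f precision=%.2f" % non_irish)
```

Both results describe the same single operating point (one threshold), seen from the two classes. One point on the ROC curve does not determine the area under the whole curve.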

Code:

try:
    for i in mass[k]:
        df = df_temp.copy() # reset df before each loop (copy, so df_temp itself is not modified)
        #$$
        #$$ 
        if 1==1:
        ###if i == singleEthnic:
            count+=1
            ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, jp
            # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
            ############################################
            ############################################

            def ethnicity_target(row):
                try:
                    if row[ethnicity_var] == ethnicity_tar:
                        return 1
                    else:
                        return 0
                except: return None
            df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
            print '1=', ethnicity_tar
            print '0=', 'non-'+ethnicity_tar

            # Random sampling a smaller dataframe for debugging
            rows = df.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
            df = DataFrame(rows)
            print 'Class count:'
            print df['ethnicity_scan'].value_counts()

            # Assign X and y variables
            X = df.raw_name.values
            X2 = df.name.values
            X3 = df.gender.values
            X4 = df.location.values
            y = df.ethnicity_scan.values

            # Feature extraction functions
            def feature_full_name(nameString):
                try:
                    full_name = nameString
                    if len(full_name) > 1: # not accept name with only 1 character
                        return full_name
                    else: return '?'
                except: return '?'

            def feature_full_last_name(nameString):
                try:
                    last_name = nameString.rsplit(None, 1)[-1]
                    if len(last_name) > 1: # not accept name with only 1 character
                        return last_name
                    else: return '?'
                except: return '?'

            def feature_full_first_name(nameString):
                try:
                    first_name = nameString.rsplit(' ', 1)[0]
                    if len(first_name) > 1: # not accept name with only 1 character
                        return first_name
                    else: return '?'
                except: return '?'

            # Transform format of X variables, and spit out a numpy array for all features
            my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
            my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

            all_dict = []
            for i in range(0, len(my_dict)):
                temp_dict = dict(
                    my_dict[i].items() + my_dict5[i].items()
                    )
                all_dict.append(temp_dict)

            newX = dv.fit_transform(all_dict)

            # Separate the training and testing data sets
            X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

            # Fitting X and y into model, using training data
            classifierUsed2.fit(X_train, y_train)

            # Making predictions using trained data
            y_train_predictions = classifierUsed2.predict(X_train)
            y_test_predictions = classifierUsed2.predict(X_test)

Code with resampling inserted:

try:
    for i in mass[k]:
        df = df_temp.copy() # reset df before each loop (copy, so df_temp itself is not modified)
        #$$
        #$$ 
        if 1==1:
        ###if i == singleEthnic:
            count+=1
            ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, jp
            # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
            ############################################
            ############################################

            def ethnicity_target(row):
                try:
                    if row[ethnicity_var] == ethnicity_tar:
                        return 1
                    else:
                        return 0
                except: return None
            df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
            print '1=', ethnicity_tar
            print '0=', 'non-'+ethnicity_tar

            # Resampled
            df_resampled = df.append(df[df.ethnicity_scan==0].sample(len(df)*5, replace=True))

            # Random sampling a smaller dataframe for debugging
            rows = df_resampled.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
            df = DataFrame(rows)
            print 'Class count:'
            print df['ethnicity_scan'].value_counts()

            # Assign X and y variables
            X = df.raw_name.values
            X2 = df.name.values
            X3 = df.gender.values
            X4 = df.location.values
            y = df.ethnicity_scan.values

            # Feature extraction functions
            def feature_full_name(nameString):
                try:
                    full_name = nameString
                    if len(full_name) > 1: # not accept name with only 1 character
                        return full_name
                    else: return '?'
                except: return '?'

            def feature_full_last_name(nameString):
                try:
                    last_name = nameString.rsplit(None, 1)[-1]
                    if len(last_name) > 1: # not accept name with only 1 character
                        return last_name
                    else: return '?'
                except: return '?'

            def feature_full_first_name(nameString):
                try:
                    first_name = nameString.rsplit(' ', 1)[0]
                    if len(first_name) > 1: # not accept name with only 1 character
                        return first_name
                    else: return '?'
                except: return '?'

            # Transform format of X variables, and spit out a numpy array for all features
            my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
            my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

            all_dict = []
            for i in range(0, len(my_dict)):
                temp_dict = dict(
                    my_dict[i].items() + my_dict5[i].items()
                    )
                all_dict.append(temp_dict)

            newX = dv.fit_transform(all_dict)

            # Separate the training and testing data sets
            X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

            # Fitting X and y into model, using training data
            classifierUsed2.fit(X_train, y_train)

            # Making predictions using trained data
            y_train_predictions = classifierUsed2.predict(X_train)
            y_test_predictions = classifierUsed2.predict(X_test)

Recommended answer

Your model outputs a probability P (between 0 and 1) for each row in the test set that it scores. The summary stats (precision, recall, etc.) are computed for a single value of P used as the prediction threshold, probably P=0.5 unless you've changed this in your code. The ROC contains more information: the idea is that you probably won't want to use this default value as your prediction threshold, so the ROC is plotted by calculating the true positive rate and the false positive rate at every prediction threshold between 0 and 1.
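To make the threshold sweep concrete, here is a small pure-Python sketch. The labels and scores are invented toy values, not the asker's data, and `roc_points`, `auc` and `recall_at` are hypothetical helper names. It shows how recall at the default 0.5 threshold can be mediocre while the area under the full curve is high.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from sweeping the threshold over all scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Visit distinct scores from high to low; predict 1 when score >= threshold.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / float(neg), tp / float(pos)))
    return points


def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def recall_at(y_true, scores, thr):
    """Recall (TPR) when predicting 1 for every score >= thr."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
    return tp / float(sum(y_true))


# Toy data (assumption): 4 positives, 6 negatives. Two positives score below
# 0.5 but still rank above almost every negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.35, 0.2, 0.2, 0.1, 0.1, 0.05]

toy_auc = auc(roc_points(y_true, scores))
toy_recall = recall_at(y_true, scores, 0.5)
print("AUC=%.3f recall@0.5=%.2f" % (toy_auc, toy_recall))
```

In this toy case the recall at 0.5 is only 0.5, while the AUC is about 0.96, because AUC rewards ranking positives above negatives regardless of where the 0.5 cut falls. With scikit-learn, the analogous check would be to score with `predict_proba` (or `decision_function`) and pass those scores, not the hard `predict` labels, to `roc_auc_score`.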

If you've undersampled the non-Irish people in your data, then you're correct that the AUC and precision will be overestimated. If your dataset is only 5000 rows, you will have no problem running your model on a larger training set; just rebalance your dataset (by bootstrap sampling to increase the number of non-Irish people) until it accurately reflects your sample population.
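The effect of the class ratio on precision can be sketched with a little arithmetic. For a fixed model and threshold, evaluated on data with a different class mix, TPR and FPR stay the same, but precision = TPR·π / (TPR·π + FPR·(1−π)) moves with the positive-class share π. The snippet below is an illustration under that assumption, using the confusion matrix from the question; `precision_at_prevalence` is a hypothetical helper name.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed TPR/FPR at a given positive-class share."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Rates taken from the confusion matrix in the question (Irish as positive).
tpr = 239790 / float(239790 + 230497)
fpr = 84188 / float(84188 + 1961422)

observed_share = 470287 / float(2515897)  # Irish share of the test set
p_observed = precision_at_prevalence(tpr, fpr, observed_share)
p_balanced = precision_at_prevalence(tpr, fpr, 0.5)   # negatives undersampled to 1:1
p_rarer = precision_at_prevalence(tpr, fpr, 0.05)     # positives rarer than observed

print("precision at observed share: %.2f" % p_observed)
print("precision at 50/50:          %.2f" % p_balanced)
print("precision at 5%% positives:   %.2f" % p_rarer)
```

At the observed class share this reproduces the reported precision of about 0.74. The same TPR and FPR give a noticeably higher precision on a 50/50 sample and a lower one when positives are rarer, which is why precision should be read against the class ratio of the population you care about.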
