Reading the source of sklearn's roc_auc_score

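In sklearn, AUC is computed with roc_auc_score(). Its approach is essentially the same as that of tf.metrics.auc(): both approximate the area under the ROC curve by summing the areas of small trapezoids. The main difference lies in how those trapezoids are obtained (each one requires picking a threshold, counting tp, tn, fp and fn at that threshold, and from those computing tpr and fpr). First, tf.metrics.auc() lets you specify the number of thresholds, with a default of 200; setting it to roughly the batch size is usually reasonable. sklearn's roc_auc_score() has no such parameter: the number of thresholds is determined by the data, one per distinct score among the samples passed in, so at most the batch size. Second, the thresholds are generated differently. tf.metrics.auc() uses evenly spaced thresholds, while roc_auc_score() uses the predicted scores themselves as thresholds.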
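To make the difference between the two threshold schemes concrete, here is a minimal NumPy sketch. It is not the code of either library: roc_points and trapezoid_area are hypothetical helper names, the evenly spaced grid is a simplified stand-in for what tf.metrics.auc() does, and the data is the example from the roc_auc_score docstring.

```python
import numpy as np

def roc_points(y_true, y_score, thresholds):
    """Return (fpr, tpr), one point per threshold; a sample is predicted
    positive when its score is >= the threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    fpr, tpr = [], []
    for t in thresholds:
        pred = y_score >= t
        tpr.append((pred & y_true).sum() / n_pos)
        fpr.append((pred & ~y_true).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

def trapezoid_area(fpr, tpr):
    """Sum of trapezoid areas; assumes the points are ordered by non-decreasing fpr."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Scheme 1: thresholds taken from the distinct scores (descending),
# plus +inf so that the curve starts at (0, 0).
score_thresholds = np.r_[np.inf, np.unique(y_score)[::-1]]
auc_from_scores = trapezoid_area(*roc_points(y_true, y_score, score_thresholds))

# Scheme 2: evenly spaced thresholds from 1 down to 0.
coarse = np.linspace(1.0, 0.0, 5)    # too coarse to separate 0.4 from 0.35
fine = np.linspace(1.0, 0.0, 201)    # fine enough to separate every score
auc_coarse_grid = trapezoid_area(*roc_points(y_true, y_score, coarse))
auc_fine_grid = trapezoid_area(*roc_points(y_true, y_score, fine))

print(auc_from_scores, auc_coarse_grid, auc_fine_grid)
```

On this data the score-derived thresholds give 0.75, matching the docstring. The five-point grid gives 0.875 because no grid threshold falls between the scores 0.4 and 0.35, while the 201-point grid recovers 0.75.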

First, the definition of roc_auc_score:

def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
    """Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.metrics import roc_auc_score
    >>> y_true = np.array([0, 0, 1, 1])
    >>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
    >>> roc_auc_score(y_true, y_scores)
    0.75

    """
    def _binary_roc_auc_score(y_true, y_score, sample_weight=None):
        if len(np.unique(y_true)) != 2:
            raise ValueError("Only one class present in y_true. ROC AUC score "
                             "is not defined in that case.")

        fpr, tpr, tresholds = roc_curve(y_true, y_score,
                                        sample_weight=sample_weight)
        return auc(fpr, tpr, reorder=True)

    return _average_binary_score(
        _binary_roc_auc_score, y_true, y_score, average,
        sample_weight=sample_weight)

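As you can see, there are two main arguments, y_true and y_score. The _average_binary_score call in the final return statement ends up calling _binary_roc_auc_score(y_true, y_score). Inside _binary_roc_auc_score(), roc_curve() is called first to compute fpr and tpr, and then auc(fpr, tpr, reorder=True) produces the AUC value. The implementation of auc() is essentially the same as that of tf.metrics.auc(), so it is not repeated here. The focus below is on how fpr and tpr are produced.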

roc_curve() is defined as follows:

def roc_curve(y_true, y_score, pos_label=None, sample_weight=None,
              drop_intermediate=True):
    """Compute Receiver operating characteristic (ROC)
    """
    fps, tps, thresholds = _binary_clf_curve(
        y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)

    if drop_intermediate and len(fps) > 2:
        optimal_idxs = np.where(np.r_[True,
                                      np.logical_or(np.diff(fps, 2),
                                                    np.diff(tps, 2)),
                                      True])[0]
        fps = fps[optimal_idxs]
        tps = tps[optimal_idxs]
        thresholds = thresholds[optimal_idxs]

    if tps.size == 0 or fps[0] != 0:
        # Add an extra threshold position if necessary
        tps = np.r_[0, tps]
        fps = np.r_[0, fps]
        thresholds = np.r_[thresholds[0] + 1, thresholds]

    if fps[-1] <= 0:
        warnings.warn("No negative samples in y_true, "
                      "false positive value should be meaningless",
                      UndefinedMetricWarning)
        fpr = np.repeat(np.nan, fps.shape)
    else:
        fpr = fps / fps[-1]

    if tps[-1] <= 0:
        warnings.warn("No positive samples in y_true, "
                      "true positive value should be meaningless",
                      UndefinedMetricWarning)
        tpr = np.repeat(np.nan, tps.shape)
    else:
        tpr = tps / tps[-1]

    return fpr, tpr, thresholds

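The core of roc_curve is its first statement, the call to _binary_clf_curve, which computes tp and fp. Once tp and fp are known, tpr and fpr are easy to compute, because the total numbers of positives and negatives (and hence fn and tn) follow directly from the labels. To state the conclusion up front: the fps and tps returned by that call hold the fp and tp counts at each threshold, and each of them is an array.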
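The normalization from counts to rates can be sketched as follows. The tps and fps arrays are written out by hand for the docstring example (y_true = [0, 0, 1, 1], y_score = [0.1, 0.4, 0.35, 0.8]) with thresholds in descending order. The names n_pos, n_neg, fns and tns are introduced here for illustration only.

```python
import numpy as np

# Cumulative counts at thresholds 0.8, 0.4, 0.35, 0.1 (descending).
tps = np.array([1, 1, 2, 2])   # true positives at each threshold
fps = np.array([0, 1, 1, 2])   # false positives at each threshold

# At the lowest threshold every sample is predicted positive, so the last
# entries equal the total number of positives and negatives in the labels.
n_pos = tps[-1]
n_neg = fps[-1]

# fn and tn follow from the totals; they are never needed explicitly.
fns = n_pos - tps
tns = n_neg - fps

tpr = tps / n_pos   # same as tps / (tps + fns)
fpr = fps / n_neg   # same as fps / (fps + tns)

print(fpr, tpr)
```

This mirrors the fps / fps[-1] and tps / tps[-1] lines of roc_curve shown above.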

Next, the _binary_clf_curve function.

def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):
    """Calculate true and false positives per binary classification threshold.
    """
    # Check to make sure y_true is valid
    y_type = type_of_target(y_true)
    if not (y_type == "binary" or
            (y_type == "multiclass" and pos_label is not None)):
        raise ValueError("{0} format is not supported".format(y_type))

    check_consistent_length(y_true, y_score, sample_weight)
    y_true = column_or_1d(y_true)  # column_or_1d validates the shape and returns a 1-D array
    y_score = column_or_1d(y_score)
    assert_all_finite(y_true)
    assert_all_finite(y_score)

    if sample_weight is not None:
        sample_weight = column_or_1d(sample_weight)

    # ensure binary classification if pos_label is not specified
    classes = np.unique(y_true)
    if (pos_label is None and
        not (np.array_equal(classes, [0, 1]) or
             np.array_equal(classes, [-1, 1]) or
             np.array_equal(classes, [0]) or
             np.array_equal(classes, [-1]) or
             np.array_equal(classes, [1]))):
        raise ValueError("Data is not binary and pos_label is not specified")
    elif pos_label is None:
        pos_label = 1.

    # make y_true a boolean vector
    y_true = (y_true == pos_label)

    # sort scores and corresponding truth values
    desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]  # argsort returns ascending-order indices; [::-1] reverses them, giving descending order
    y_score = y_score[desc_score_indices]
    y_true = y_true[desc_score_indices]
    if sample_weight is not None:
        weight = sample_weight[desc_score_indices]
    else:
        weight = 1.

    # y_score typically has many tied values. Here we extract
    # the indices associated with the distinct values. We also
    # concatenate a value for the end of the curve.
    distinct_value_indices = np.where(np.diff(y_score))[0]
    threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]  # np.r_ concatenates along the first axis

    # accumulate the true positives with decreasing threshold
    tps = stable_cumsum(y_true * weight)[threshold_idxs]
    if sample_weight is not None:
        fps = stable_cumsum(weight)[threshold_idxs] - tps
    else:
        fps = 1 + threshold_idxs - tps
    return fps, tps, y_score[threshold_idxs]

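The important part starts at the sorting step: desc_score_indices holds the indices that sort y_score in descending order, and the next two lines reorder y_score and y_true accordingly. np.diff(y_score) then gives the first-order difference of the sorted y_score, and np.where(np.diff(y_score))[0] gives the indices at which that difference is nonzero. The [0] is needed because np.where returns a tuple whose first element is the index array. In effect this deduplicates y_score, since repeated values are meaningless as thresholds. The threshold_idxs line then appends the index y_true.size - 1 to distinct_value_indices, because the first-order difference has one element fewer than y_score.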
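The deduplication step is easier to follow on a small array. The sketch below uses made-up scores that are already sorted in descending order and contain ties; only NumPy calls that appear in the source are used.

```python
import numpy as np

# Hypothetical scores, already sorted in descending order, with ties.
y_score = np.array([0.9, 0.9, 0.7, 0.4, 0.4, 0.4, 0.1])

# First-order difference: zero wherever two neighbours are tied.
diffs = np.diff(y_score)

# np.where returns a tuple; element 0 holds the positions of nonzero diffs,
# i.e. the last index of each run of equal scores (except the final run).
distinct_value_indices = np.where(diffs)[0]

# The diff has one element fewer than y_score, so the last index is appended
# to close the final run.
threshold_idxs = np.r_[distinct_value_indices, y_score.size - 1]

thresholds = y_score[threshold_idxs]
print(distinct_value_indices, threshold_idxs, thresholds)
```

Each entry of threshold_idxs points at the last sample of a group of tied scores, so the seven samples collapse to four thresholds: 0.9, 0.7, 0.4 and 0.1.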

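The tps line is another key point. stable_cumsum(y_true * weight)[threshold_idxs] first takes the cumulative sum of the descending-ordered y_true and then picks out the sums at threshold_idxs. Because positives are 1 and negatives are 0, this yields the number of true positives (tps) at each threshold. The fps line yields the false positives. In the unweighted case, threshold_idxs is not just an index: 1 + threshold_idxs is the number of samples, positive and negative together, whose score is at or above that threshold, so 1 + threshold_idxs - tps is the number of false positives.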
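A small worked example of the cumulative-sum step follows. It is a sketch under two assumptions: np.cumsum stands in for sklearn's stable_cumsum, and the labels, indices and weights are made up for illustration (seven samples sorted by descending score, with tied scores ending at positions 1, 2, 5 and 6).

```python
import numpy as np

# Labels in descending-score order (True = positive), hypothetical data.
y_true = np.array([1, 1, 0, 1, 0, 0, 0], dtype=bool)
# Last index of each group of tied scores, hypothetical.
threshold_idxs = np.array([1, 2, 5, 6])

# Unweighted case: the running count of positives, read off at each threshold.
tps = np.cumsum(y_true)[threshold_idxs]
# 1 + threshold_idxs samples have a score at or above the threshold;
# whatever is not a true positive among them is a false positive.
fps = 1 + threshold_idxs - tps

# Weighted case, mirroring the sample_weight branch: cumulative total weight
# minus cumulative positive weight.
weight = np.full(y_true.size, 2.0)
tps_w = np.cumsum(y_true * weight)[threshold_idxs]
fps_w = np.cumsum(weight)[threshold_idxs] - tps_w

print(tps, fps, tps_w, fps_w)
```

The last entries of tps and fps are 3 and 4, the total numbers of positives and negatives, which is what roc_curve later divides by. With a uniform weight of 2.0 the weighted counts are simply doubled.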

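In summary, roc_auc_score is implemented in essentially the same way as tf.metrics.auc. They differ only in how the small trapezoids are obtained: the number of trapezoids differs (different number of thresholds), and the area of each differs (different thresholds give different tp, fn, fp and tn, hence different tpr and fpr). Putting the two implementations together, two points stand out: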

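On the number of thresholds: when using tf.metrics.auc, the num_thresholds parameter is best set to the batch size.
On the threshold values: neither scheme is clearly better. tf.metrics.auc spaces them evenly, roc_auc_score takes them directly from the scores, and which one to prefer is a matter of judgment.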