In sklearn, AUC is computed with the roc_auc_score() function, and its approach is essentially the same as tf.metrics.auc(): both approximate the area under the ROC curve as a sum of small trapezoids, in the same limit-approximation spirit. The two differ mainly in how those trapezoids are formed (forming a trapezoid requires choosing thresholds, computing tp, tn, fp, fn at each one, and from those tpr, fpr and the trapezoid area). First, the number of thresholds: tf.metrics.auc() lets you specify it (the default is 200), and setting it to the batch size is generally reasonable; in sklearn's roc_auc_score(), the number of thresholds is determined by the data itself, at most one per distinct prediction score, i.e. on the order of the batch size. Second, how the thresholds are generated: tf.metrics.auc() spaces them evenly, while roc_auc_score() uses the prediction scores themselves as thresholds.
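
To make the difference concrete, here is a minimal NumPy sketch (my own illustration, not code from either library; auc_from_thresholds is a hypothetical helper): it sweeps a given set of thresholds, collects the (fpr, tpr) points, and sums the trapezoids, once with an evenly spaced grid in the tf.metrics.auc style and once with the scores themselves in the roc_auc_score style.

import numpy as np

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

def auc_from_thresholds(y_true, y_score, thresholds):
    # Sweep the thresholds from high to low; tpr and fpr then grow
    # monotonically, so the ROC points come out already ordered.
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [0.0], [0.0]
    for t in sorted(thresholds, reverse=True):
        pred_pos = y_score >= t
        tpr.append((pred_pos & (y_true == 1)).sum() / n_pos)
        fpr.append((pred_pos & (y_true == 0)).sum() / n_neg)
    fpr.append(1.0)
    tpr.append(1.0)
    return np.trapz(tpr, fpr)  # area under the curve = sum of small trapezoids

# tf.metrics.auc style: a fixed grid of evenly spaced thresholds
print(auc_from_thresholds(y_true, y_score, np.linspace(0.0, 1.0, 200)))  # 0.75 on this toy data; in general an approximation
# roc_auc_score style: the prediction scores themselves act as thresholds
print(auc_from_thresholds(y_true, y_score, y_score))                     # 0.75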

Now look at the definition of roc_auc_score:

def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
    """Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.metrics import roc_auc_score
    >>> y_true = np.array([0, 0, 1, 1])
    >>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
    >>> roc_auc_score(y_true, y_scores)
    0.75

    """
    def _binary_roc_auc_score(y_true, y_score, sample_weight=None):
        if len(np.unique(y_true)) != 2:
            raise ValueError("Only one class present in y_true. ROC AUC score "
                             "is not defined in that case.")

        fpr, tpr, tresholds = roc_curve(y_true, y_score,
                                        sample_weight=sample_weight)
        return auc(fpr, tpr, reorder=True)

    return _average_binary_score(
        _binary_roc_auc_score, y_true, y_score, average,
        sample_weight=sample_weight)

As you can see, the two main inputs are y_true and y_score. The _average_binary_score call at the end effectively invokes _binary_roc_auc_score(y_true, y_score). Inside _binary_roc_auc_score(), roc_curve() is first called to compute fpr and tpr, and auc(fpr, tpr, reorder=True) is then called to obtain the AUC value. The implementation of auc() is essentially the same as tf.metrics.auc() and is not repeated here; the interesting part is how fpr and tpr are produced.
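
To see this call chain from the outside, here is a quick check with the public sklearn API, reusing the toy data from the docstring above: roc_curve followed by auc reproduces roc_auc_score.

import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_auc_score internally goes roc_curve -> auc, so the two results agree.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(auc(fpr, tpr))                    # 0.75
print(roc_auc_score(y_true, y_scores))  # 0.75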

The definition of roc_curve() is as follows:

def roc_curve(y_true, y_score, pos_label=None, sample_weight=None,
              drop_intermediate=True):
    """Compute Receiver operating characteristic (ROC)
    """
    fps, tps, thresholds = _binary_clf_curve(
        y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)

    if drop_intermediate and len(fps) > 2:
        optimal_idxs = np.where(np.r_[True,
                                      np.logical_or(np.diff(fps, 2),
                                                    np.diff(tps, 2)),
                                      True])[0]
        fps = fps[optimal_idxs]
        tps = tps[optimal_idxs]
        thresholds = thresholds[optimal_idxs]

    if tps.size == 0 or fps[0] != 0:
        # Add an extra threshold position if necessary
        tps = np.r_[0, tps]
        fps = np.r_[0, fps]
        thresholds = np.r_[thresholds[0] + 1, thresholds]

    if fps[-1] <= 0:
        warnings.warn("No negative samples in y_true, "
                      "false positive value should be meaningless",
                      UndefinedMetricWarning)
        fpr = np.repeat(np.nan, fps.shape)
    else:
        fpr = fps / fps[-1]

    if tps[-1] <= 0:
        warnings.warn("No positive samples in y_true, "
                      "true positive value should be meaningless",
                      UndefinedMetricWarning)
        tpr = np.repeat(np.nan, tps.shape)
    else:
        tpr = tps / tps[-1]

    return fpr, tpr, thresholds

The core of roc_curve lies in the call to _binary_clf_curve, i.e. how tp and fp are computed. Once tp and fp are known, tpr and fpr are easy to derive, because the total numbers of positives and negatives follow directly from the labels (fn = P - tp, tn = N - fp). Stating the conclusion up front: the fps and tps returned by _binary_clf_curve are arrays holding the fp and tp counts at each threshold.
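
The per-threshold nature of these arrays is easy to see from the public roc_curve output on the toy data (drop_intermediate=False keeps every threshold). The exact arrays in the comments are illustrative and may differ slightly across sklearn versions, e.g. in the extra leading threshold prepended for the (0, 0) point.

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_scores, drop_intermediate=False)
# One (fpr, tpr) point per threshold; the thresholds are the scores themselves.
print(thresholds)  # e.g. [inf 0.8 0.4 0.35 0.1]
print(fpr)         # e.g. [0.  0.  0.5 0.5 1. ]
print(tpr)         # e.g. [0.  0.5 0.5 1.  1. ]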

Next, look at the _binary_clf_curve function.

def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):
    """Calculate true and false positives per binary classification threshold.
    """
    # Check to make sure y_true is valid
    y_type = type_of_target(y_true)
    if not (y_type == "binary" or
            (y_type == "multiclass" and pos_label is not None)):
        raise ValueError("{0} format is not supported".format(y_type))

    check_consistent_length(y_true, y_score, sample_weight)
    y_true = column_or_1d(y_true)    # column_or_1d validates/flattens the shape
    y_score = column_or_1d(y_score)
    assert_all_finite(y_true)
    assert_all_finite(y_score)

    if sample_weight is not None:
        sample_weight = column_or_1d(sample_weight)

    # ensure binary classification if pos_label is not specified
    classes = np.unique(y_true)
    if (pos_label is None and
            not (np.array_equal(classes, [0, 1]) or
                 np.array_equal(classes, [-1, 1]) or
                 np.array_equal(classes, [0]) or
                 np.array_equal(classes, [-1]) or
                 np.array_equal(classes, [1]))):
        raise ValueError("Data is not binary and pos_label is not specified")
    elif pos_label is None:
        pos_label = 1.

    # make y_true a boolean vector
    y_true = (y_true == pos_label)

    # sort scores and corresponding truth values
    # np.argsort gives indices for ascending order; [::-1] reverses them, i.e. descending order
    desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
    y_score = y_score[desc_score_indices]
    y_true = y_true[desc_score_indices]
    if sample_weight is not None:
        weight = sample_weight[desc_score_indices]
    else:
        weight = 1.

    # y_score typically has many tied values. Here we extract
    # the indices associated with the distinct values. We also
    # concatenate a value for the end of the curve.
    distinct_value_indices = np.where(np.diff(y_score))[0]
    threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]  # np.r_ concatenates along the first axis

    # accumulate the true positives with decreasing threshold
    tps = stable_cumsum(y_true * weight)[threshold_idxs]
    if sample_weight is not None:
        fps = stable_cumsum(weight)[threshold_idxs] - tps
    else:
        fps = 1 + threshold_idxs - tps
    return fps, tps, y_score[threshold_idxs]

The key part starts with the sorting: desc_score_indices holds the indices that sort y_score in descending order (np.argsort sorts ascending, and [::-1] reverses the result), and the two assignments that follow reorder y_score and y_true accordingly. Then np.diff(y_score) takes the first-order difference of the sorted scores, and np.where(np.diff(y_score))[0] returns the indices where that difference is non-zero; the trailing [0] is needed because np.where returns a tuple whose first element is the index array. This is effectively a deduplication of y_score, since repeated values are useless as thresholds. Finally, the index y_true.size - 1 is appended to distinct_value_indices, because the difference array is one element shorter than y_score and the last position must be added back explicitly as the final threshold.
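
These index manipulations are easier to follow on the toy data. The following standalone NumPy sketch replays the same steps outside of sklearn:

import numpy as np

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Sort the scores in descending order and reorder the labels the same way.
desc = np.argsort(y_score, kind="mergesort")[::-1]
y_score_sorted = y_score[desc]  # [0.8  0.4  0.35 0.1 ]
y_true_sorted = y_true[desc]    # [1 0 1 0]

# One index per distinct score (where the sorted score changes), plus the
# last index so the smallest score is also used as a threshold.
distinct_value_indices = np.where(np.diff(y_score_sorted))[0]            # [0 1 2]
threshold_idxs = np.r_[distinct_value_indices, y_true_sorted.size - 1]   # [0 1 2 3]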

The other key step is tps = stable_cumsum(y_true * weight)[threshold_idxs]: it takes a cumulative sum over the descending-sorted y_true and then reads that sum off at the positions in threshold_idxs. Since positives are 1 and negatives are 0, this yields the true-positive count tp at each threshold (tps). The line fps = 1 + threshold_idxs - tps then yields the false positives: each value in threshold_idxs is not just an index, it is also one less than the number of samples (positives plus negatives) predicted positive at that threshold, so 1 + threshold_idxs - tps is exactly the false-positive count fp.
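
Continuing the sketch above (unweighted case, so plain np.cumsum stands in for stable_cumsum), the same two lines reproduce the per-threshold tp and fp counts:

# Cumulative number of positives seen so far, read off at each threshold index.
tps = np.cumsum(y_true_sorted)[threshold_idxs]  # [1 1 2 2]

# threshold_idxs + 1 is the number of samples predicted positive at each
# threshold, so subtracting the true positives leaves the false positives.
fps = 1 + threshold_idxs - tps                  # [0 1 1 2]

thresholds = y_score_sorted[threshold_idxs]     # [0.8  0.4  0.35 0.1 ]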

In summary, roc_auc_score is implemented in essentially the same way as tf.metrics.auc; they differ only in how the small trapezoids are computed, namely in the number of trapezoids (different numbers of thresholds) and in the individual trapezoid areas (different thresholds lead to different tp, fn, fp, tn, hence different tpr and fpr, hence different areas). Putting the two implementations side by side gives two practical takeaways:

  • On the number of thresholds: when using tf.metrics.auc, it is usually best to set num_thresholds to the batch size (see the sketch after this list);
  • On the threshold values themselves: it is hard to say which scheme is better. tf.metrics.auc spaces them evenly, while roc_auc_score takes the scores directly; it is largely a matter of taste.
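
As a reference for the first point, here is a minimal graph-mode sketch (assuming the TF 1.x-style API; under TF 2.x the same function is available as tf.compat.v1.metrics.auc) that ties num_thresholds to the batch size:

import numpy as np
import tensorflow as tf  # TF 1.x-style graph mode assumed

batch_size = 4
labels = tf.placeholder(tf.int64, [batch_size])
predictions = tf.placeholder(tf.float32, [batch_size])

# Tie the number of evenly spaced thresholds to the batch size, as suggested above.
auc_value, update_op = tf.metrics.auc(labels, predictions,
                                      num_thresholds=batch_size)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())  # the metric keeps local counter variables
    sess.run(update_op,
             feed_dict={labels: np.array([0, 0, 1, 1]),
                        predictions: np.array([0.1, 0.4, 0.35, 0.8], dtype=np.float32)})
    print(sess.run(auc_value))  # a coarse estimate; a larger num_thresholds gives a finer approximation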