This weekly update covers the last two weeks as I was simply too busy last week. Friend of the show Kajsa was visiting our Lyngby location for a week of focussed writing. The idea was that we have this mostly-finished paper on the ranking of classifiers so let’s just sit down and finish it. “Mostly finished” is a vague description, and I’m not sure we actually made it much more finished than it was. For example there was this quick little simulation study I wanted to do that was neither quick nor little once I started looking at it.
For this simulation—work still in progress—I need to calculate a lot of AUCs. The AUC (or area under the receiver operating characteristic curve for short) used to be a completely inscrutable statistic for me because you involve this story about radars and moving some kind of decision threshold back and forth and etc., etc., etc., etc. It’s nearly impossible to give it a sane explanation.
Or so I used to think. Some years ago I learned that the AUC does have a completely sane interpretation. You’ve got some data that can be classified as either T or F and you have a model that spits out a T-ness score for each data point. A higher score means that it’s more likely that this data point belongs to class T. Could be a probability, but it doesn’t have to be. AUC is the probability that a randomly chosen T will have a higher score than a randomly chosen F. That’s it. End of explanation. No need to invoke any curve, nor threshold, nor hypothetical radar.
In numerical linear algebra it’s a folk theorem that if you’re inverting a matrix you’re probably doing something wrong. Similarly if you’re calculating an AUC by first building the ROC curve and then integrating you’re probably doing it wrong. I figured this out last week; my journey was something like this:
pROC::auc
in R
because
this is what I had been doing when I didn’t need actual millions of the
things. This is very slow.pROC::auc
, and indeed it
is.bigstatsr
package says that they have another fast method, and indeed it is a
little faster.bigstatsr::AUC
function spends
almost half its time in sanity checks. This isn’t a problem for the
size of data that bigstatsr
is made for: the asserts will
proportionally take up no time. But for me it matters because I’m only
doing like 1000 scores at a time.bigstatsr::AUC
just calls another function
after those asserts. This function is “not exported,” but actually
R
doesn’t enforce that. You just do
bigstatsr:::AUC2
(notice :::
instead of
::
) and you access the not-exported function to skip the
sanity checks. I think the only way to speed up my simulation now is
parallelizing, which is easy.Both the bigstatsr
approach and the one from the SO
answer are based on the observation that, as I say, AUC is
mathematically equivalent to a probability about ranking (see Hanley and McNeil
(1982) for more). If you sort the scores it’s a really fast thing
to count pairs of T, F where the score of T is greater than the score of
F. You can do it in
a single pass. By the same approach you could in fact also build the
ROC curve in a single pass, but then you’d still have to do your stupid
little integration if AUC was what you wanted.
Backlinks: