paper review: “VOS: LEARNING WHAT YOU DON’T KNOW BY VIRTUAL OUTLIER SYNTHESIS”

Chadrick
Apr 14, 2022

arxiv: https://arxiv.org/abs/2202.01197

key points

  • presents VOS, a novel framework for OOD detection
  • generates virtual outliers in latent space and uses them in training, helping the model to discriminate between ID (in-distribution) and OOD (out-of-distribution) data
  • by adding a new loss that discriminates OOD from ID data during training, the model’s object detection performance can also be improved

Problems with OOD

When training a neural network model on any dataset, we often find that while the model trains well on the training data, it gives awkward predictions on real-life data that differs from the training dataset. Anyone who has dealt with any sort of image classification model training has probably confronted the case where a trained model shows over-confident predictions in real-life situations.

Figure 1. An example of an out-of-distribution object incorrectly detected with high confidence.

According to the authors, such over-confident behavior arises because normal training only fits the model to give high and correct confidence on the training data; nothing prevents it from also showing high confidence at points far from the training data points, like in the figure below.

Figure 2.

What we would have wanted with training would be something like this:

Figure 3.

In this ideal setting the model gives high and correct confidence for data points like those used in training. For data points that are very different from those used in training (out-of-distribution points), it shows low confidence scores.

To make the model behave more like the last figure above, the paper proposes to synthesize a virtual outlier in feature space and utilize it in training to achieve such behavior. The paper names this method “Virtual Outlier Synthesis (VOS)”.

How it works

Overall Idea

  • while training, calculate the mean and covariance in feature space for each class, using a number of training samples
  • for each class, use the calculated mean & covariance to randomly sample an outlier far enough from the mean. This is called a “virtual outlier”
  • add an OOD detection objective at the end of the model which judges whether a given feature vector is an outlier or not (this reuses the existing classification logits, as explained later)
  • using the training samples and the virtual outliers derived from them, calculate an OOD detection loss and add it to the overall training loss. This is how the authors modify the training process to be “aware of unseen data”, thereby shifting the model to behave more like figure 3 rather than figure 2.

How to sample an outlier

The authors make the assumption that the “feature representation of object instances forms a class-conditional multivariate Gaussian distribution”, as in Figure 3.

For the “feature representation”, the authors chose the penultimate layer of the neural network, meaning they chose the output of the layer right before the last layer. The dimension of the penultimate layer’s output is smaller than the input dimension, so it acts as a latent representation.

Based on the assumption that the feature representations of object instances form a class-conditional Gaussian distribution, we calculate the mean and covariance of the feature representations for each class.

For each class, get N feature representations from N samples. Using these N features, we can calculate the mean and covariance.

When we want to sample an outlier in feature space for a specific class, we assume that class’s distribution in feature space is a Gaussian shaped by the mean and covariance calculated above. In this setting, we can randomly sample points in feature space and keep only those whose probability under that Gaussian is less than some small value (epsilon), i.e. points in the low-likelihood tail of the class distribution.

Note that finding the mean and covariance values for each class is done periodically throughout the training process.
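To make this concrete, here is a minimal sketch of how the per-class Gaussian estimation and the low-likelihood sampling could look, assuming the penultimate-layer features of one class have already been collected. The function names, the candidate-pool size, and the small diagonal term added to the covariance are my own choices, not the paper’s code.

```python
import torch

def estimate_class_gaussian(feats):
    """Estimate the mean and covariance of one class from its penultimate-layer
    features, given as a tensor of shape (N_k, D)."""
    mu = feats.mean(dim=0)
    centered = feats - mu
    cov = centered.T @ centered / centered.shape[0]
    cov = cov + 1e-4 * torch.eye(cov.shape[1])       # small ridge for numerical stability
    return mu, cov

def sample_virtual_outliers(mu, cov, num_candidates=10000, num_keep=1):
    """Draw many candidates from N(mu, cov) and keep only the lowest-density ones,
    i.e. points lying in the low-likelihood tail of the class Gaussian."""
    mvn = torch.distributions.MultivariateNormal(mu, covariance_matrix=cov)
    candidates = mvn.sample((num_candidates,))       # (num_candidates, D)
    log_probs = mvn.log_prob(candidates)             # log density of each candidate
    keep = log_probs.argsort()[:num_keep]            # indices of the least likely candidates
    return candidates[keep]
```

Keeping only the lowest-likelihood candidates out of a large pool is one way to realize the “probability less than epsilon” criterion described above.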

What to do with the outlier?

Now that we have sampled an outlier, what do we do with it? We want to use this outlier to teach the model that it is OOD while the training data point is in-distribution (ID).

In order to do this, we utilize the classification branch’s logits not only for the classification loss, but also to calculate an “uncertainty loss” that is related to discriminating ID and OOD from each other.

Normally, we apply softmax to the classification logits to find which class the input most likely belongs to. There have been previous works that view this softmax operation as a form of calculating an “energy”. These works showed that by interpreting the logits as an “energy”, they were able to achieve a level of OOD detection.

In other words, the classification logits can be used not only to see which class an object most likely belongs to, but also to judge whether that classification looks “normal (in-distribution)” or “a bit off (out-of-distribution)”.

In the softmax function, the denominator can be regarded as a “free energy” after taking its log and negating it (the negative log partition function). This “energy” is an indicator of whether an input looks like an ID sample or an OOD sample.

E(x) = -log( Σ_k exp(f_k(x)) ), where k = class index and f_k(x) = logit of class k in the classification output.
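To make the “energy” concrete, here is a small sketch of the free energy computed from classification logits. `logsumexp` is just a numerically stable way of taking the log of the softmax denominator; the toy logits below are made-up numbers, and the sketch uses the plain, unweighted form of the energy.

```python
import torch

def free_energy(logits):
    """E(x) = -log(sum_k exp(f_k(x))): the negated log of the softmax denominator.
    Strongly negative values are expected for ID inputs, values closer to (or above)
    zero for OOD inputs."""
    return -torch.logsumexp(logits, dim=-1)

# A spiky logit vector (confident prediction) vs. a flat one (uncertain prediction).
spiky = torch.tensor([[9.0, 0.1, 0.2, 0.1]])
flat = torch.tensor([[0.4, 0.5, 0.3, 0.4]])
print(free_energy(spiky))   # ~ -9.0, strongly negative -> looks ID
print(free_energy(flat))    # ~ -1.8, much closer to zero -> looks more OOD
```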

It is hard to explain why on earth this formula suddenly represents an “energy”. Based on my googling, the background behind this formula is related to Gibbs energy (yes, the “Gibbs energy” from chemistry class) and more fundamental ideas used in science. I think this topic is outside the scope of this post and deserves a post of its own.

We want ID samples to have negative energy values and synthesized outliers to have positive energy. For an ID sample (normal training data), we want the classification logits to have a spike at the correct label and small values elsewhere. For an OOD sample, we don’t want the same behavior; we want the logits to “not” spike and instead stay comparable to each other, thereby leading to a low confidence score. When there is a spike, the free energy value will likely be negative, since a single large logit drags the free energy down significantly. On the other hand, when the logits are comparable to each other and no spike exists, the free energy is not dragged down as much.

In this work, the authors use a threshold of zero on the free energy: if it is higher than zero it is considered to have positive energy and is regarded as OOD, and if it is lower than zero it is considered to have negative energy and is regarded as ID. As far as I understand, this “zero” threshold is a choice made by the authors rather than something with a deep theoretical background.

Since we have both an ID sample and an OOD sample, we can combine this evaluation on both samples and use it as an objective function, i.e. use it as a loss. This is summed up as the “uncertainty loss”.

However, each term gives either 1 or 0 depending on whether the energy value is positive or negative. This kind of function is impossible to backpropagate through, so we smooth it like the following:

This form is differentiable and upholds our objective. Looking at the first term, to decrease it the E(v) value should get bigger, assuming theta_u, which is a trainable variable, is positive. For the second term, to decrease it the E(x) value should get smaller. Decreasing both terms aligns with decreasing the overall uncertainty loss, and also aligns with the preferred behavior of E(v) being positive and E(x) being negative.
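Below is one way such a smoothed loss could be written, as a binary logistic loss on the energy values with a trainable scale theta_u. This is my own sketch of the idea described above rather than the paper’s exact equation, so the parametrization may differ in detail.

```python
import torch
import torch.nn.functional as F

def uncertainty_loss(energy_id, energy_outlier, theta_u):
    """Smooth (differentiable) version of the ID/OOD discrimination objective.
    energy_id:      E(x) for in-distribution features -> pushed to be negative
    energy_outlier: E(v) for virtual outliers          -> pushed to be positive
    theta_u:        trainable positive scalar scaling the energy"""
    scores = torch.cat([theta_u * energy_id, theta_u * energy_outlier])
    labels = torch.cat([torch.zeros_like(energy_id),       # 0 = in-distribution
                        torch.ones_like(energy_outlier)])  # 1 = virtual outlier
    # For label 0 the loss shrinks as the energy becomes more negative;
    # for label 1 it shrinks as the energy becomes more positive.
    return F.binary_cross_entropy_with_logits(scores, labels)

theta_u = torch.nn.Parameter(torch.ones(1))  # learned jointly with the model
```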

The overall loss function is a weighted sum of L_loc, L_cls, and L_uncertainty.

If the task is image classification only, L_loc can be dropped. L_loc and L_cls are losses that are familiar from any object detection / image classification task. As you can see, the VOS method only adds to the training process without disrupting it.
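As a tiny illustration of how the terms combine (the beta symbol and the numbers below are placeholders of mine, not values from the paper):

```python
import torch

loss_cls = torch.tensor(0.7)   # classification loss, as usual
loss_loc = torch.tensor(0.3)   # box regression loss (dropped for pure classification)
loss_unc = torch.tensor(0.5)   # uncertainty loss from the sketch above
beta = 0.1                     # weight on the uncertainty term (hyperparameter)

total_loss = loss_cls + loss_loc + beta * loss_unc
```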

OOD Detection During Inference

Until now we have discussed how virtual outlier samples can be obtained and utilized in the training process. But the idea behind them can also be used at inference to see whether a predicted output is OOD or not.

At inference, we can get the free energy value from the logits of the classification output. The authors propose fitting a logistic regression on the free energy value and finding an appropriate threshold that can discriminate OOD from ID. This threshold value is typically chosen so that a high fraction of ID data is correctly classified.
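Here is a minimal sketch of the inference-time decision, assuming we have free-energy values for a held-out set of ID predictions to calibrate the threshold on. The paper thresholds a logistic-regression score computed from the energy; the sketch below skips that step and thresholds the energy directly, and taking the 95th percentile of ID energies is my concrete stand-in for “a high fraction of ID data is correctly classified”.

```python
import torch

def pick_threshold(id_energies, keep_fraction=0.95):
    """Choose a threshold so that `keep_fraction` of ID samples stay below it
    (and are therefore kept as in-distribution)."""
    return torch.quantile(id_energies, keep_fraction)

def is_ood(logits, threshold):
    """Flag a prediction as OOD when its free energy exceeds the threshold."""
    energy = -torch.logsumexp(logits, dim=-1)
    return energy > threshold

# hypothetical usage with made-up numbers
id_energies = torch.randn(1000) * 2.0 - 8.0          # ID energies cluster well below zero
threshold = pick_threshold(id_energies)
print(is_ood(torch.tensor([[0.4, 0.5, 0.3, 0.4]]), threshold))
```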

Overall Picture of VOS

This figure shows how a typical object detection model with VOS applied would look overall.

This figure summarizes how VOS is applied in training and inference phase.

Experiments

The authors used PASCAL VOC and BDD-100K as the ID datasets, and images from MSCOCO and OpenImages as OOD data, where they manually checked that the OOD datasets had no categories overlapping with the training dataset.

For metrics, there are two for OOD detection

  • false positive rate (FPR95) of OOD samples when the true positive rate of ID samples is at 95% (a small code sketch after these metric bullets shows one way to compute it)
  • AUROC (I’m guessing this is the area under the ROC curve of the OOD yes/no classification on OOD samples…)

and one for object detection performance

  • mAP (I’m guessing the authors are referring to mAP on the ID dataset, since they said the OOD datasets were ensured to consist of categories not found in the ID dataset)
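For what it’s worth, here is how I read the FPR95 metric in code, treating the free energy as an OOD-ness score (higher means more OOD); the implementation details are my own, not the paper’s evaluation code.

```python
import torch

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples still accepted as ID when the threshold is set
    so that 95% of ID samples are accepted as ID. Lower is better."""
    threshold = torch.quantile(id_scores, 0.95)           # 95% of ID fall below this
    false_positive_rate = (ood_scores <= threshold).float().mean()
    return false_positive_rate
```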

They used two model architectures: ResNet-50 and RegNetX-4.0GF.

Compared to other OOD detection methods, VOS shows superior performance.

VOS is good at OOD detection AND it also enhances the original task: object detection. It is astonishing to see how including an OOD detection objective has indirectly helped object detection.

Here is a sample image from this experiment:

You can see that a VOS-applied model can remove false predictions by checking whether they seem to be OOD, which would help greatly in increasing the precision of the model. And who knows? With the OOD-classified boxes, one could use this information to do something useful later on!

The authors also compare VOS with other “synthetic OOD training” methods.

Even here VOS performs well, which is significant since the other synthetic methods have their own drawbacks and VOS seems to compensate for those drawbacks efficiently.

Points that were confusing

  • VOS is not some new model architecture, but rather a training method that can be widely applied to existing training procedures to make the model OOD-aware.
  • VOS doesn’t modify the model architecture. It doesn’t require adding another output branch solely for predicting whether an input is OOD or not; this decision is made using the logits of the existing classification branch. No branch added!!

I think this is a simple yet very useful method that can help increase model performance, which is very hard to come by these days when a lot of work is accompanied by larger models, larger datasets, and larger everything.

The authors have conducted a more thorough analysis which was hard to summarize in this post, so I highly recommend reading the paper for the details if you are interested in more.

Also, this post is written based on my understanding of the paper, so I may have gotten some parts wrong; if so, please let me know!
