paper summary: “Designing Network Design Spaces” (RegNet)

Chadrick
Apr 4, 2021

arxiv: https://arxiv.org/abs/2003.13678

key points

  • finding an optimal design space instead of a single model architecture
  • through experiments on these design spaces, the authors found a few practices that generally give better performance

What is “design space”?

A design space is defined by model-building parameters, each with its own range, and therefore defines the set of possible model structures.

Why chase design space rather than a singular design?

By chasing design spaces instead of individual networks, we can discover general design principles that hold across a wide range of settings.

How to evaluate design space?

The quality of a design space can be measured by sampling network architectures from it and evaluating them. In other words, the distribution of performance of the sampled model architectures represents the quality of the design space.

The paper follows Radosavovic et al., who proposed using the error distribution of the sampled model architectures. Each sampled architecture is briefly trained, in this work on the ImageNet dataset, and its error rate is recorded. The paper does not explicitly say which error it measures, but I’m guessing it is the error rate on the ImageNet test set, since the models are trained on ImageNet.
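As a concrete illustration of comparing design spaces via error distributions, here is a minimal sketch of computing an error EDF (empirical distribution function) over sampled models. The helper name and the numbers are made up for illustration, not taken from the paper.

```python
import numpy as np

def error_edf(errors, thresholds):
    """Empirical distribution function (EDF) of model errors.

    For each error threshold e, return the fraction of sampled models whose
    error is below e. A better design space has an EDF curve that rises
    earlier, i.e. it contains more low-error models.
    """
    errors = np.asarray(errors)
    return np.array([(errors < e).mean() for e in thresholds])

# Hypothetical usage: `errors` would come from briefly training each sampled
# architecture on ImageNet and recording its error rate (made-up numbers here).
errors = [34.2, 36.8, 41.0, 38.5, 35.1]
thresholds = np.linspace(30, 50, 101)
edf = error_edf(errors, thresholds)
```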

Network architectures can vary infinitely. How do we contain this?

So how can we parameterize a design space? With unlimited freedom, there are infinitely many ways to create a network structure by varying the types, numbers, and combinations of layers. Searching all of them is impossible, so the paper fixes some basic model-building rules.

Popular architectures such as ResNet, ResNeXt, and EfficientNet consist of basic blocks organized into several stages. The number of blocks in each stage, the number of stages, and the number of channels used in each stage can vary. Therefore this work confines the parameterization of network structure to a handful of aspects such as:

  • width
  • depth
  • groups within the block
  • etc…

If we vary only these aspects and leave everything else fixed, we have a constrained design space, which is called AnyNet. Each variant of AnyNet has its own restrictions on the model structure parameters.

AnyNet?

AnyNet assumes a fixed block type. The structural parameters are:

  • number of blocks (depth)
  • block width (# of channels)
  • bottleneck ratio and group width (block params)

These parameters determine the compute, number of parameters, and memory usage of the network.

The overall structure of AnyNet is also fixed, consisting of three parts:

  • stem: input processing layers
  • body: the main part of the network, whose structure is determined by the architecture parameters
  • head: the final layers predicting the output classes

The stem and head are fixed; only the body changes.

The body has 4 stages, and each stage consists of a sequence of identical blocks.

The first block of each stage uses stride=2 in its convolution layer, which also acts as downsampling.

The blocks are residual bottleneck blocks with group convolution. The paper refers to this as the X block, hence the name AnyNetX.
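To make the block and stage structure concrete, here is a rough PyTorch sketch of an X block (residual bottleneck with group convolution) and a stage built from it. This is my own approximation of the description above; details such as the exact BN/ReLU placement and the projection shortcut are assumptions, not the authors’ code.

```python
import torch.nn as nn

class XBlock(nn.Module):
    """Sketch of a residual bottleneck block with group convolution."""

    def __init__(self, w_in, w_out, stride, bottleneck_ratio, group_width):
        super().__init__()
        w_b = w_out // bottleneck_ratio          # bottleneck width
        assert w_b % group_width == 0            # group width must divide bottleneck width
        groups = w_b // group_width
        self.f = nn.Sequential(
            nn.Conv2d(w_in, w_b, 1, bias=False),
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_b, 3, stride=stride, padding=1,
                      groups=groups, bias=False),
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_out, 1, bias=False),
            nn.BatchNorm2d(w_out),
        )
        # Projection shortcut when the shape changes (e.g. first block of a stage).
        self.proj = None
        if stride != 1 or w_in != w_out:
            self.proj = nn.Sequential(
                nn.Conv2d(w_in, w_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(w_out),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        shortcut = x if self.proj is None else self.proj(x)
        return self.relu(self.f(x) + shortcut)

def make_stage(w_in, w_out, depth, bottleneck_ratio, group_width):
    """One stage: `depth` identical blocks, the first one with stride=2."""
    blocks = [XBlock(w_in, w_out, 2, bottleneck_ratio, group_width)]
    blocks += [XBlock(w_out, w_out, 1, bottleneck_ratio, group_width)
               for _ in range(depth - 1)]
    return nn.Sequential(*blocks)
```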

The input shape is fixed to 224 x 224.

With these basic conditions, each block has three parameters (block width, bottleneck ratio, group width), and each stage adds one more parameter, the number of blocks in that stage. So each stage has 4 parameters, and with 4 stages we have 16 parameters in total, i.e. 16 degrees of freedom.

The possible values for each parameter are restricted in the paper, but even so, with 16 degrees of freedom the number of possible combinations is very large.
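To illustrate the 16 degrees of freedom, here is a minimal sketch of sampling one configuration from this space. The parameter ranges follow my reading of the paper, and the validity check (the group width must divide the bottleneck width) is left out for simplicity.

```python
import random

def sample_anynetx_config(num_stages=4, seed=None):
    """Randomly sample one AnyNetX configuration: 4 parameters per stage,
    4 stages, so 16 degrees of freedom in total."""
    rng = random.Random(seed)
    config = []
    for _ in range(num_stages):
        config.append({
            "depth": rng.randint(1, 16),                      # d_i <= 16
            "width": 8 * rng.randint(1, 128),                 # w_i <= 1024, multiple of 8
            "bottleneck_ratio": rng.choice([1, 2, 4]),        # b_i
            "group_width": rng.choice([1, 2, 4, 8, 16, 32]),  # g_i
        })
    return config

# Draw one architecture from the (unrestricted) AnyNetXA design space.
print(sample_anynetx_config(seed=0))
```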

This AnyNet with only the basic restrictions is named AnyNetXA.

The paper adds more restrictions, and each new restriction defines a new AnyNet variant with a new suffix. Roughly, the new variants and their added restrictions are:

  • AnyNetXB: a single bottleneck ratio shared across all stages
  • AnyNetXC: additionally, a single group width shared across all stages
  • AnyNetXD: additionally, stage widths must not decrease going deeper
  • AnyNetXE: additionally, stage depths must not decrease going deeper

RegNet

Having reached AnyNetXE, if we select the 20 best models from AnyNetXE and plot block width against block index, we can see that the plots share a pattern: block width grows roughly linearly with block depth.

The fitted linear curve explains the growing block width well as blocks get deeper, but we cannot use the widths predicted by this curve directly, since blocks with different widths cannot simply be connected. Therefore an equal width is used within each stage, and the per-stage widths are obtained by quantizing the fitted curve. This is called quantized linear parameterization in the paper. An example of such a structure is shown below.
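Below is a short sketch of the quantized linear parameterization as I understand it: per-block widths follow a line w_0 + w_a * j and are then snapped to per-stage widths of the form w_0 * w_m^k, so that all blocks in a stage share one width. The rounding details (including the multiple-of-8 step) are my assumptions.

```python
import numpy as np

def quantized_linear_widths(d, w_0, w_a, w_m):
    """Turn the linear per-block widths into per-stage widths and depths."""
    j = np.arange(d)
    u = w_0 + w_a * j                              # linear per-block widths
    s = np.round(np.log(u / w_0) / np.log(w_m))    # quantized exponents
    w = w_0 * np.power(w_m, s)
    w = (np.round(w / 8) * 8).astype(int)          # round to multiples of 8
    # Group consecutive blocks with equal width into stages.
    stage_widths, stage_depths = [], []
    for width in w:
        if stage_widths and stage_widths[-1] == width:
            stage_depths[-1] += 1
        else:
            stage_widths.append(int(width))
            stage_depths.append(1)
    return stage_widths, stage_depths

# Example with made-up RegNet parameters (d, w_0, w_a, w_m):
print(quantized_linear_widths(d=16, w_0=48, w_a=36.0, w_m=2.5))
```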

Adding this restriction gives a new design space called RegNet. It is described by only 6 parameters (depth d, initial width w_0, slope w_a, width multiplier w_m, bottleneck ratio b, and group width g), so it covers a far smaller set of possibilities than the original AnyNetX.

Since RegNet uses the X block, it can also be called RegNetX.

The paper empirically finds that models sampled from RegNet perform better than models sampled from the other design spaces, even when the compute regime and the number of stages change.

RegNetX Analysis

The paper examines RegNetX performance across different regimes, such as higher FLOPs. In doing so, the authors test whether well-known model-building principles live up to their reputation.

Through these experiments, the authors find the following:

  • optimal depth: around 20 blocks (approx. 60 layers)
  • optimal bottleneck ratio = 1, meaning no bottleneck is better
  • optimal width multiplier approx. 2.5, slightly higher than the commonly used value of 2
  • the optimal group width and initial block width increase with model complexity
  • the inverted bottleneck degrades performance
  • depthwise convolutions degrade performance
  • increasing the input image size degrades performance, even with higher FLOPs

The last result (changing the input resolution) could be misleading. I first thought it was strange: a higher input resolution carries more detailed information, so I expected performance to improve as FLOPs increased, yet the paper reports the opposite. After some consideration, I think my intuition isn’t wrong after all. When the paper says “higher FLOPs”, the network is still constrained to 4 stages, and FLOPs are increased by adding blocks to a stage or making blocks larger. A higher-resolution input would arguably call for more stages; with the stage count fixed, it makes sense that RegNet gained no performance improvement from increased input resolution.

RegNetY?

The authors found that adding squeeze-and-excitation (SE) to RegNetX yields better performance, and they named this design space RegNetY.
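For reference, here is a minimal sketch of a squeeze-and-excitation module, since adding SE is the only difference between RegNetX and RegNetY. The reduction ratio and where exactly it sits inside the block are my assumptions, not taken from the paper.

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation: globally pool the feature map, pass it through
    a small bottleneck MLP, and rescale each channel by the result."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))           # excite: per-channel rescaling
```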

Comparison with existing networks

In the mobile regime, RegNetX and RegNetY around 600MF mostly perform better than existing models.

Compared to ResNet and ResNeXt at similar FLOPs, RegNet models perform better.

Compared to EfficientNet at similar FLOPs, EfficientNet performs slightly better, but the RegNets have far fewer parameters and activations, so the authors argue that RegNet models can be more efficient and faster than EfficientNet.

Swish vs. ReLU

The authors experimented with using the Swish activation function instead of ReLU and found:

  • Swish outperforms ReLU at low FLOPs, but at high FLOPs the opposite holds.
  • In the depthwise-conv setting (group width = 1), Swish is better. The reason is unknown.
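For reference, Swish (also known as SiLU) is simply x * sigmoid(x); a minimal sketch of the two activations being compared:

```python
import torch

def relu(x):
    return torch.relu(x)           # max(0, x)

def swish(x):
    return x * torch.sigmoid(x)    # Swish / SiLU
```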
