Paper Review: A ConvNet for the 2020s

Authors: Zhuang Liu1,2*, Hanzi Mao1, Chao-Yuan Wu1, Christoph Feichtenhofer1, Trevor Darrell2, Saining Xie1†

Affiliation: 1Facebook AI Research (FAIR), 2UC Berkeley

Code: https://github.com/facebookresearch/ConvNeXt

Introduction

A glance at ImageNet-1K classification results for ConvNets and vision Transformers.

1. Macro design

1.1 Changing stage compute ratio

Following Swin Transformer’s design, the number of blocks in each stage is adjusted from ResNet-50’s (3, 4, 6, 3) to (3, 3, 9, 3).
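As a quick sketch of what this change amounts to in code (PyTorch assumed; make_stages and the nn.Identity placeholder blocks are illustrative, not from the official repo), only the per-stage block counts change:

```python
import torch.nn as nn

def make_stages(depths, dims, block_fn):
    """Illustrative helper: build one nn.Sequential of blocks per stage."""
    return nn.ModuleList(
        [nn.Sequential(*[block_fn(dim) for _ in range(depth)])
         for depth, dim in zip(depths, dims)]
    )

# ResNet-50's stage ratio vs. the Swin-T-aligned ratio adopted for ConvNeXt-T.
resnet50_stages = make_stages((3, 4, 6, 3), (256, 512, 1024, 2048),
                              lambda dim: nn.Identity())  # placeholder block
convnext_stages = make_stages((3, 3, 9, 3), (96, 192, 384, 768),
                              lambda dim: nn.Identity())  # placeholder block
print([len(s) for s in convnext_stages])  # [3, 3, 9, 3]
```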

1.2 Changing stem to “Patchify”

Swin Transformer uses a similar “patchify” layer, but with a smaller patch size of 4 to accommodate the architecture’s multi-stage design.

-> Replace the ResNet-style stem cell with a patchify layer implemented using a 4×4, stride-4 convolutional layer.
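A minimal PyTorch sketch contrasting the two stems (channel widths follow ResNet-50 and ConvNeXt-T; the variable names are my own, not the official implementation):

```python
import torch
import torch.nn as nn

# ResNet-style stem: 7x7 conv with stride 2, followed by a stride-2 max pool.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a single non-overlapping 4x4 conv with stride 4,
# mapping a 224x224 input to 56x56 feature maps, matching Swin's patch size.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```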

1.3 ResNeXt-ify

Use depthwise convolution and increase the network width to the same number of channels as Swin-T’s (from 64 to 96).
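A small sketch of what depthwise convolution means here, assuming PyTorch: setting groups equal to the channel count makes each channel convolve with its own filter, the extreme case of ResNeXt’s grouped convolution.

```python
import torch
import torch.nn as nn

dim = 96  # widened to match Swin-T's channel count (up from ResNet's 64)

# Depthwise convolution: groups == channels, so each channel is filtered
# independently; information is mixed only in the spatial dimension here.
depthwise = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

x = torch.randn(1, dim, 56, 56)
print(depthwise(x).shape)  # torch.Size([1, 96, 56, 56])
```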

1.4 Inverted Bottleneck

One important design in every Transformer block is the inverted bottleneck: the hidden dimension of the MLP block is four times wider than the input dimension (see (a) and (b) below).

Block modifications and resulting specifications. (a) is a ResNeXt block; in (b) we create an inverted bottleneck block, and in (c) the position of the spatial depthwise conv layer is moved up.
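A PyTorch sketch of the (a) and (b) layouts with dim = 96 (normalization and activation layers omitted for brevity; this is illustrative, not the official block definition):

```python
import torch
import torch.nn as nn

dim = 96

# (a) ResNeXt-style bottleneck: wide -> narrow -> wide (384 -> 96 -> 384).
bottleneck = nn.Sequential(
    nn.Conv2d(4 * dim, dim, kernel_size=1),
    nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise
    nn.Conv2d(dim, 4 * dim, kernel_size=1),
)

# (b) Inverted bottleneck, as in a Transformer MLP block:
# narrow -> 4x wide -> narrow (96 -> 384 -> 96).
inverted = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),  # depthwise
    nn.Conv2d(4 * dim, dim, kernel_size=1),
)

x_wide = torch.randn(1, 4 * dim, 56, 56)
x_narrow = torch.randn(1, dim, 56, 56)
print(bottleneck(x_wide).shape)   # torch.Size([1, 384, 56, 56])
print(inverted(x_narrow).shape)   # torch.Size([1, 96, 56, 56])
```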

1.5 Large Kernel Sizes

With the depthwise conv layer moved up (c), the heavy spatial mixing runs on the narrow 96-channel path, and the kernel size is enlarged from 3×3 to 7×7; the paper reports that the benefit saturates at 7×7.
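A sketch of the resulting (c)-style block in PyTorch (again omitting normalization and activation; the layout is illustrative, not the official code):

```python
import torch
import torch.nn as nn

dim = 96

# (c)-style block: the depthwise conv is moved to the top so the large kernel
# operates on the narrow 96-channel path; the two 1x1 layers then form the
# inverted bottleneck. Kernel size enlarged from 3x3 to 7x7.
block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise
    nn.Conv2d(dim, 4 * dim, kernel_size=1),
    nn.Conv2d(4 * dim, dim, kernel_size=1),
)

x = torch.randn(1, dim, 56, 56)
print(block(x).shape)  # torch.Size([1, 96, 56, 56])
```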

2. Micro design

2.1 Replacing ReLU with GELU

GELU is utilized in the most advanced Transformers, including Google’s BERT and OpenAI’s GPT-2, and, most recently, ViTs.
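For intuition, a tiny PyTorch comparison (nothing here is specific to ConvNeXt): GELU weights inputs by the Gaussian CDF instead of hard-thresholding at zero, so small negative values are attenuated rather than removed.

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)
print(nn.ReLU()(x))  # negative inputs are zeroed out
print(nn.GELU()(x))  # negative inputs are smoothly attenuated toward zero
```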

2.2 Fewer activation functions

Consider a Transformer block with key/query/value linear embedding layers, the projection layer, and two linear layers in an MLP block: there is only one activation function, inside the MLP block. The ConvNeXt block follows this style, removing all GELU layers except for one between the two 1×1 layers.
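A sketch of the block with a single GELU between the two pointwise (1×1) layers (PyTorch assumed, normalization still omitted; illustrative rather than the official definition):

```python
import torch
import torch.nn as nn

dim = 96

# Mirroring a Transformer MLP block: only one activation, placed between
# the two pointwise (1x1) layers; no activation after the depthwise conv.
block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise
    nn.Conv2d(dim, 4 * dim, kernel_size=1),
    nn.GELU(),                                                  # single activation
    nn.Conv2d(4 * dim, dim, kernel_size=1),
)

x = torch.randn(1, dim, 56, 56)
print(block(x).shape)  # torch.Size([1, 96, 56, 56])
```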

2.3 Fewer normalization layers

Transformer blocks usually have fewer normalization layers as well. Accordingly, two BatchNorm layers are removed, leaving a single normalization layer before the 1×1 conv layers.
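Adding that single normalization layer to the sketch (PyTorch assumed; BatchNorm here, which the next step swaps for LayerNorm):

```python
import torch
import torch.nn as nn

dim = 96

# Only one normalization layer per block, placed after the depthwise conv
# and before the 1x1 expansion.
block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),
    nn.BatchNorm2d(dim),          # the single normalization layer
    nn.Conv2d(dim, 4 * dim, kernel_size=1),
    nn.GELU(),
    nn.Conv2d(4 * dim, dim, kernel_size=1),
)

x = torch.randn(1, dim, 56, 56)
print(block(x).shape)  # torch.Size([1, 96, 56, 56])
```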

2.4 Substituting BN with LN

LN has been used in Transformers, resulting in good performance across different application scenarios. Directly substituting LN for BN in the ConvNeXt block trains without difficulty and performs slightly better.
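One practical caveat: nn.LayerNorm normalizes over the last dimension, while conv features are NCHW. A minimal workaround, sketched below, is to permute to channels-last and back (my own illustration; the official repo, as far as I can tell, uses its own LayerNorm variant that supports both channel orders):

```python
import torch
import torch.nn as nn

class ChannelsLastLayerNorm(nn.Module):
    """Apply LayerNorm over the channel dimension of an NCHW tensor
    by permuting to NHWC, normalizing, and permuting back."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(dim, eps=eps)

    def forward(self, x):
        x = x.permute(0, 2, 3, 1)      # NCHW -> NHWC
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)   # NHWC -> NCHW

x = torch.randn(1, 96, 56, 56)
print(ChannelsLastLayerNorm(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```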

Block designs for a ResNet and a ConvNeXt.

2.5 Separate downsampling layers

In Swin Transformers, a separate downsampling layer is added between stages. Several LN layers are also used in Swin Transformers: one before each downsampling layer, one after the stem, and one after the final global average pooling.
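A sketch of one such downsampling layer in PyTorch: a LayerNorm for training stability followed by a 2×2, stride-2 conv, instead of downsampling inside a residual block (the class name and the permute-based LN placement are my own illustration):

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Separate downsampling layer placed between stages."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.norm = nn.LayerNorm(in_dim, eps=1e-6)
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)

    def forward(self, x):                  # x: NCHW
        x = x.permute(0, 2, 3, 1)          # apply LN over channels (NHWC)
        x = self.norm(x)
        x = x.permute(0, 3, 1, 2)
        return self.reduce(x)              # halve the spatial resolution

x = torch.randn(1, 96, 56, 56)
print(Downsample(96, 192)(x).shape)  # torch.Size([1, 192, 28, 28])
```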

3. Experiments

Classification accuracy on ImageNet-1K.
Classification accuracy on ImageNet-22K.
Comparing isotropic ConvNeXt and ViT.