AudioMAE
This paper introduces AudioMAE, a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. It sets a new state of the art among models with in-domain self-supervised pre-training, i.e., without out-of-domain supervised pre-training.
Model Architecture and Implementation
- Preprocessing
- audio recording -> Mel spectrogram -> non-overlapping regular grid patches
- Embedding
- Linear projection: patches are flattened and embedded
- fixed sinusoidal positional embeddings are added
- the projection is implemented as a convolutional kernel with size (16, 16) and stride (16, 16) in time and frequency (a sketch follows below)
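A minimal sketch of the patch embedding in PyTorch, assuming the 1×1024×128 spectrogram described under Additional Details and the ViT-Base width of 768; the strided (16, 16) convolution is equivalent to flattening non-overlapping 16×16 patches and applying a shared linear projection:

```python
import torch
import torch.nn as nn

# Patch embedding as a strided convolution (a sketch; the 768-dim width and the
# (1, 1024, 128) input shape follow ViT-Base and the AudioSet setting below).
patch_embed = nn.Conv2d(in_channels=1, out_channels=768,
                        kernel_size=(16, 16), stride=(16, 16))

spec = torch.randn(1, 1, 1024, 128)           # (batch, channel, time, frequency)
tokens = patch_embed(spec)                    # (1, 768, 64, 8): a 64x8 grid of patch embeddings
tokens = tokens.flatten(2).transpose(1, 2)    # (1, 512, 768): one token per patch
# fixed sinusoidal positional embeddings of shape (512, 768) are then added to `tokens`
```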
- Masking
- unstructured masking with a high ratio (80%) during self-supervised pre-training (a sketch follows below)
- structured (time + frequency) masking with a low ratio during supervised fine-tuning
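A minimal sketch of the unstructured 80% masking, following the original MAE recipe (the helper name `random_masking` and the bookkeeping tensors are illustrative); only the visible tokens are fed to the encoder, while `mask` and `ids_restore` are kept for the decoder and the loss:

```python
import torch

def random_masking(x, mask_ratio=0.8):
    """Randomly mask patch tokens x of shape (B, N, D), keeping (1 - mask_ratio) of them."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation of patch indices
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation (to unshuffle later)
    ids_keep = ids_shuffle[:, :len_keep]              # indices of the visible 20%
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                           # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)         # back in the original patch order
    return x_visible, mask, ids_restore
```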
- Encoder
- 12-layer ViT-Base (a standard Transformer encoder) that processes only the 20% non-masked patches, sharply reducing the O(n^2) self-attention cost
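A short sketch of the encoder pass, assuming `blocks` is a stack of 12 standard ViT-Base Transformer blocks and `x_visible` comes from the masking step above (both names are illustrative):

```python
def encode(x_visible, blocks, norm):
    # x_visible: (B, N_visible, 768); only the ~20% non-masked tokens enter the encoder,
    # so the quadratic self-attention cost is paid on a much shorter sequence.
    x = x_visible
    for blk in blocks:
        x = blk(x)
    return norm(x)
```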
- Decoder
- encoded patches are padded with trainable mask tokens
- the original patch order is restored (unshuffled)
- the decoder's fixed sinusoidal positional embeddings are added
- 16 standard Transformer blocks with shifted local attention (chosen over global and hybrid attention)
- a linear head predicts and reconstructs the input spectrogram
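A sketch of how the decoder input is assembled (names such as `decoder_embed`, `mask_token`, and `decoder_pos_embed` are illustrative, and the class token used in MAE-style implementations is omitted for brevity):

```python
import torch

def prepare_decoder_input(x_encoded, ids_restore, decoder_embed, mask_token, decoder_pos_embed):
    # x_encoded: (B, N_visible, D_enc) from the encoder; ids_restore: (B, N) from the masking step
    B, N_visible, _ = x_encoded.shape
    N = ids_restore.shape[1]
    x = decoder_embed(x_encoded)                        # project to the decoder width
    masks = mask_token.expand(B, N - N_visible, -1)     # trainable mask tokens for the unknown patches
    x = torch.cat([x, masks], dim=1)
    x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, x.shape[-1]))  # restore patch order
    return x + decoder_pos_embed                        # add fixed sinusoidal positional embeddings
```

The result is then passed through the 16 decoder blocks with shifted local attention, and the linear head maps each token back to the spectrogram values of its 16×16 patch.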
- Local attention mechanism
- In the decoder, in addition to standard global self-attention, the paper studies a local attention mechanism that groups the spectrogram patches into local windows and computes self-attention within each window (a sketch follows after this list).
- Two types are studied
- Shifted window attention, inspired by Swin Transformers
- the attention window is shifted by 50% of the window size between consecutive Transformer decoder layers
- to handle the margin created by shifting, the spectrogram is cyclically shifted toward the top-left
- Hybrid window attention (global + local attention)
- computes local attention within a window in all but the last few (top) layers, which use global attention
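A simplified sketch of shifted local window attention over the decoder's 64×8 patch grid; the (4, 4) window size and all helper names are assumptions, and `attn_fn` stands for any self-attention module applied independently to each window:

```python
import torch

def window_partition(x, grid_t, grid_f, win_t, win_f):
    """Group tokens (B, grid_t*grid_f, D) into non-overlapping (win_t, win_f) windows."""
    B, N, D = x.shape
    x = x.view(B, grid_t // win_t, win_t, grid_f // win_f, win_f, D)
    x = x.permute(0, 1, 3, 2, 4, 5)                   # (B, t_blocks, f_blocks, win_t, win_f, D)
    return x.reshape(-1, win_t * win_f, D)            # one row per local window

def window_unpartition(windows, grid_t, grid_f, win_t, win_f):
    """Inverse of window_partition: back to (B, grid_t*grid_f, D)."""
    num_windows = (grid_t // win_t) * (grid_f // win_f)
    B = windows.shape[0] // num_windows
    x = windows.view(B, grid_t // win_t, grid_f // win_f, win_t, win_f, -1)
    x = x.permute(0, 1, 3, 2, 4, 5)                   # (B, t_blocks, win_t, f_blocks, win_f, D)
    return x.reshape(B, grid_t * grid_f, -1)

def shifted_local_attention(x, attn_fn, layer_idx, grid_t=64, grid_f=8, win_t=4, win_f=4):
    """Local window attention; every other decoder layer shifts the grid by half a window."""
    B, N, D = x.shape
    shift = (layer_idx % 2 == 1)
    x = x.view(B, grid_t, grid_f, D)
    if shift:
        # pad the margin by cyclically shifting the grid toward the top-left (Swin-style)
        x = torch.roll(x, shifts=(-win_t // 2, -win_f // 2), dims=(1, 2))
    x = x.reshape(B, N, D)
    w = attn_fn(window_partition(x, grid_t, grid_f, win_t, win_f))   # attention inside each window
    x = window_unpartition(w, grid_t, grid_f, win_t, win_f)
    if shift:
        x = x.view(B, grid_t, grid_f, D)
        x = torch.roll(x, shifts=(win_t // 2, win_f // 2), dims=(1, 2)).reshape(B, N, D)
    return x
```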
- Loss
- MSE between the prediction and the input spectrogram, averaged over the masked (unknown) patches only (a sketch follows below)
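A sketch of the loss computation, assuming `pred` and `target` hold per-patch spectrogram values of shape (B, N, patch_dim) and `mask` comes from the masking step (1 for masked patches, 0 for visible ones):

```python
loss = ((pred - target) ** 2).mean(dim=-1)   # per-patch mean squared error
loss = (loss * mask).sum() / mask.sum()      # averaged over the masked patches only
```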
- Finetuning
- only the AudioMAE encoder is used (the decoder is discarded)
- optional masking: a portion of patches can be removed to further regularise learning from a limited view of the spectrogram input; as a side effect, this also reduces computation during fine-tuning
- an average pooling layer followed by a linear layer is added on top for classification tasks (a sketch follows after this list)
- Techniques:
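A minimal sketch of the fine-tuning classification head; the LayerNorm placement and the 527-class AudioSet output size are assumptions for illustration:

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Mean-pool the encoder's patch tokens, then apply a linear classifier."""
    def __init__(self, embed_dim=768, num_classes=527):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):          # (B, N, embed_dim) from the AudioMAE encoder
        x = patch_tokens.mean(dim=1)          # average pooling over the (possibly masked) patches
        return self.head(self.norm(x))
```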
- Additional Details
- we transform the raw waveform (pre-processed to a mono channel at a 16,000 Hz sampling rate) into 128 Kaldi-compatible Mel-frequency bands with a 25 ms Hanning window that shifts every 10 ms. For a 10-second recording in AudioSet, the resulting spectrogram has dimensions 1×1024×128 (a sketch follows below).
- During fine-tuning, we employ a lower masking ratio (0.3 in time and 0.3 in frequency)
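The preprocessing above can be reproduced with torchaudio's Kaldi-compatible filterbank; the window length, shift, and number of Mel bands follow the description, while the remaining flags are assumptions:

```python
import torchaudio

wav, sr = torchaudio.load("clip.wav")          # expected: mono, 16 kHz
fbank = torchaudio.compliance.kaldi.fbank(
    wav,
    sample_frequency=16000,
    num_mel_bins=128,          # 128 Kaldi-compatible Mel-frequency bands
    frame_length=25.0,         # 25 ms Hanning window ...
    frame_shift=10.0,          # ... shifted every 10 ms
    window_type="hanning",
    htk_compat=True,           # assumption: HTK-compatible bins, as in common AST-style pipelines
)
# fbank: (num_frames, 128); roughly 1024 frames for a 10-second AudioSet clip -> 1x1024x128
```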
AudioMAE vs AST
- Positional Embeddings: AST uses learned positional embeddings, whereas AudioMAE uses fixed sinusoidal ones.
- Encoder Architecture: AST uses a standard Transformer encoder; the AudioMAE paper describes its encoder as ViT-B, which is the same standard Transformer encoder architecture.
- AST is an encoder-only architecture trained with supervised learning, whereas AudioMAE is an encoder-decoder trained with self-supervised learning (only the encoder is kept for downstream tasks).
Generalisability:
- TBA
Limitations:
- TBA
Extended Research Direction:
- TBA
- In paper: multimodal self-supervised learning with a joint audio-visual MAE approach, as these domains share natural correspondences in video data.