MoE (Mixture of Experts) LLMs - Full explanation

What are Mixture of Experts (MoE) LLMs?

The scale of a model significantly influences its quality. With a fixed computing budget, it is more effective to train a larger model for fewer steps than to train a smaller model for more steps.

Mixture of Experts (MoE) models allow for pretraining with much less compute, enabling significant scaling of the model or dataset size within the same compute budget as a dense model. Specifically, an MoE model can achieve the same quality as its dense counterpart much faster during pretraining.

In the context of transformer models, an MoE consists of two primary components:

  • Sparse MoE layers: These replace the dense feed-forward network (FFN) layers and comprise multiple "experts" (e.g., 8), each of which is a neural network (in practice, typically an FFN itself).
  • Gate network or router: This component determines which tokens are sent to which expert.

While MoEs offer benefits like efficient pretraining and faster inference, they also present challenges:

  • Training: MoEs allow for much more compute-efficient pretraining but have historically struggled to generalize during fine-tuning, often leading to overfitting.
  • Inference: Only some parameters are used during inference, resulting in much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded into RAM, leading to high memory requirements.
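
To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python, using approximate, publicly reported figures for Mixtral 8x7B (a model that reappears at the end of this post); the numbers are illustrative, not authoritative:

```python
# Rough, illustrative figures for Mixtral 8x7B (approximate public numbers).
total_params  = 47e9   # every expert must be loaded into memory
active_params = 13e9   # parameters actually used per token (2 of 8 experts + shared layers)

bytes_per_param = 2    # e.g., float16 / bfloat16 weights
print(f"Weights to keep in memory: ~{total_params * bytes_per_param / 1e9:.0f} GB")
print(f"Per-token compute roughly that of a ~{active_params / 1e9:.0f}B dense model")
```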

The History of MoE

The concept of MoEs originated in the 1991 paper "Adaptive Mixtures of Local Experts." The idea, similar to ensemble methods, was to supervise a system of separate networks, each specializing in a different subset of the training data.

Between 2010 and 2015, two research areas significantly contributed to the advancement of MoEs:

  • Experts as components: Using MoEs as layers within a larger multilayer network, enabling models to be both large and efficient.
  • Conditional computation: Methods to dynamically activate or deactivate components based on the input token.

MoEs have enabled training of multi-trillion-parameter models, such as the open-sourced 1.6T-parameter Switch Transformers, among others. MoEs have also been explored in computer vision, but this post focuses on the NLP domain.

What is Sparsity?

Sparsity leverages conditional computation: whereas a dense model uses all of its parameters for every input, a sparse model activates only specific parts of the system for each input. In an MoE layer, the router scores every expert for each token and dispatches the token to only the top-k of them.

This setup introduces challenges, such as uneven batch sizes across experts and underutilization of some of them. However, with a low enough k (e.g., one or two), we can train and run inference much faster than if every expert processed every token.
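
As a minimal illustration of top-k gating (names and shapes here are assumptions for the sketch, not taken from any particular codebase): each token gets a score per expert, only the k highest are kept, and only those experts need to run for that token.

```python
import torch

num_tokens, num_experts, k = 4, 8, 2
router_logits = torch.randn(num_tokens, num_experts)        # scores from a small linear "gate"
topk_scores, topk_experts = router_logits.topk(k, dim=-1)   # per-token choice of k experts
gate_weights = topk_scores.softmax(dim=-1)                  # mixing weights over the chosen experts
print(topk_experts)  # e.g. token 0 -> experts [5, 2], token 1 -> experts [0, 7], ...
```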

Load Balancing Tokens for MoEs

Left unchecked, the router tends to keep sending tokens to the same few favored experts, which is inefficient. To prevent this, an auxiliary loss is introduced that encourages giving all experts roughly equal importance, distributing training examples more evenly across them.
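
One common formulation is the Switch Transformers-style auxiliary loss, which multiplies, per expert, the fraction of tokens routed to it by the average router probability it receives and sums the result; it is smallest when routing is uniform. A minimal sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_index, num_experts):
    # router_logits: (num_tokens, num_experts); expert_index: (num_tokens,) chosen expert per token
    probs = F.softmax(router_logits, dim=-1)
    frac_tokens = F.one_hot(expert_index, num_experts).float().mean(dim=0)  # f_i: share of tokens per expert
    frac_probs = probs.mean(dim=0)                                          # P_i: average router probability per expert
    return num_experts * torch.sum(frac_tokens * frac_probs)                # minimized when routing is uniform

logits = torch.randn(32, 8)
aux = load_balancing_loss(logits, logits.argmax(dim=-1), num_experts=8)
```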

MoEs and Transformers

Transformers demonstrate that increasing the number of parameters enhances performance. Google explored this approach with GShard, scaling transformers to over 600 billion parameters.

GShard replaces every other FFN layer with an MoE layer, utilizing top-2 gating in both the encoder and the decoder.
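
The sketch below shows that layer pattern in a highly simplified form; the module names are made up for illustration, and the per-expert Python loop stands in for the batched dispatch kernels a real implementation would use.

```python
import torch
import torch.nn as nn

def dense_ffn(d):
    return nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

class SimpleMoE(nn.Module):
    """Toy top-2 MoE FFN: each token is processed by its 2 highest-scoring experts."""
    def __init__(self, d, num_experts=8):
        super().__init__()
        self.gate = nn.Linear(d, num_experts, bias=False)
        self.experts = nn.ModuleList([dense_ffn(d) for _ in range(num_experts)])

    def forward(self, x):                                   # x: (num_tokens, d)
        scores, idx = self.gate(x).topk(2, dim=-1)
        weights = scores.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(2):                               # naive dispatch loop
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

def ffn_stack(num_layers, d, num_experts=8):
    # GShard-style pattern: every other FFN layer becomes a sparse MoE layer.
    return nn.ModuleList(
        [SimpleMoE(d, num_experts) if i % 2 else dense_ffn(d) for i in range(num_layers)]
    )
```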

MoE Transformer Encoder from the GShard Paper

Switch Transformers

The Switch Transformers work addresses training and fine-tuning instabilities. The authors released a 1.6 trillion parameter MoE with 2048 experts on Hugging Face.

Switch Transformer layer from the Switch Transformers paper

Switch Transformers use a single-expert (top-1) routing strategy, which reduces router computation and communication costs while preserving model quality.
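
A rough sketch of this top-1 ("switch") routing with an expert capacity limit, under assumed shapes: each token goes to its single highest-probability expert, and tokens that overflow an expert's capacity are dropped (in the real model they simply pass through via the residual connection).

```python
import torch
import torch.nn.functional as F

def switch_route(router_logits, capacity):
    # router_logits: (num_tokens, num_experts)
    probs = F.softmax(router_logits, dim=-1)
    gate, expert_idx = probs.max(dim=-1)                 # top-1 expert and its probability, per token
    keep = torch.zeros_like(gate, dtype=torch.bool)
    for e in range(router_logits.shape[-1]):
        positions = (expert_idx == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True                # tokens beyond capacity are dropped
    return expert_idx, gate, keep

expert_idx, gate, keep = switch_route(torch.randn(16, 4), capacity=5)
```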

Stabilizing Training with Router Z-loss

Router z-loss, introduced in ST-MoE, significantly improves training stability without quality degradation by penalizing large logits entering the gating network.
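
In its usual form, the z-loss is the mean squared log-sum-exp of each token's router logits; a minimal sketch with assumed shapes:

```python
import torch

def router_z_loss(router_logits):
    # router_logits: (num_tokens, num_experts)
    z = torch.logsumexp(router_logits, dim=-1)   # one scalar per token
    return (z ** 2).mean()                       # small logits -> small loss -> more stable softmax

loss = router_z_loss(torch.randn(16, 8))
```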

Table from the ST-MoE paper showing which token groups were sent to which expert.

Scaling the Number of Experts

Adding more experts improves sample efficiency and training speed, but the gains diminish as the number of experts grows, and every additional expert increases the VRAM needed at inference time.

Fine-tuning MoEs

Sparse models are more prone to overfitting, so stronger regularization (e.g., higher dropout within the experts) helps. Whether to keep the auxiliary loss during fine-tuning remains an open question. Interestingly, token dropping appears to act as a form of regularization that helps mitigate overfitting.

In the small task (left), we can see clear overfitting as the sparse model does much worse in the validation set. In the larger task (right), the MoE performs well. This image is from the ST-MoE paper.

Recent experiments show MoEs may benefit more from instruction tuning than dense models.

Sparse models benefit more from instruction tuning than dense models. This image is from the MoEs Meets Instruction Tuning paper.

When to Use Sparse MoEs vs Dense Models?

Sparse models are ideal for high-throughput scenarios involving multiple machines, while dense models are preferable for low-throughput scenarios with limited VRAM.

Making MoEs Efficient

Initial MoE implementations treated the MoE layer as a branching setup, which led to slow computation because GPUs are not designed for branching. Current approaches focus on making pretraining and inference with MoEs more practical.

Parallelism Overview

  • Data parallelism: The same weights are replicated across all cores, and the data is partitioned across cores.
  • Model parallelism: The model is partitioned across cores, and the data is replicated across cores.
  • Model and data parallelism: Both the model and the data are partitioned across cores, with different cores processing different data batches.
  • Expert parallelism: Experts are distributed across different workers, with data partitioned across all cores.

Illustration from the Switch Transformers paper showing how data and models are split over cores with different parallelism techniques.

Capacity Factor and Communication Costs

Increasing the capacity factor improves quality but also increases communication costs and the memory needed for activations. A good starting point is top-2 routing with a capacity factor of 1.25.
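
Concretely, expert capacity is typically computed as the even per-expert share of the batch scaled by the capacity factor; a quick worked example with assumed batch and expert counts:

```python
import math

tokens_per_batch = 4096
num_experts = 8
capacity_factor = 1.25                 # the starting point suggested above
expert_capacity = math.ceil(tokens_per_batch / num_experts * capacity_factor)
print(expert_capacity)                 # 640 tokens per expert: 25% headroom over a perfectly even split
```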

Serving Techniques

To handle the large number of parameters in MoEs:

  • Distillation: Distilling an MoE back to its dense counterpart retains some of the sparsity gains.
  • Routing Adjustments: Modify routing to assign full sentences or tasks to an expert.
  • Aggregation of Experts: Merge weights of the experts, reducing the number of parameters needed at inference time.

More on Efficient Training

  • FasterMoE (March 2022): Analyzes the performance of MoEs in distributed systems, examining the effects of skewed expert popularity and fine-grained communication schedules.
  • MegaBlocks (Nov 2022): Provides new GPU kernels for efficient sparse pretraining by expressing MoE layers as block-sparse operations.

Block-sparse matrix multiplication for differently sized experts and numbers of tokens (from MegaBlocks).

Open Source MoEs

There are several open source projects to train MoEs:

Released open access MoEs:

Exciting directions of work:

  • Further distilling Mixtral into a dense model
  • Exploring model merging techniques of the experts
  • Applying extreme quantization techniques to Mixtral