MoE (Mixture of Experts) LLMs - Full explanation
What are Mixture of Experts (MoE) LLMs?
The scale of a model significantly influences its quality. With a fixed computing budget, it is more effective to train a larger model for fewer steps than to train a smaller model for more steps.
Mixture of Experts (MoE) models allow for pretraining with much less compute, enabling significant scaling of the model or dataset size within the same compute budget as a dense model. Specifically, an MoE model can achieve the same quality as its dense counterpart much faster during pretraining.
In the context of transformer models, an MoE consists of two primary components (a minimal code sketch follows the list below):
- Sparse MoE layers: These replace the dense feed-forward network (FFN) layers and comprise multiple "experts" (e.g., 8), each being a neural network, in practice usually an FFN itself.
- Gate network or router: This component determines which tokens are sent to which expert.
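To make these two components concrete, here is a minimal PyTorch sketch of a sparse MoE layer with top-2 routing. The class name, dimensions, and the simple per-expert loop are illustrative assumptions, not the implementation of any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative sparse MoE layer: a router picks top-k experts per token."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        # each expert is an ordinary FFN; only top_k of them run for a given token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # the gate network
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, d_model)
        logits = self.router(x)                  # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (indices == i)                # which tokens picked this expert, and in which slot
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() > 0:
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Real implementations avoid the Python loop by batching tokens per expert, but the routing logic is the same: the router scores every expert, only the top-k expert FFNs are evaluated, and their outputs are combined with the gate weights.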
While MoEs offer benefits like efficient pretraining and faster inference, they also present challenges:
- Training: MoEs allow for much more compute-efficient pretraining but have historically struggled to generalize during fine-tuning, often leading to overfitting.
- Inference: Only some parameters are used per token during inference, resulting in much faster inference than a dense model with the same total number of parameters. However, all parameters still need to be loaded into RAM, leading to high memory requirements; the rough arithmetic below illustrates the gap.
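As a rough illustration of that memory-versus-compute gap, the numbers below are hypothetical (chosen only to be MoE-sized) and show how many expert parameters must be loaded versus how many are actually used for a single token with 8 experts and top-2 routing:

```python
# Hypothetical layer sizes, purely to illustrate loaded vs. active parameters.
d_model, d_ff, num_experts, top_k = 4096, 14336, 8, 2

ffn_params = 2 * d_model * d_ff                   # one expert FFN (up- and down-projection)
total_expert_params = num_experts * ffn_params    # must all sit in RAM/VRAM
active_expert_params = top_k * ffn_params         # actually used for a given token

print(f"loaded: {total_expert_params / 1e6:.0f}M parameters per MoE layer")
print(f"active: {active_expert_params / 1e6:.0f}M parameters per MoE layer")
# loaded: 940M parameters per MoE layer
# active: 235M parameters per MoE layer
```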
The History of MoE
The concept of MoEs originated from the 1991 paper "Adaptive Mixtures of Local Experts." The idea, similar to ensemble methods, involved a supervised procedure for a system of separate networks, each handling a different subset of the training data.
Between 2010 and 2015, two research areas significantly contributed to the advancement of MoEs:
- Experts as components: MoEs as layers in a multilayer network, enabling models to be both large and efficient.
- Conditional computation: Methods to dynamically activate or deactivate components based on the input token.
MoEs have enabled the training of multi-trillion-parameter models, such as the open-sourced 1.6-trillion-parameter Switch Transformers, among others. MoEs have also been explored in Computer Vision, but this post focuses on the NLP domain.
What is Sparsity?
Sparsity leverages conditional computation, where only specific parts of the system are activated for each input, unlike dense models that use all parameters for every input.
This setup introduces challenges, such as uneven batch sizes across experts and underutilization of some of them. However, by routing each token to only a small number of experts k (e.g., one or two), we can train and run inference much faster.
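The classic sparsely-gated MoE formulation makes this conditional computation explicit. In the notation below (following that literature), E_i are the experts, W_g is the router weight matrix, and only the k experts surviving the top-k selection receive non-zero gate values:

```latex
y = \sum_{i=1}^{n} G(x)_i \, E_i(x),
\qquad
G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(x \cdot W_g,\ k)\big),
\qquad
\mathrm{KeepTopK}(v, k)_i =
\begin{cases}
  v_i & \text{if } v_i \text{ is among the top } k \text{ entries of } v,\\
  -\infty & \text{otherwise.}
\end{cases}
```

Because G(x)_i is zero for every expert that is not selected, those experts' forward passes E_i(x) never need to be computed.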
Load Balancing Tokens for MoEs
To prevent inefficiency from tokens being directed to just a few experts, an auxiliary loss is introduced to ensure all experts are given equal importance. This loss helps distribute training examples more evenly across all experts.
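A sketch of such an auxiliary loss, in the style of the Switch Transformers load-balancing loss; the function name, the top-1 simplification, and the default weight alpha are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_index, num_experts, alpha=0.01):
    """Encourage a uniform token-to-expert distribution (Switch-style auxiliary loss)."""
    # router_probs: (num_tokens, num_experts) softmax output of the gate
    # expert_index: (num_tokens,) expert chosen for each token (top-1 shown for simplicity)
    one_hot = F.one_hot(expert_index, num_experts).float()
    tokens_per_expert = one_hot.mean(dim=0)            # f_i: fraction of tokens sent to expert i
    router_prob_per_expert = router_probs.mean(dim=0)  # P_i: mean gate probability for expert i
    # the product sum is minimized when both distributions are uniform (1 / num_experts)
    return alpha * num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)
```

This term is added to the model's main loss during pretraining, nudging the router away from collapsing onto a handful of popular experts.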
MoEs and Transformers
Transformers demonstrate that increasing the number of parameters enhances performance. Google explored this approach with GShard, scaling transformers to over 600 billion parameters.
GShard replaces every other FFN layer with an MoE layer, utilizing top-2 gating in both the encoder and the decoder.
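Schematically, that placement can be expressed as alternating dense and sparse FFN sub-layers. The sketch below reuses the illustrative MoELayer class from earlier and is not GShard's actual implementation:

```python
import torch.nn as nn

def build_ffn_stack(num_layers, d_model=512, d_ff=2048):
    """Alternate ordinary FFN sub-layers with sparse MoE sub-layers."""
    def dense_ffn():
        return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    layers = []
    for i in range(num_layers):
        if i % 2 == 1:
            # every other FFN position becomes an MoE layer with top-2 gating
            layers.append(MoELayer(d_model=d_model, d_ff=d_ff, num_experts=8, top_k=2))
        else:
            layers.append(dense_ffn())
    return nn.ModuleList(layers)
```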
Switch Transformers
The Switch Transformers work addresses training and fine-tuning instabilities. The authors released a 1.6 trillion parameter MoE with 2048 experts on Hugging Face.
Switch Transformers use a single-expert (top-1) routing strategy, reducing router computation and communication costs while preserving model quality.
Stabilizing Training with Router Z-loss
Router z-loss, introduced in ST-MoE, significantly improves training stability without quality degradation by penalizing large logits entering the gating network.
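A minimal sketch of this term as defined in ST-MoE (the function name here is an assumption; the quantity is the mean squared log-sum-exp of the router logits):

```python
import torch

def router_z_loss(router_logits):
    # router_logits: (num_tokens, num_experts), pre-softmax outputs of the gate network
    # penalizing the squared log-sum-exp keeps the logits small,
    # which keeps the router softmax numerically stable during training
    z = torch.logsumexp(router_logits, dim=-1)
    return torch.mean(z ** 2)
```

Like the load-balancing loss, it is added to the main training loss with a small coefficient.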
Scaling the Number of Experts
More experts improve sample efficiency and training speed, but the gains diminish as the expert count grows, and every additional expert increases the VRAM needed for inference.
Fine-tuning MoEs
Sparse models are more susceptible to overfitting, so they benefit from stronger regularization (e.g., higher dropout within the experts). An open question is whether to keep the auxiliary loss during fine-tuning. Token dropping appears to act as a form of regularization that helps mitigate overfitting.
Recent experiments show MoEs may benefit more from instruction tuning than dense models.
When to Use Sparse MoEs vs Dense Models?
Sparse models are ideal for high-throughput scenarios involving multiple machines, while dense models are preferable for low-throughput scenarios with limited VRAM.
Making MoEs Efficient
Initial MoE implementations introduced MoE layers as a branching setup, which led to slow computation because GPUs are not well suited to conditional branching. Current approaches focus on making pretraining and inference more practical.
Parallelism Overview
- Data parallelism: The same weights are replicated across all cores, and the data is partitioned across cores.
- Model parallelism: The model is partitioned across cores, and the data is replicated across cores.
- Model and data parallelism: Both the model and the data are partitioned across cores, with different cores processing different data batches.
- Expert parallelism: Experts are distributed across different workers, with data partitioned across all cores.
Capacity Factor and Communication Costs
Increasing the capacity factor improves quality but also increases communication costs and the memory needed for activations. A good starting point is top-2 routing with a 1.25 capacity factor.
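Concretely, expert capacity is commonly computed as the tokens per batch divided by the number of experts, scaled by the capacity factor; the small helper below is an illustrative sketch of that definition:

```python
def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    # maximum number of tokens each expert may process; overflow tokens are dropped
    # (they skip the expert and pass through the residual connection instead)
    return int((tokens_per_batch / num_experts) * capacity_factor)

# e.g. a batch of 4096 tokens routed over 8 experts:
# capacity_factor 1.0  -> 512 tokens per expert
# capacity_factor 1.25 -> 640 tokens per expert
```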
Serving Techniques
To handle the large number of parameters in MoEs:
- Distillation: Distilling an MoE back to its dense counterpart retains some of the sparsity gains.
- Routing Adjustments: Modify routing to assign full sentences or tasks to an expert.
- Aggregation of Experts: Merge weights of the experts, reducing the number of parameters needed at inference time.
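As an illustration of the simplest possible aggregation, the hypothetical helper below takes a uniform weight average over experts that share the same architecture; real merging methods are more elaborate than this:

```python
import copy
import torch

def average_experts(experts):
    """Naively merge a list of expert FFNs (identical architecture) into one dense FFN."""
    merged_state = {k: torch.zeros_like(v) for k, v in experts[0].state_dict().items()}
    for expert in experts:
        for name, param in expert.state_dict().items():
            merged_state[name] += param / len(experts)   # uniform average of each weight tensor
    dense_ffn = copy.deepcopy(experts[0])
    dense_ffn.load_state_dict(merged_state)
    return dense_ffn                                     # served in place of the full expert set
```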
More on Efficient Training
- FasterMoE (March 2022): Analyzes the performance of MoEs in distributed systems, covering techniques to handle skewed expert popularity and fine-grained communication schedules that reduce latency.
- Megablocks (Nov 2022): Provides new GPU kernels for efficient sparse pretraining, mapping MoE layers as block-sparse operations.
Open Source MoEs
There are several open source projects for training MoEs; Megablocks, discussed above, is one example.
Released open access MoEs include Google's Switch Transformers and Mistral AI's Mixtral 8x7B.
Exciting directions of work:
- Further distilling Mixtral into a dense model
- Exploring model merging techniques of the experts
- Applying extreme quantization techniques to Mixtral