1. Quantization: Quantization reduces the number of bits needed to store model weights by binning floating-point values into lower-precision buckets. This reduces memory usage with minimal impact on performance; small precision losses are acceptable as long as the model’s performance stays within the required levels. For instance, a weight value like 3.1457898 could be quantized to 3.1458 using a scheme that retains four decimal places. Such a scheme introduces slight errors when computing the loss or updating weights (for example, a higher margin of error during the backward pass of a training step). 4-bit quantization illustrates how binning can follow the distribution of the model weights: most weights cluster near zero, where minor differences require higher precision, while fewer weights take on larger values. To accommodate this, the 4-bit float representation uses asymmetric binning, allocating many small bins near the mean to maintain precision and fewer, larger bins for outliers further from the mean.
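As a rough illustration of this density-aware binning, the sketch below quantizes a weight array to 16 levels (4 bits) placed at the empirical quantiles of the weights, so regions where weights are dense get closely spaced levels. It is a minimal sketch of the idea rather than the actual NF4 format, and for simplicity it stores each 4-bit code in a full byte instead of packing two codes per byte.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Quantile-based 4-bit quantization: 16 levels placed where the
    weights are densest, so values near the mean keep more precision."""
    probs = np.linspace(0, 1, 18)[1:-1]            # 16 interior quantile points
    codebook = np.quantile(weights, probs)         # dense regions -> closely spaced levels
    # map every weight to the index of its nearest codebook entry (a 4-bit code)
    codes = np.abs(weights[:, None] - codebook).argmin(axis=-1).astype(np.uint8)
    return codes, codebook

def dequantize_4bit(codes: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from the 4-bit codes."""
    return codebook[codes]

# Toy weights: most values cluster near zero, a few are much larger
w = np.concatenate([np.random.normal(0, 0.02, 1000), np.random.normal(0, 0.5, 20)])
codes, codebook = quantize_4bit(w)
w_hat = dequantize_4bit(codes, codebook)
print("max abs error:", np.abs(w - w_hat).max())
```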
2. Mixed precision: This is another technique to reduce memory and computational demands without sacrificing significant accuracy. It combines different numerical formats, such as float32, float16, and int8, using lower precision where it is safe to do so and keeping sensitive operations in higher precision during training or inference.
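As a concrete example, PyTorch ships automatic mixed precision: an autocast context runs matrix multiplications in float16 while sensitive operations remain in float32, and a gradient scaler rescales the loss so small float16 gradients do not underflow. The loop below is a minimal sketch assuming a CUDA GPU; the model and loss are toys chosen only for illustration.

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # guards against float16 gradient underflow

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    # matmuls inside the autocast region run in float16; reductions stay in float32
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()         # toy objective
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscales gradients, then updates weights
    scaler.update()
```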
3. Data efficiency: Large datasets are costly to process, and redundant or noisy data can negatively impact model performance. Therefore, data efficiency techniques can be applied to achieve high model accuracy and generalization with a reduced or optimized dataset. This process includes filtering data for quality, reducing redundancy, and applying sampling techniques to emphasize high-value samples.
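As a toy illustration of such a pipeline, the sketch below drops very short documents and removes exact duplicates via content hashing. The length threshold and the hash-based deduplication are illustrative choices; real pipelines typically add near-duplicate detection and richer quality heuristics.

```python
import hashlib

def deduplicate_and_filter(texts, min_chars=200):
    """Toy data-efficiency pass: discard very short documents and
    drop exact duplicates identified by a content hash."""
    seen, kept = set(), []
    for text in texts:
        if len(text) < min_chars:              # crude quality filter
            continue
        digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if digest in seen:                     # an identical document was already kept
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```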
4. Sparse attention: Instead of computing attention weights for every pair of tokens in the input sequence, sparse attention focuses only on a subset of tokens, exploiting patterns in the data or task-specific properties. To put things into perspective, consider decoder-only architectures like GPT trained with an autoregressive language-modeling objective. Such an objective constrains the attention layer to be causal, so only the lower-triangular part of the attention matrix is actually used (yet a naive implementation still computes the whole matrix). Different architectures leverage specific patterns, such as local or strided attention mechanisms, to reduce computation time.
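The snippet below sketches one such pattern, combining a causal constraint with a sliding-window (local) mask in a toy single-head attention computation. Note that this naive version still materializes the full score matrix and merely masks it; actual sparse-attention kernels avoid computing the masked entries altogether. The window size and tensor shapes are illustrative.

```python
import torch

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to positions j with i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# toy single-head attention over a short sequence
seq_len, d = 8, 16
q, k, v = (torch.randn(seq_len, d) for _ in range(3))

scores = q @ k.T / d**0.5
scores = scores.masked_fill(~local_causal_mask(seq_len, window=3), float("-inf"))
out = scores.softmax(dim=-1) @ v    # each token mixes only its last few visible tokens
```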
5. Flash attention: Flash attention takes the route of hardware-aware improvements and efficiencies to compute attention scores. Two techniques underpin flash attention: kernel fusion and tiling.
Kernel fusion reduces the number of memory I/O operations by combining multiple steps (elementwise operations, matrix multiplication, softmax, etc.) into a single GPU kernel, so intermediate results stay in fast on-chip memory rather than being written to and re-read from slower device memory between steps. This technique is particularly effective during inference.
Tiling, on the other hand, breaks the overall attention calculation into smaller, manageable groups of operations that fit into fast, low-latency on-chip GPU memory. For instance, instead of computing softmax across the entire attention matrix at once, FlashAttention computes it over smaller chunks in a numerically stable, tiled fashion, thus making use of faster memory without the need to store a large matrix.
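The sketch below illustrates the tiling idea with an online (streaming) softmax for a single query vector: keys and values are processed tile by tile while running statistics keep the computation numerically stable, so the full vector of attention scores is never materialized. This is a didactic approximation of what the FlashAttention kernel does per block in on-chip memory, not the kernel itself. In practice, PyTorch's torch.nn.functional.scaled_dot_product_attention can dispatch to a fused FlashAttention kernel when one is available.

```python
import torch

def tiled_attention(q, k, v, tile_size=128):
    """Single-query attention via an online softmax over key/value tiles."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))     # running maximum of the scores (for stability)
    l = torch.tensor(0.0)               # running sum of exp(score - m)
    acc = torch.zeros_like(v[0])        # running weighted sum of value rows

    for start in range(0, k.shape[0], tile_size):
        k_t, v_t = k[start:start + tile_size], v[start:start + tile_size]
        s = (k_t @ q) / d**0.5                   # scores for this tile only
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)             # rescale previously accumulated results
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_t
        m = m_new
    return acc / l

q = torch.randn(64)
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
out = tiled_attention(q, k, v)
ref = torch.softmax((k @ q) / 64**0.5, dim=0) @ v   # untiled reference
print(torch.allclose(out, ref, atol=1e-5))          # matches up to numerical error
```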
6. Mixture of Experts (MoE) architecture: MoE is an advanced architecture designed to activate only a subset of components (or experts) rather than the whole network for each input, thereby achieving higher scalability and efficiency. The Experts in this architecture are independent modules or blocks of the network, each of which can be trained to specialize in a specific task. The Router is a module that learns to select which experts to activate for a given input based on different criteria; the Router itself can be a neural network.
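The minimal layer below sketches token-level top-k routing: a small linear Router scores the Experts, each token is processed only by its top two experts, and their outputs are combined using the softmaxed router weights. The sizes, the number of experts, and the routing rule are illustrative, and practical concerns such as load balancing are omitted.

```python
import torch
from torch import nn

class TinyMoE(nn.Module):
    """Toy Mixture-of-Experts layer with a learned top-k router."""
    def __init__(self, d_model=64, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # the router is itself a tiny network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)   # pick top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e                # tokens routed to expert e in this slot
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out

layer = TinyMoE()
y = layer(torch.randn(10, 64))   # only 2 of the 4 experts run for any given token
```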
7. Efficient architectures: A number of patterns and techniques have been developed over the years and incorporated into more efficient architectures. Some of the popular ones are Linformer, Reformer, and Big Bird.
Apart from pre-training optimizations, there are other techniques as well, such as fine-tuning and inference-time improvements. More recently, the availability and popularity of small language models, along with specialized hardware and frameworks, have also contributed to significant efficiency gains for LLMs in resource-constrained environments.