Microsoft released an upgraded version of its DeepSpeed library for training highly complex AI models. The company claims that the new approach being used, known as 3D parallelism, is able to adapt to the varying needs of workload requirements to power very large models while balancing scaling efficiency.
Massive AI models comprising of billions of parameters have made it possible to improve efficiency in various domains. According to studies, such models are excellent performers because they can absorb the nuances of language, grammar, concepts, and context. This enables them to perform a range of tasks from summarizing speeches to generating code by browsing GitHub.
However, these training models require massive computational resources. A 2018 OpenAI analysis shows that between 2012 and 2018, the amount of computing power used in the largest AI training runs grew more than 300,000 times, which means it has doubled almost every 3.5 months.
This is where DeepSpeed comes in, as it is capable of training AI models consisting of up to trillions of parameters. It mainly utilizes three techniques to this end: data parallel training, model parallel training, and pipeline parallel training.
Now, conventionally, training a trillion-parameter model would require the memory storage of at least 400 NVIDIA A100 GPUs, each of which has a capacity of 40GB. In fact, Microsoft estimates that 4000 A100s running at 50% capacity for 100 days would complete the training for such a complex model.
DeepSpeed is able to make this process much simpler by dividing larger models into smaller components between four pipeline stages. Layers within each pipeline stage are further divided between four “worker” components, which do the actual job of training. Moreover, each pipeline is replicated across two data-parallel instances, and the workers are mapped to multi-GPU systems. These innovative techniques, and other performance improvements, allow DeepSpeed to train a trillion-parameter model across as few as 800 NVIDIA V100 GPUs.
“These [new techniques in DeepSpeed] offer extreme compute, memory, and communication efficiency, and they power model training with billions to trillions of parameters,” Microsoft wrote in a blog post. “The technologies also allow for extremely long input sequences and power on hardware systems with a single GPU, high-end clusters with thousands of GPUs, or low-end clusters with very slow ethernet networks … We [continue] to innovate at a fast rate, pushing the boundaries of speed and scale for deep learning training.”