Machine learning compute infrastructures primarily cater to organisations looking to build infrastructure stacks on-premises. Six core capabilities are needed in compute infrastructures to enable high-productivity pipelines involving compute-intensive techniques and deep neural network (DNN) models.
Compute acceleration technologies such as graphics processing units (GPUs) and application-specific integrated circuits (ASICs) can dramatically reduce training and inference time in workloads involving compute-intensive techniques and DNNs. Accelerators should be selected to match application needs, and frameworks must be configured for those specific accelerators to exploit their capabilities.
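As a minimal illustration of configuring a framework for a specific accelerator, the sketch below assumes PyTorch (the source does not prescribe a framework at this point) and selects a CUDA device when one is available, falling back to the CPU otherwise:

```python
# Hedged sketch: making a DNN framework use the accelerator it was built for.
# PyTorch is an assumption here; the fallback keeps the snippet runnable on
# machines with no GPU, or with no PyTorch installed at all.
try:
    import torch
    # Prefer an Nvidia GPU when the CUDA runtime and driver are present.
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # framework absent; a real pipeline would fail loudly here

print(device)
```

In practice, models and tensors are then moved to the selected device explicitly (e.g. `model.to(device)` in PyTorch), which is what "configuring the framework for the accelerator" amounts to at the code level.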
While there are diverse accelerator technologies in this market, including the NEC Aurora Vector Engine, AMD GPUs and Nvidia GPUs, only a few have wide support across compute-intensive and DNN frameworks. Currently, the DNN training ecosystem is dominated by Nvidia GPUs, whose high-performance hardware offers unique capabilities such as tensor cores and NVLink. There is also a high degree of software integration, from low-level libraries all the way up to frameworks.
Compute-intensive and DNN frameworks are scale-up-oriented. A higher number of accelerators in each compute node can dramatically reduce training times for large DNNs. Compute platforms addressing this market feature a high degree of variance in accelerator densities. Most suppliers support four accelerators per compute node, while performance-oriented configurations feature eight accelerators per compute node. In GPU-accelerated compute systems, some vendors offer 16-GPU compute nodes.
While the most common approach to scaling in compute-intensive and DNN frameworks tends to be scale-up-oriented, early adopters are also pursuing scale-out strategies. Uber's Horovod enables distributed training for DNN frameworks such as TensorFlow and PyTorch. IBM's Distributed Deep Learning and Elastic Distributed Training are also designed to deliver scale-out capability as model size and complexity grow.
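The core idea behind scale-out tools such as Horovod can be sketched conceptually: each worker computes gradients on its own data shard, then an allreduce averages the gradients so every worker applies the same update. The plain-Python sketch below illustrates only the averaging step, not Horovod's actual API:

```python
# Conceptual sketch of data-parallel gradient averaging, the operation that
# distributed training tools such as Horovod perform via allreduce. The
# function and data here are illustrative, not a real framework call.
def allreduce_average(worker_grads):
    """Average per-worker gradient vectors element-wise."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]

# Three workers, each holding a gradient for two model parameters.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = allreduce_average(grads)
print(avg)  # [3.0, 4.0]
```

Because every worker ends up with the same averaged gradient, the model replicas stay synchronised after each update step, which is what lets scale-out training behave like one large-batch run.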
Nvidia’s Collective Communications Library (NCCL) also provides multi-GPU and multi-node scaling foundations for DNN frameworks. When selecting scale-out strategies, it is best to select solutions that are pre-optimised, easy to deploy and minimise total cost of ownership.
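NCCL's multi-GPU scaling rests on collective operations such as ring allreduce. The toy simulation below (plain Python, not the NCCL API) shows the naive scalar version of the ring pattern: each device forwards its accumulator to its neighbour, which adds its own value, so after N-1 passes every device holds the global sum. Real NCCL additionally segments tensors into chunks (reduce-scatter plus allgather) to be bandwidth-optimal:

```python
# Toy simulation of a ring allreduce over scalars, one value per "device".
# This is an illustrative sketch of the communication pattern, not NCCL code.
def ring_allreduce(values):
    """After n-1 neighbour-to-neighbour passes, every device holds the sum."""
    n = len(values)
    acc = list(values)  # each device starts with its own value
    for _ in range(n - 1):
        # Device i receives its predecessor's accumulator and adds its own value.
        acc = [acc[(i - 1) % n] + values[i] for i in range(n)]
    return acc

print(ring_allreduce([1.0, 2.0, 3.0]))  # [6.0, 6.0, 6.0]
```

The ring pattern matters because each device only ever talks to its immediate neighbours, so per-device bandwidth stays constant as the number of GPUs grows.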
Because of the high density of accelerators, the manner in which the accelerators are connected to the compute node and how the compute node components interplay with accelerators can dramatically affect performance in compute-intensive and DNN-based workloads.[…]