Machine learning compute infrastructures are primarily catered towards organisations looking to build infrastructure stacks on-premise.
Machine learning compute infrastructures are primarily catered towards organisations looking to build infrastructure stacks on-premise. There are six core capabilities needed in machine learning compute infrastructures to enable high-productivity artificial intelligence (AI) pipelines involving compute-intensive machine learning and deep neural network (DNN) models.
Compute acceleration technologies such as graphics processing units (GPUs) and application-specific integrated circuits (ASICs) can dramatically reduce training and inference time in AI workloads involving compute-intensive machine learning techniques and DNNs. Accelerators should be picked to match application needs, and frameworks must be configured for those specific accelerators to use their capabilities.
While there are diverse accelerator technologies in this market, including NEC Aurora Vector Engine, AMD GPUs and Nvidia GPUs, only a few of them have wide support for machine learning and DNN frameworks. Currently, the DNN training ecosystem is dominated by Nvidia GPUs because high-performance hardware can utilise unique capabilities such as tensor cores and NVLink. There is also a high degree of software integration all the way from libraries to frameworks.
Compute-intensive machine learning and DNN frameworks are scale-up-oriented. A higher number of accelerators in each compute node can dramatically reduce training times for large DNNs. Compute platforms addressing this market feature a high degree of variance in accelerator densities. Most suppliers support four accelerators per compute node, while performance-oriented configurations feature eight accelerators per compute node. In GPU-accelerated compute system s, some vendors offer 16 GPU compute nodes.
While the most common approach to scaling in compute-intensive machine learning and DNN frameworks tends to be scale-up-oriented, early adopters are also curating scale-out strategies. Uber’s Horovod enables distributed deep learning for DNN frameworks such as TensorFlow and PyTorch. IBM’s Distributed Deep Learning and Elastic Distributed Training are also designed to deliver scale-out capability when model size and complexity grow.
Nvidia’s Collection Communications Libraries (NCCL) also enable multi-GPU and multi-node scaling foundations for DNN frameworks. When selecting scale-out strategies, it is best to select solutions that are pre-optimised, easy to deploy and minimise total cost of ownership.
Because of the high density of accelerators, the manner in which the accelerators are connected to the compute node and how the compute node components interplay with accelerators can dramatically affect performance in compute-intensive machine learning- and DNN-based workloads.[…]