MIT system ‘learns’ how to optimally allocate workloads across thousands of servers to cut costs, save energy.
A novel system developed by MIT researchers automatically ‘learns’ how to schedule data-processing operations across thousands of servers — a task traditionally reserved for imprecise, human-designed algorithms. Doing so could help today’s power-hungry data centres run far more efficiently.
Data centres can contain tens of thousands of servers, which constantly run data-processing tasks from developers and users. Cluster scheduling algorithms allocate the incoming tasks across the servers, in real time, to efficiently utilise all available computing resources and get jobs done fast.
Traditionally, however, humans fine-tune those scheduling algorithms, based on some basic guidelines (‘policies’) and various trade-offs.
Code algorithm to get certain jobs done quickly
They may, for instance, code the algorithm to get certain jobs done quickly or to split resources equally between jobs. But workloads, meaning groups of combined tasks, come in all sizes.
Therefore, it’s virtually impossible for humans to optimise their scheduling algorithms for specific workloads and, as a result, the algorithms often fall short of their true efficiency potential.
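The trade-off between those two hand-coded policies can be illustrated with a minimal Python sketch. It assumes a single server and three made-up job durations; the function names are hypothetical and are not taken from the researchers' system or from any real scheduler.

```python
from typing import List


def run_to_completion(durations: List[float]) -> List[float]:
    """Finish times when jobs run one at a time, in the order given."""
    clock = 0.0
    finish_times = []
    for duration in durations:
        clock += duration
        finish_times.append(clock)
    return finish_times


def equal_share(durations: List[float]) -> List[float]:
    """Finish times when all unfinished jobs split the server equally."""
    clock = 0.0
    served = 0.0  # work each unfinished job has received so far
    finish_times = []
    ordered = sorted(durations)  # under equal sharing, shorter jobs finish first
    for index, duration in enumerate(ordered):
        active = len(ordered) - index  # jobs still sharing the server
        clock += (duration - served) * active
        served = duration
        finish_times.append(clock)
    return finish_times


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


jobs = [8.0, 1.0, 3.0]  # made-up durations, listed in arrival order

arrival_order = mean(run_to_completion(jobs))
short_jobs_first = mean(run_to_completion(sorted(jobs)))
shared_equally = mean(equal_share(jobs))

print(f"arrival order:    {arrival_order:.2f}")
print(f"short jobs first: {short_jobs_first:.2f}")
print(f"shared equally:   {shared_equally:.2f}")
```

On this toy workload, running short jobs first gives the lowest average completion time and equal sharing sits in between. The ranking holds only for this single-server setting and this one objective, which is part of why a fixed, hand-tuned rule is hard to get right across real workloads.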
The MIT researchers instead offloaded all of the manual coding to machines. In a paper being presented at SIGCOMM, they describe a system that leverages ‘reinforcement learning’ (RL), a trial-and-error machine-learning technique, to tailor scheduling decisions to specific workloads in specific server clusters.
To do so, they built novel RL techniques that could train on complex workloads. In training, the system tries many possible ways to allocate incoming workloads across the servers, eventually finding an optimal trade-off between utilising computation resources and processing jobs quickly. No human intervention is required beyond a simple instruction such as ‘minimise job-completion times’.
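The trial-and-error idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the researchers' method: it uses a basic epsilon-greedy learner that picks among three hypothetical job orderings on one simulated server, with randomly generated job durations. The only instruction it receives is the objective, here the negative of the average job-completion time. All names and parameters are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple


def average_completion_time(ordered: List[float]) -> float:
    """Average finish time when jobs run one after another in this order."""
    clock, total = 0.0, 0.0
    for duration in ordered:
        clock += duration
        total += clock
    return total / len(ordered)


# Hypothetical candidate strategies the learner can try.
STRATEGIES: Dict[str, Callable[[List[float]], List[float]]] = {
    "arrival_order": lambda jobs: list(jobs),
    "shortest_first": lambda jobs: sorted(jobs),
    "longest_first": lambda jobs: sorted(jobs, reverse=True),
}


def learn_strategy(
    trials: int = 5000, explore: float = 0.2, seed: int = 0
) -> Tuple[str, Dict[str, float], Dict[str, int]]:
    """Epsilon-greedy trial and error over the candidate strategies."""
    rng = random.Random(seed)
    names = list(STRATEGIES)
    # Rewards are negative, so a starting estimate of 0 makes every
    # untried strategy look attractive until it has been sampled.
    estimate = {name: 0.0 for name in names}
    pulls = {name: 0 for name in names}
    for _ in range(trials):
        # A made-up workload: five jobs with random durations.
        workload = [rng.uniform(1.0, 10.0) for _ in range(5)]
        if rng.random() < explore:
            choice = rng.choice(names)  # explore: try something at random
        else:
            choice = max(names, key=lambda n: estimate[n])  # exploit
        # The single instruction: minimise job-completion times.
        reward = -average_completion_time(STRATEGIES[choice](workload))
        pulls[choice] += 1
        # Running average of the rewards observed for this strategy.
        estimate[choice] += (reward - estimate[choice]) / pulls[choice]
    best = max(names, key=lambda n: estimate[n])
    return best, estimate, pulls


best, estimate, pulls = learn_strategy()
print("preferred strategy:", best)
```

The learner is never told which ordering is good; it settles on one only by comparing the rewards it observes. A production scheduler faces a far larger space of decisions than three fixed orderings, which is where the more sophisticated RL techniques described in the article come in.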
Compared to the best handwritten scheduling algorithms, the researchers’ system completes jobs about 20 to 30 per cent faster, and twice as fast during high-traffic times.
Mostly, however, the system learns how to compact workloads efficiently to leave little waste. Results indicate the system could enable data centres to handle the same workload at higher speeds, using fewer resources.
‘Automatically figure out which strategy is better than others’
“If you have a way of doing trial and error using machines, they can try different ways of scheduling jobs and automatically figure out which strategy is better than others,” says Hongzi Mao, a PhD student in the Department of Electrical Engineering and Computer Science (EECS).
“That can improve the system performance automatically. And any slight improvement in utilisation, even one per cent, can save millions of dollars and a lot of energy in data centres.”[…]