Machine Learning at Pace: Optimization Code Boosts Performance by 5x


Technology developed by a KAUST-led collaboration with Microsoft, Intel, and the University of Washington can dramatically enhance the speed of machine learning on parallelized computing systems. Source: © 2021 KAUST; Anastasia Serin

Optimizing the network interface expedites the training of large-scale machine learning models.

Introducing lightweight optimization code in high-speed network devices has enabled a KAUST-led collaboration to boost the speed of machine learning on parallelized computing systems five-fold.

This “in-network aggregation” technology, developed with researchers and systems architects at Microsoft, Intel, and the University of Washington, can provide dramatic speed improvements using readily available programmable network hardware.

Much of the power of artificial intelligence (AI) to understand and interact with the world comes from the machine-learning step, in which the model is trained using massive collections of labeled training data. The more data the AI is trained on, the better the model is expected to perform when exposed to new inputs.

The recent burst of AI applications is largely due to better machine learning and the use of larger models and more diverse datasets. However, performing machine-learning computations is an enormously demanding task that depends on vast arrays of computers running the learning algorithm in parallel.

“The most challenging problem is to train deep-learning models at a large scale,” says Marco Canini from the KAUST study team. “The AI models can comprise millions of parameters, and we can use hundreds of processors that need to work efficiently in parallel. In such systems, communication among processors during incremental model updates quickly becomes the main performance bottleneck.”

The team found a possible solution in new network technology developed by Barefoot Networks, a division of Intel.

“We use Barefoot Networks’ new programmable dataplane networking hardware to offload part of the work performed during distributed machine-learning training,” says Amedeo Sapio, a KAUST alumnus who has since joined the Barefoot Networks team at Intel. “Using this new programmable networking hardware, rather than just the network, to move data means that we can perform computations along the network paths.”

The key innovation of the team’s SwitchML platform is to let the network hardware perform the data aggregation task at each synchronization step during the model update phase of the machine-learning process. Not only does this offload part of the computational load, it also significantly reduces the amount of data transmitted.
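To make the idea concrete, here is a minimal Python sketch of what aggregating model updates inside the network could look like. It is an illustration only, not SwitchML's actual code: the function names are hypothetical, and the "no aggregation" case is a deliberately naive baseline in which every worker receives every other worker's full update, not the state-of-the-art method the team compared against.

```python
from typing import List


def aggregate_in_network(worker_updates: List[List[float]]) -> List[float]:
    """Sum the workers' update vectors element by element.

    This stands in for the work a switch would do at one synchronization step.
    """
    if not worker_updates:
        raise ValueError("need at least one worker update")
    length = len(worker_updates[0])
    if any(len(update) != length for update in worker_updates):
        raise ValueError("all updates must have the same length")
    return [sum(values) for values in zip(*worker_updates)]


def values_received_per_worker(num_workers: int, vector_len: int,
                               in_network: bool) -> int:
    """Count the values one worker must receive to learn the summed update.

    With in-network aggregation it receives a single, already-summed vector.
    In the naive baseline (an assumption for illustration) it receives the
    full vector of every other worker and sums them itself.
    """
    if in_network:
        return vector_len
    return (num_workers - 1) * vector_len


# Three hypothetical workers, each holding a three-element update.
updates = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [2.0, 0.0, -1.0]]
total = aggregate_in_network(updates)
```

In this toy model, the data each worker receives stays constant as workers are added when the sum is computed in the network, whereas in the naive baseline it grows with the number of workers.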

“Although the programmable switch dataplane can perform operations very quickly, the operations it can perform are limited,” says Canini. “So our solution had to be simple enough for the hardware and yet flexible enough to solve challenges such as limited onboard memory capacity. SwitchML addresses this challenge by co-designing the communication network and the distributed training algorithm, achieving a speedup of up to 5.5 times compared to the state-of-the-art approach.”
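The constraints Canini describes can be illustrated with a small Python sketch. It assumes, purely for illustration, a switch that can only add integers and has room for a handful of values at a time, so workers convert their updates to fixed-point integers and stream them through in small chunks. The class name, the slot count, and the scaling factor are all hypothetical; this is not the SwitchML implementation.

```python
from typing import List, Optional

SCALE = 1 << 16  # hypothetical fixed-point scaling factor


def to_fixed(x: float) -> int:
    """Convert a float to a scaled integer the toy switch can add."""
    return round(x * SCALE)


def from_fixed(n: int) -> float:
    """Convert a summed integer back to a float on the worker side."""
    return n / SCALE


class SlotPoolSwitch:
    """Toy switch: a small fixed pool of integer slots, addition only."""

    def __init__(self, num_slots: int, num_workers: int) -> None:
        self.slots = [0] * num_slots
        self.num_workers = num_workers
        self.seen = 0  # contributions received for the current chunk

    def add_chunk(self, chunk: List[int]) -> Optional[List[int]]:
        """Add one worker's chunk into the slots.

        Returns the summed chunk once every worker has contributed,
        then clears the slots for reuse; otherwise returns None.
        """
        if len(chunk) > len(self.slots):
            raise ValueError("chunk does not fit in the slot pool")
        for i, value in enumerate(chunk):
            self.slots[i] += value
        self.seen += 1
        if self.seen < self.num_workers:
            return None
        result = self.slots[:len(chunk)]
        self.slots = [0] * len(self.slots)
        self.seen = 0
        return result


def aggregate_streaming(updates: List[List[float]],
                        num_slots: int) -> List[float]:
    """Sum equal-length updates by streaming them through the toy switch."""
    switch = SlotPoolSwitch(num_slots, len(updates))
    total: List[float] = []
    for start in range(0, len(updates[0]), num_slots):
        result = None
        for update in updates:
            chunk = [to_fixed(x) for x in update[start:start + num_slots]]
            result = switch.add_chunk(chunk)
        total.extend(from_fixed(n) for n in result)
    return total
```

For example, `aggregate_streaming([[0.5, 1.25, -2.0, 4.0, 0.75], [1.5, 0.25, 1.0, -1.0, 0.25]], num_slots=2)` sums two five-element updates while the toy switch never holds more than two values at once.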

Journal Reference

“Scaling Distributed Machine Learning with In-Network Aggregation” by Amedeo Sapio, Marco Canini, Changhoon Kim, Arvind Krishnamurthy, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Dan Ports, Peter Richtarik, and Masoud Moshref, April 2021, The 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’21).

