Title: Performance Optimization for Large-Scale Distributed Machine-Learning Applications: A Swiss-Army-Knife Approach
Dr. Yonggang Wen is an associate professor with the School of Computer Science and Engineering (SCSE) at Nanyang Technological University (NTU), Singapore. He is also the Assistant Chair for Innovation at SCSE and the founding director of the SCSE Innovation Lab at NTU. He received his PhD degree in Electrical Engineering and Computer Science (minor in Western Literature) from the Massachusetts Institute of Technology (MIT), Cambridge, USA, in 2008. Previously, he worked at Cisco, where he led product development in content delivery networks with a global revenue impact of 3 billion US dollars. Dr. Wen has published over 170 papers in top journals and prestigious conferences. His systems research has gained global recognition. His work on Multi-Screen Cloud Social TV has been featured by global media (more than 1,600 news articles from over 29 countries) and received the ASEAN ICT Award 2013 (Gold Medal). His work on Cloud3DView for data centre life-cycle management, as the only academic entry, won the 2015 Data Centre Dynamics Awards – APAC (the ‘Oscar’ of the data centre industry) and the 2016 ASEAN ICT Awards (Gold Medal). He is the winner of the 2017 Nanyang Award for Innovation and Entrepreneurship, the highest recognition at NTU. He is a co-recipient of Best Paper Awards at 2016 IEEE Globecom, the 2016 IEEE Infocom MuSIC Workshop, 2015 EAI Chinacom, 2014 IEEE WCSP, 2013 IEEE Globecom and 2012 IEEE EUC, and a co-recipient of the 2015 IEEE Multimedia Best Paper Award. He serves on the editorial boards of IEEE Communications Surveys & Tutorials, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Wireless Communications, IEEE Transactions on Signal and Information Processing over Networks, IEEE Access and Elsevier Ad Hoc Networks, and was elected Chair of the IEEE ComSoc Multimedia Communication Technical Committee (2014–2016).
His research interests include cloud computing, green data centers, big data analytics, multimedia networking and mobile computing.
The parameter server (PS) framework is widely used to train machine learning (ML) models in parallel. Many distributed ML systems have been designed based on the PS framework, such as Petuum, MXNet, TensorFlow and Factorbird. It tackles the big-data problem by having worker nodes perform data-parallel computation, while server nodes maintain globally shared parameters. When training big models, worker nodes frequently pull parameters from server nodes and push updates back, often resulting in high communication overhead. Our investigations show that modern distributed ML applications can spend up to 5 times more time on communication than on computation. To address this problem, we propose a novel communication layer for the PS framework named Parameter Flow (PF), which employs three techniques. First, we introduce an update-centric communication (UCC) model to exchange data between worker and server nodes via two operations: broadcast and push. Second, we develop a dynamic value-bounded filter (DVF) to reduce network traffic by selectively dropping updates before transmission. Third, we design a tree-based streaming broadcasting (TSB) system to efficiently broadcast aggregated updates among worker nodes. Our proposed PF can significantly reduce network traffic and communication time. Extensive performance evaluations have shown that PF can speed up popular distributed ML applications by a factor of up to 4.3 in a dedicated cluster, and up to 8.2 in a shared cluster, compared to a generic PS system without PF.
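The value-bounded filtering idea can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function name, the fixed threshold, and the residual-accumulation step below are illustrative assumptions. The core idea is that a worker transmits only updates whose magnitude exceeds a bound, while dropped values are retained locally so their contribution can be folded into later rounds.

```python
import numpy as np

def value_bounded_filter(updates, threshold):
    """Illustrative sketch: drop updates whose magnitude is below `threshold`.

    Returns the surviving (index, value) pairs for transmission, plus a
    residual vector of the dropped values, which a worker could accumulate
    locally and add back into future rounds so no update is lost.
    """
    mask = np.abs(updates) >= threshold
    kept = [(int(i), float(updates[i])) for i in np.flatnonzero(mask)]
    residual = np.where(mask, 0.0, updates)  # values suppressed this round
    return kept, residual

# In a dynamic variant, the threshold would be adjusted over time, e.g.
# tightened when traffic is high and relaxed as updates shrink near
# convergence; the adjustment policy here is left out for brevity.
```

Sending only the sparse `(index, value)` pairs is what cuts network traffic: in late training, most per-round updates are tiny, so the surviving set is far smaller than the full dense vector.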