Delay Sensitivity-driven Congestion Mitigation for HPC Systems
**Archit Patke**, Saurabh Jha, Haoran Qiu, Jim Brandt, Ann Gentile, Joe Greenseid, Zbigniew Kalbarczyk, and Ravishankar Iyer
**ICS 2021**
Abstract
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observedin production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric that quantifies the impact of congestionon application runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation showsthat Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3× while improving the median systemutility by 12%.