Measuring Congestion in High-Performance Datacenter Interconnects

Saurabh Jha, **Archit Patke**, Jim Brandt, Ann Gentile, Benjamin Lim, Mike Showerman, Greg Bauer, Larry Kaplan, Zbigniew Kalbarczyk, William Kramer and Ravishankar Iyer

**NSDI 2020**


Abstract

While it is widely acknowledged that network congestion in High Performance Computing (HPC) systems can significantly degrade application performance, there has been little to no quantification of congestion on credit-based interconnect networks. We present a methodology for detecting, extracting, and characterizing regions of congestion in networks. We have implemented the methodology in a deployable tool, Monet, which can provide such analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world’s largest 3D torus network of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of production congestion at a scale that has never been evaluated before.