Devesh's research focuses on designing sustainable, resilient, and scalable systems, with special emphasis on understanding and exploiting cross-layer interactions. His research interests also include applying high performance computing and data analytics expertise to emerging interdisciplinary research domains. His publications have received best paper award nominations at conferences including Supercomputing (SC), Dependable Systems and Networks (DSN), and the International Parallel and Distributed Processing Symposium (IPDPS). His work has appeared at conferences such as USENIX FAST, SC, DSN, HPCA, MICRO, and IPDPS, and has been covered by news media including Slashdot and HPCWire.
Before joining Northeastern, Devesh was a staff scientist at the Oak Ridge National Laboratory, a flagship multiprogram science and technology national laboratory of the United States Department of Energy (DOE). Devesh earned his Ph.D. in Electrical and Computer Engineering from North Carolina State University. Before that, he obtained his B.S. degree in Computer Science and Engineering from the Indian Institute of Technology (IIT) Kanpur in India.
Areas of Expertise (3)
North Carolina State University: Ph.D., Electrical and Computer Engineering
Indian Institute of Technology, Kanpur: B.S., Computer Science
Media Appearances (1)
Supercomputing has a sustainability problem. This researcher is working to fix it
News @ Northeastern
Devesh Tiwari, a newly appointed assistant professor of electrical and computer engineering in the College of Engineering, says this just isn’t sustainable.
Tiwari is familiar with Titan and its operational needs. Prior to joining Northeastern this semester, he worked as a staff scientist at the Oak Ridge National Laboratory, which is funded by the U.S. Department of Energy and where the supercomputer is housed...
Rohan Garg, Tirthak Patel, Gene Cooperman, & Devesh Tiwari
Continued increases in computing power have enabled computational scientists to expedite the scientific research and discovery process. Unfortunately, a significant rise in failure rates and a widening gap between compute and I/O systems will significantly limit the usability of parallel computing systems in the future.
Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, & Devesh Tiwari
Over the past decade, GPUs have become an integral part of mainstream high performance computing facilities because they allow physical phenomena to be simulated more quickly and accurately (i.e., at a finer granularity). As GPUs are more widely adopted in scale-out computing architectures, GPU soft errors become a critical challenge. Reliable execution of applications can lead to higher productivity and lower I/O overhead. However, understanding the source of GPU soft errors is itself challenging.
Saurabh Gupta, Tirthak Patel, Christian Engelmann, & Devesh Tiwari
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.
Anshu Dubey, Hajime Fujita, Daniel Graves, Andrew Chien, & Devesh Tiwari
Supercomputing platforms are expected to have higher failure rates in the future because of scaling and power concerns. The memory and performance impact may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built with tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.