Devesh Tiwari, Ph.D. - Global Resilience Institute. Boston, MA, US

Devesh Tiwari, Ph.D.

Assistant Professor, Electrical and Computer Engineering | Affiliated Assistant Professor, Computer Science and Information Sciences, Northeastern University | Faculty Affiliate, Global Resilience Institute


Professor Tiwari focuses on designing sustainable, resilient, and scalable systems.





Devesh's research focuses on designing sustainable, resilient, and scalable systems, with special emphasis on understanding and exploiting cross-layer interactions. He also applies high performance computing and data analytics expertise to emerging interdisciplinary research domains. His publications have received best paper award nominations at conferences including Supercomputing (SC), Dependable Systems and Networks (DSN), and the Parallel & Distributed Processing Symposium (IPDPS). His work has appeared at conferences such as USENIX FAST, SC, DSN, HPCA, MICRO, and IPDPS, and has been covered by news media including Slashdot and HPCWire.

Before joining Northeastern, Devesh was a staff scientist at the Oak Ridge National Laboratory, a flagship multiprogram science and technology national laboratory of the United States Department of Energy (DOE). He earned his Ph.D. in Electrical and Computer Engineering from North Carolina State University. Before that, he obtained his B.S. degree in Computer Science and Engineering from the Indian Institute of Technology (IIT) Kanpur in India.

Areas of Expertise (3)

Machine Learning and Big Data Analytics

Security and Systems/Tools and Measurement

Sustainable and Resilient Systems

Education (2)

North Carolina State University: Ph.D., Electrical and Computer Engineering

Indian Institute of Technology, Kanpur: B.S., Computer Science

Media Appearances (1)

Supercomputing has a sustainability problem. This researcher is working to fix it

News @ Northeastern  


Devesh Tiwari, a newly appointed assistant professor of electrical and computer engineering in the College of Engineering, says this just isn’t sustainable.

Tiwari is familiar with Titan and its operational needs. Prior to joining Northeastern this semester, he worked as a staff scientist at the Oak Ridge National Laboratory, which is funded by the U.S. Department of Energy and where the supercomputer is housed...


Articles (4)

Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Rohan Garg, Tirthak Patel, Gene Cooperman, & Devesh Tiwari


Continued increase in computing power has enabled computational scientists to expedite the scientific research and discovery process in the past. Unfortunately, significant rise in the failure rates and a widening gap between compute and I/O system will significantly limit the usability of parallel computing systems in the future.


Machine Learning Models for GPU Error Prediction in a Large Scale HPC System 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, & Devesh Tiwari


Over the past decade, GPUs have become an integral part of mainstream high performance computing facilities thanks to the fact that they allow to simulate physical phenomena more quickly and accurately (i.e., at a finer granularity) [1]–[3]. As GPUs are more widely adopted in scale-out computing architectures, GPU soft errors become a critical challenge. Reliable execution of applications can lead to higher productivity and lower I/O overhead. However, understanding the source of GPU soft errors itself is challenging.


Failures in large scale systems: long-term measurement, analysis, and implications Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis

Saurabh Gupta, Tirthak Patel, Christian Engelmann, & Devesh Tiwari


Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.


Granularity and the cost of error recovery in resilient AMR scientific applications Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Anshu Dubey, Hajime Fujita, Daniel Graves, Andrew Chien, & Devesh Tiwari


Supercomputing platforms are expected to have larger failure rates in the future because of scaling and power concerns. The memory and performance impact may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built with tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.
