Biography
Devesh's research focuses on designing sustainable, resilient, and scalable systems, with special emphasis on understanding and exploiting cross-layer interactions. His research interests also include applying high performance computing and data analytics expertise to emerging interdisciplinary research domains. His publications have received best paper award nominations at conferences including Supercomputing (SC), Dependable Systems and Networks (DSN), and Parallel & Distributed Processing Symposium (IPDPS). His work has appeared in conferences such as USENIX FAST, SC, DSN, HPCA, MICRO, and IPDPS, and has been covered by news media including Slashdot and HPCWire.
Before joining Northeastern, Devesh was a staff scientist at the Oak Ridge National Laboratory, a flagship multiprogram science and technology national laboratory of the United States Department of Energy (DOE). Devesh earned his Ph.D. in Electrical and Computer Engineering from North Carolina State University. Before that, he obtained his B.S. degree in Computer Science and Engineering from the Indian Institute of Technology (IIT) Kanpur in India.
Areas of Expertise (3)
Machine Learning and Big Data Analytics
Security and Systems/Tools and Measurement
Sustainable and Resilient Systems
Education (2)
North Carolina State University: Ph.D., Electrical and Computer Engineering
Indian Institute of Technology, Kanpur: B.S., Computer Science
Media Appearances (1)
Supercomputing has a sustainability problem. This researcher is working to fix it
News @ Northeastern
2017-01-13
Devesh Tiwari, a newly appointed assistant professor of electrical and computer engineering in the College of Engineering, says this just isn’t sustainable. Tiwari is familiar with Titan and its operational needs. Prior to joining Northeastern this semester, he worked as a staff scientist at the Oak Ridge National Laboratory, which is funded by the U.S. Department of Energy and where the supercomputer is housed...
Articles (4)
Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput
2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Rohan Garg, Tirthak Patel, Gene Cooperman, & Devesh Tiwari
2018 Continued increase in computing power has enabled computational scientists to expedite the scientific research and discovery process in the past. Unfortunately, a significant rise in failure rates and a widening gap between compute and I/O systems will significantly limit the usability of parallel computing systems in the future.
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System
2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, & Devesh Tiwari
2018 Over the past decade, GPUs have become an integral part of mainstream high performance computing facilities because they allow physical phenomena to be simulated more quickly and accurately (i.e., at a finer granularity) [1]–[3]. As GPUs are more widely adopted in scale-out computing architectures, GPU soft errors become a critical challenge. Reliable execution of applications can lead to higher productivity and lower I/O overhead. However, understanding the source of GPU soft errors is itself challenging.
Failures in large scale systems: long-term measurement, analysis, and implications
Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis
Saurabh Gupta, Tirthak Patel, Christian Engelmann, & Devesh Tiwari
2017 Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.
Granularity and the cost of error recovery in resilient AMR scientific applications
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Anshu Dubey, Hajime Fujita, Daniel Graves, Andrew Chien, & Devesh Tiwari
2016 Supercomputing platforms are expected to have larger failure rates in the future because of scaling and power concerns. The memory and performance impact may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built with tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.