Seminar Series by Shubu Mukherjee
Architecture Design for Soft Errors
Shubu Mukherjee
Simulation and Pathfinding of Efficient and Reliable Systems Group,
Intel Corporation
Monday, 19th October: 11am to noon, CS101
Tuesday, 20th October: 5pm to 6pm, CS101
Wednesday, 21st October: 5pm to 6pm, CS101
Thursday, 22nd October: 5pm to 6pm, CS101
Abstract:
As kids, many of us were fascinated by black holes and solar flares in deep space. Little did we know that particles from deep space could affect computing systems at the surface of the earth, causing blue screens and incorrect bank balances. CMOS technology has shrunk to a point where radiation from deep space and packaging material has started causing such malfunctions at an increasing rate. These radiation-induced errors are termed "soft" since the state of one or more bits in a silicon chip could flip temporarily without damaging the hardware. The lack of any appropriate shielding material has caused the design community to look for process, circuit, architectural, and software solutions to mitigate the effect of soft errors.
This tutorial will cover architectural techniques to tackle the soft error problem and is based on Shubu Mukherjee's recent book, "Architecture Design for Soft Errors." Computer architecture has long coped with various types of faults, including ones induced by radiation. For example, error correction codes are commonly used in memory systems. High-end systems have often used redundant copies of hardware to detect faults and recover from errors. Many of these solutions have, however, been prohibitively expensive and difficult to justify in the mainstream commodity computing market.
The necessity to find cheaper reliability solutions has driven a whole new class of quantitative analysis of soft errors and corresponding solutions that mitigate their effects. This tutorial will cover the new methodologies for quantitative analysis of soft errors as well as novel cost-effective architectural techniques to mitigate them. This tutorial will also re-evaluate traditional architectural solutions in the context of the new quantitative analysis.
More specifically, this four-part tutorial will cover:
* Introduction to soft errors
* Basic understanding of the physics of soft errors
* Prevalent circuit techniques to protect against soft errors
* How to analyze soft errors quantitatively at the architecture level
* How to reduce soft errors using error correction codes (without going into number theory)
* How to detect soft errors using redundant computation
* How to recover from a soft error once we detect an error
Much of the material in this tutorial will be based on Mukherjee's book, "Architecture Design for Soft Errors."
About The Speaker:
Shubu Mukherjee is a Principal Engineer and Director of Intel's SPEARS Group (Simulation and Pathfinding of Efficient and Reliable Systems). The SPEARS Group is responsible for spearheading architectural change and innovation in the delivery of enterprise processors and chipsets by building and supporting simulation and analytical models of performance, power, and reliability. Dr. Mukherjee is widely recognized both within and outside Intel as one of the experts on architecture design for soft errors. He has made pioneering contributions towards architectural vulnerability modeling for soft errors, Redundant Multithreading (RMT) techniques, the design of Intel's System Environment Monitoring Agent (SEMA) that runs on more than 300,000 processor cores within Intel, creation of performance modeling infrastructures called Cameroon (jointly with a team of Intel engineers) and Asim (jointly with Dr. Joel Emer), design of the Alpha 21364 interconnection network, and the creation of the first shared memory prediction scheme.
Prior to joining Intel, Dr. Mukherjee worked at Compaq for 3 years and at Digital Equipment Corporation for 10 days. Dr. Mukherjee received his B.Tech. from the Indian Institute of Technology, Kanpur, where he now serves as adjunct faculty. He received his M.S. and PhD from the University of Wisconsin-Madison. He is a Fellow of IEEE. He won the 2009 Maurice Wilkes Award for his work on soft errors. He was the General Chair of ASPLOS (Architectural Support for Programming Languages and Operating Systems), 2004. He has co-authored over 50 external papers. He holds 25 patents and has filed another 23 at Intel. Dr. Mukherjee's book, "Architecture Design for Soft Errors," appeared on the market in February 2008. Dr. Mukherjee serves on the Editorial Boards of IEEE Computer Architecture Letters (CAL) and IEEE Micro, as an Associate Editor of IEEE Transactions on Dependable and Secure Computing (TDSC), on National Science Foundation (NSF) panels, on numerous technical program committees, on Intel Corporation's patent committee, and on the Board of Trustees of Merrimack Repertory Theatre.