Workshop on Architectures and Compilers for Multithreading
December 13-15, 2007
Indian Institute of Technology, Kanpur
Abstracts
Sarita Adve
Title: Memory Consistency Models
Abstract: The memory consistency model for a shared-memory multiprocessor system defines
the values a read may return, and typically involves a tradeoff between
programmability, performance, and portability. It has arguably been one of the
most challenging and contentious areas in shared memory system specification
for several years. Over the last few years, researchers and developers from the
languages community have made a concerted effort to achieve consensus on the
language level memory consistency model. A new model for the Java programming
language was approved in 2005 and a model for C++ is almost finalized. Partly
in response to this work, most hardware vendors have now published memory model
specifications that are compatible with the language level models. These models
reflect a convergence of about 20 years of research in the area. I will
summarize this research and its recent impact on hardware and language-level
consistency models, the remaining open problems in the area, and the
implications for hardware and compiler writers moving ahead.
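To make the problem concrete, the classic "store buffering" litmus test below (an illustrative sketch, not part of the talk) shows what is at stake: with plain shared variables, many compilers and processors permit the outcome r1 == 0 and r2 == 0, which is impossible under a sequentially consistent model. A memory consistency model is precisely what decides whether such outcomes are allowed.

    // Store-buffering litmus test (C++ with POSIX threads).
    #include <pthread.h>
    #include <cstdio>

    int x = 0, y = 0;        // shared, unsynchronized
    int r1 = 0, r2 = 0;

    void* thread1(void*) { x = 1; r1 = y; return 0; }
    void* thread2(void*) { y = 1; r2 = x; return 0; }

    int main() {
        pthread_t t1, t2;
        pthread_create(&t1, 0, thread1, 0);
        pthread_create(&t2, 0, thread2, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        // The memory model defines which (r1, r2) pairs a run may print.
        std::printf("r1 = %d, r2 = %d\n", r1, r2);
        return 0;
    }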
Bio: Sarita V. Adve is Professor in the Department of Computer Science at the
University of Illinois at Urbana-Champaign. She received a Ph.D. from the
University of Wisconsin-Madison in 1993. She recently co-developed the memory
consistency model for Java and the soon-to-be-finalized model for C++, both of
which are strongly influenced by her extensive work in the area over almost 20
years. Her other major current research focus is hardware reliability which
also has deep implications for multicore architectures. Professor Adve was
named a UIUC University Scholar in 2004, received an Alfred P. Sloan Research
Fellowship in 1998, IBM Faculty/Partnership Awards in 2005, 1998, and 1997, and
a National Science Foundation CAREER award in 1995. She served on the National
Science Foundation's CISE directorate's advisory committee from 2003 to 2005.
Frances Allen
Title: Languages and Compilers for Multicore Computing Systems
Abstract: Multi-core computers are ushering in a new era of parallelism
everywhere. As more cores (and parallelism) are added, the potential
performance of the hardware will increase at the traditional rate. But
how will users and applications take advantage of all the parallelism?
This talk will review some of the history of languages and compilers for
high performance systems, and then consider their ability to deliver the
performance potential of multi-core systems. The talk is intended to
encourage the exploration of new approaches.
Bio: Fran Allen is an IBM Fellow Emerita at the IBM T. J. Watson Research
Center, with a specialty in compilers and program optimization for
high performance computers. This work led to Fran being named the
recipient of ACM's 2006 Turing Award "For pioneering contributions to
the theory and practice of optimizing compiler techniques that laid the
foundation for modern optimizing compilers and automatic parallel
execution."
She is a member of the American Philosophical Society and the National
Academy of Engineering, and is a Fellow of the American Academy of Arts
and Sciences, ACM, IEEE, and the Computer History Museum. She has served
on numerous national technology boards including CISE (the Computer and
Information Science and Engineering board) at the National Science
Foundation and CSTB (the Computer Sciences and Telecommunications Board)
for the National Research Council. Her many awards and honors include
honorary doctorates from the University of Alberta (1991), Pace
University (1999), and the University of Illinois at Urbana (2004).
Fran is an active mentor, advocate for technical women in computing,
environmentalist, and explorer.
Saman Amarasinghe
Title: StreamIt - A Programming Language for the Era of Multicores
Abstract: One promising approach to parallel programming is the use of novel
programming language techniques -- ones that reduce the burden on the
programmers, while simultaneously increasing the compiler's ability to get
good parallel performance. In this talk, I will introduce StreamIt: a
language and compiler specifically designed to expose and exploit inherent
parallelism in "streaming applications" such as audio, video, and network
processing. StreamIt provides novel high-level representations to improve
programmer productivity within the streaming domain. By exposing the
communication patterns of the program, StreamIt allows the compiler to
perform aggressive transformations and effectively utilize parallel
resources. StreamIt is ideally suited for multicore architectures; recent
experiments on the 16-core Raw machine demonstrate an 11x speedup over a
single core.
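As a rough illustration of the programming model (sketched here in C++; StreamIt has its own syntax, and the filter names below are invented), a streaming program is a graph of filters that communicate only through explicit input/output channels, which is what lets the compiler safely place stages on different cores:

    #include <vector>

    struct Filter {                       // one stage of a stream pipeline
        virtual ~Filter() {}
        virtual float work(float in) = 0; // pop one item, push one item
    };

    struct Scale : Filter {
        float work(float in) { return 0.5f * in; }
    };

    struct Clamp : Filter {
        float work(float in) { return in > 1.0f ? 1.0f : in; }
    };

    // Run a chain of filters over a stream sequentially; a StreamIt
    // compiler would instead fuse/fission the stages across cores.
    void run(std::vector<Filter*>& stages, std::vector<float>& stream) {
        for (std::size_t i = 0; i < stream.size(); ++i)
            for (std::size_t s = 0; s < stages.size(); ++s)
                stream[i] = stages[s]->work(stream[i]);
    }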
Bio: Saman P. Amarasinghe is an Associate Professor in the Department of
Electrical Engineering and Computer Science at Massachusetts Institute of
Technology and a member of the Computer Science and Artificial Intelligence
Laboratory (CSAIL). His research interests are in discovering novel
approaches to improve the performance of modern computer systems and make
them more secure without unduly increasing the complexity faced by either
the end users, application developers, compiler writers, or computer
architects. Saman received his BS in Electrical Engineering and Computer
Science from Cornell University in 1988, and his MSEE and Ph.D. from Stanford
University in 1990 and 1997, respectively.
Nancy Amato
Title: STAPL: A High Productivity Programming Infrastructure for Parallel
and Distributed Computing
Abstract: The Standard Template Adaptive Parallel Library (STAPL) is a parallel
programming framework that extends C++ and STL with support for
parallelism. STAPL provides parallel data structures (pContainers)
and generic parallel algorithms (pAlgorithms), and a methodology for
extending them to provide customized functionality. By abstracting
much of the complexity of parallelism away from the end user, STAPL
provides a platform for high productivity by enabling the user to
focus on algorithmic design instead of lower level parallel
implementation issues. In this talk, we provide an overview of
the major STAPL components, with a particular focus on the
STAPL pContainers (parallel and distributed data structures) and,
as time allows, discuss STAPL's support for adaptive algorithm
selection and describe how some important scientific applications
(particle transport and protein folding) have been developed in STAPL.
This is joint work with Lawrence Rauchwerger.
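To give a flavor of the STL-like style described above, compare the standard sequential C++ below with its parallel analogue (the STAPL names in the comment are illustrative, not the exact STAPL API):

    #include <vector>
    #include <numeric>
    #include <algorithm>
    #include <cstdlib>

    // Sequential STL version: generate values, then reduce them.
    double sum_values(std::size_t n) {
        std::vector<double> v(n);
        std::generate(v.begin(), v.end(), std::rand);
        return std::accumulate(v.begin(), v.end(), 0.0);
    }

    // A STAPL version would read almost identically: a pContainer such
    // as a parallel vector replaces std::vector, and pAlgorithms replace
    // the STL calls; the runtime then distributes the data and runs both
    // the generation and the reduction in parallel.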
Bio: Nancy M. Amato is a professor of computer science at Texas A&M University.
She received B.S. and A.B. degrees in Mathematical Sciences and Economics,
respectively, from Stanford University, and M.S. and Ph.D. degrees in
Computer Science from UC Berkeley and the University of Illinois at
Urbana-Champaign, respectively. She was an AT&T Bell Laboratories PhD
Scholar, she is a recipient of a CAREER Award from the National Science
Foundation, and she is a Distinguished Lecturer for the IEEE Robotics
and Automation Society. She served as an Associate Editor of the IEEE
Transactions on Robotics and Automation and of the IEEE Transactions on
Parallel and Distributed Systems, she serves on review panels for NIH and
NSF, and she regularly serves on conference organizing and program
committees. She is a member of the Computing Research Association's Committee
on the Status of Women in Computing Research (CRA-W) and she co-directs the
CRA-W's Distributed Mentor Program (http://www.cra.org/Activities/craw/dmp/).
Her main areas of research focus are motion planning, computational
biology and geometry, and high-performance computing. Current projects
include the development of a new technique for approximating protein
folding pathways and energy landscapes, and STAPL, a parallel C++ library
enabling the development of efficient, portable parallel programs.
More information regarding our work can be found at http://parasol.tamu.edu/.
Arvind
Title: A Hardware Design Inspired Methodology for Parallel Programming
Abstract: One source of weakness in parallel programming has been the lack of
compositionality; independently written parallel libraries and packages don't
compose very well. We will argue that perhaps traditional procedural
abstraction and abstract data types don't capture the essential differences
between parallel and sequential programming. We will present a different
notion of modules, based on guarded atomic actions, in which each module is
viewed as a resource to be shared concurrently by other modules. As opposed to implicitly or
explicitly specifying parallelism in a program, we think of parallel
programming as a process of synthesis from a set of modules with proper
interfaces and composition rules. We will draw connections between this
hardware-design inspired methodology and traditional approaches to
multithreaded parallelism including programming based on transactions.
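A toy rendering of a guarded atomic action (a sketch in the abstract's terminology, not code from the talk): each rule is a guard plus a body that executes atomically or not at all, and modules compose by exposing guarded interfaces.

    #include <queue>

    struct Fifo {
        std::queue<int> q;
        std::size_t cap;
        Fifo(std::size_t c) : cap(c) {}
        bool can_enq() const { return q.size() < cap; }   // guard
        void enq(int x)      { q.push(x); }
        bool can_deq() const { return !q.empty(); }       // guard
        int  deq()           { int x = q.front(); q.pop(); return x; }
    };

    // One rule: move a value from fifo a to fifo b. The guard is checked
    // first; a scheduler executes the whole body atomically or not at all.
    bool rule_transfer(Fifo& a, Fifo& b) {
        if (!(a.can_deq() && b.can_enq())) return false;  // guard fails
        b.enq(a.deq());                                   // atomic action
        return true;
    }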
Bio: Arvind is the Johnson Professor of Computer Science and Engineering at MIT
where in the late eighties his group, in collaboration with Motorola, built
the Monsoon dataflow machines and their associated software. In 2000, Arvind
started Sandburst, which was sold to Broadcom in 2006. In 2003, Arvind
co-founded Bluespec Inc., an EDA company to produce a set of tools for
high-level synthesis. In 2001, Dr. R. S. Nikhil and Arvind published the
book "Implicit parallel programming in pH". Arvind's current research
interests are synthesis and verification of large digital systems described
using Guarded Atomic Actions; and Memory Models for parallel architectures and
languages.
Chen Ding
Title: BOP: Software Behavior Oriented Parallelization
Abstract: Many sequential applications are difficult to parallelize because of
complex code, input-dependent parallelism, and the use of third-party
modules. These difficulties led us to build a software system for
behavior oriented parallelization (BOP), which allows a program to be
parallelized based on partial information about program behavior, for
example, a user reading just part of the source code, or a profiling
tool examining merely one or a few executions.
The basis of BOP is programmable software speculation, where a user or
an analysis tool marks possibly parallel regions in the code, and the
run-time system executes these regions speculatively. It is
imperative to protect the entire address space during speculation. In
this talk I will describe the basic features of the prototype system
including the programming interface, parallelism analyzer, and the
run-time support based on strong isolation, value-based checking, and
non-speculative re-execution. On a recently acquired multi-core,
multi-processor PC, the BOP system reduced the end-to-end execution
time by integer factors for a set of open-source and commercial
applications, with no change to the underlying hardware or operating
system.
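A simplified sketch of the execution strategy (the assumptions here are this sketch's own: the marked region is a plain function, and the verification/commit step is elided): running a possibly parallel region in a forked process isolates the entire address space by copy-on-write, which is one way to obtain the strong isolation mentioned above.

    #include <unistd.h>     // POSIX: fork, _exit
    #include <sys/wait.h>   // waitpid

    // Region marked as possibly parallel by the user or a profiling tool.
    void possibly_parallel_region() { /* work elided */ }

    void speculate() {
        pid_t pid = fork();              // process-based strong isolation
        if (pid == 0) {
            possibly_parallel_region();  // speculative run: its writes stay
            _exit(0);                    // private to the child process
        }
        // A real runtime would run the continuation here concurrently,
        // then perform value-based checking and either commit the child's
        // results or fall back to non-speculative re-execution.
        waitpid(pid, 0, 0);
    }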
Bio: Chen Ding is an Associate Professor in the Computer Science Department
at the University of Rochester and presently a Visiting Associate
Professor in the EECS Department at MIT. He received the Early Career
Principal Investigator award from DoE, the CAREER award from NSF, the
CAS Faculty Fellowship from IBM, and a best-paper award from the IEEE
IPDPS. He co-founded the ACM SIGPLAN Workshop on Memory System
Performance and Correctness (MSPC) in 2002 and organized the Workshop
on High-Level Parallel Programming Models and Supportive Environments
(HIPS) in 2007. Between February and August 2007, he was a visiting
researcher at Microsoft Research. More information about his work can
be found at http://www.cs.rochester.edu/~cding/.
Rudolf Eigenmann
Title: Automatic Performance Tuning for Multicore Architectures
Abstract: One of the fundamental limitations of optimizing compilers is the lack of
runtime knowledge about the actual input data and execution platform. Automatic
performance tuning has the potential to overcome this limitation. I will
present the architecture and performance of an automatic performance tuning
system that pursues this goal. The system partitions a program into a number of
tuning sections and finds the best combination of compiler optimizations for
each section.
The performance tuning process includes several pre-tuning steps that partition
a program into suitable tuning sections and instrument them, followed by the actual
tuning and the post-tuning assembly of the individually optimized parts. The
system, called PEAK, achieves fast tuning speed by measuring a small number of
invocations of each code section, instead of the whole-program execution time,
as in previous solutions. Compared to these solutions, PEAK reduces tuning
time from 2.19 hours to 5.85 minutes on average, while achieving similar
program performance. PEAK improves the performance of the SPEC CPU2000 FP
benchmarks by an average of 12% over GCC -O3, the highest optimization level, on a
Pentium 4 machine.
The current system implementation tunes programs during a training run, then
freezes the optimization combination for the production runs. I will discuss
opportunities to perform the optimizations fully dynamically. Also, the only
optimizations being tuned are currently those exposed in the form of compiler
options. In ongoing work, we are exposing to the automatic tuning capability
additional, compiler-internal optimization parameters as well as optimization
parameters of library routines.
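The heart of such a system can be pictured as the search loop below (an illustrative sketch, not PEAK's code; time_invocations is an assumed hook, stubbed here): each tuning section is timed over a few invocations per option combination, and the winning combination is frozen for production runs.

    #include <vector>
    #include <string>
    #include <limits>

    // Assumed hook (stubbed): a real system would rebuild the tuning
    // section with the given compiler options and time n invocations.
    double time_invocations(const std::string& opts, int n) {
        (void)opts; (void)n;
        return 0.0;
    }

    std::string tune_section(const std::vector<std::string>& combos) {
        std::string best;
        double best_time = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < combos.size(); ++i) {
            // Time only a few invocations of this section, not the whole
            // program, which is what makes the search fast.
            double t = time_invocations(combos[i], 10);
            if (t < best_time) { best_time = t; best = combos[i]; }
        }
        return best;  // frozen and reused for production runs
    }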
Bio: Rudolf Eigenmann is a professor at the School of Electrical and Computer
Engineering at Purdue University. He is also the Interim Director of the
Computing Research Institute and Associate Director of Purdue's Cyber
Center. His research interests include optimizing compilers, programming
methodologies and tools, performance evaluation for high-performance computers
and applications, and Internet sharing technology. Dr. Eigenmann received his
Ph.D. in Electrical Engineering/Computer Science in 1988 from ETH Zurich,
Switzerland.
Manish Gupta
Title: A Data-Driven Co-operative Approach to Scaling of Commercial Java Codes
Abstract: The rising power dissipation in microprocessor chips is leading to a
fundamental shift in the computing paradigm, one that requires software to
exploit ever-increasing levels of parallelism. We describe a data-driven
approach, requiring co-operation across multiple layers of software, to scaling
of J2EE codes to large scale parallelism. We present a study of relevant
characteristics of some J2EE programs, which provides evidence of the
applicability of our ideas. For the given benchmarks, we show that a large
percentage of objects, usually greater than 90%, are thread-private.
Furthermore, a large percentage of locking operations on shared objects are on
infrequently written objects. We provide a detailed analysis of lock contention
among user threads, and demonstrate that threads can be naturally grouped based
on lock contention. Overall, we argue that there is a need to tackle the
scalability problem by applying optimizations across the stack (comprising the
application, application server, JVM, operating system, and hardware layers).
Bio: Manish Gupta is the Chief Technology Officer at the IBM India Systems and
Technology Laboratory, and leads efforts to take on challenging new missions at
the lab. He is on assignment from the IBM T. J. Watson Research Center in
Yorktown Heights, NY, where he was a Senior Manager and led research on system
software for the IBM Blue Gene supercomputer and high-end servers. Manish
received a B. Tech. in Computer Science from IIT Delhi in 1987, a Ph.D. from
the University of Illinois at Urbana-Champaign in 1992, and has worked with IBM
since then. He has received two Outstanding Technical Achievement Awards at
IBM, filed over a dozen patents, and has co-authored over 70 papers in the
areas of high performance compilers, parallel computing, and Java Virtual
Machine optimizations.
Maurice Herlihy
Title: Taking Concurrency Seriously: The Multicore Challenge
Abstract: Computer architecture is undergoing, if not another
revolution, then a vigorous shaking-up. The major chip manufacturers
have, for the time being, simply given up trying to make processors
run faster. Instead, they have recently started shipping "multicore"
architectures, in which multiple processors (cores) communicate
directly through shared hardware caches, providing increased
concurrency instead of increased clock speed.
As a result, system designers and software engineers can no longer
rely on increasing clock speed to hide software bloat. Instead, they
must somehow learn to make effective use of increasing parallelism.
This adaptation will not be easy. Conventional synchronization
techniques based on locks and conditions are unlikely to be effective
in such a demanding environment. Coarse-grained locks, which protect
relatively large amounts of data, do not scale, and fine-grained locks
introduce substantial software engineering problems.
Transactional memory is a computational model in which threads
synchronize by optimistic, lock-free transactions. This
synchronization model promises to alleviate many (perhaps not all) of
the problems associated with locking, and there is a growing community
of researchers working on both software and hardware support for this
approach. This talk will survey the area, with a focus on open
research problems.
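The optimistic style underlying transactional memory can be seen in miniature in a compare-and-swap retry loop (a deliberately tiny illustration, not from the talk; real transactional memory generalizes the idea from one word to arbitrary sets of locations):

    // Uses GCC's __sync atomic builtins.
    long counter = 0;

    void optimistic_increment() {
        for (;;) {                        // transaction retry loop
            long seen = counter;          // read the shared location
            long next = seen + 1;         // compute privately
            // Commit only if no other thread changed `counter` in the
            // meantime; on conflict, abort and re-execute.
            if (__sync_bool_compare_and_swap(&counter, seen, next))
                return;
        }
    }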
Bio: Maurice Herlihy received an A.B. degree in Mathematics from Harvard
University and a Ph.D. degree in Computer Science from MIT. He has
been an Assistant Professor in the Computer Science Department at
Carnegie Mellon University, a member of the research staff at Digital
Equipment Corporation's Cambridge (MA) Research Lab, and a consultant
for Sun Microsystems. He is now a Professor of Computer Science at
Brown University. Prof. Herlihy's research centers on practical and
theoretical aspects of multiprocessor synchronization, with a focus on
wait-free and lock-free synchronization. His 1991 paper "Wait-Free
Synchronization" won the 2003 Dijkstra Prize in Distributed Computing,
and he shared the 2004 Gödel Prize for his 1999 paper
"The Topological Structure of Asynchronous Computation." He is a Fellow
of the ACM.
Laxmikant Kale
Title: Simplifying Parallel Programming with Non-Complete Deterministic Languages
Abstract: With multicore machines on the desktops, parallel programming needs to
get down to the "masses" of programmers. Yet, it remains a complex
skill, difficult to master. Performance issues and correctness issues,
substantially beyond those encountered in sequential programming,
make it more complicated than sequential programming. Among existing
paradigms, MPI is considered low-level and suffers from modularity
issues. Shared Address Space programming is no easier, despite the
claims to the contrary, because of the large number of interleavings and
concomitant potential for race conditions. However, in the past several
years, I have started converging towards a model of parallel programming
that may lead us to a solution. The basic ideas in such a model include:
(a) automated resource management via over-decomposition into migratable
objects, (b) languages that sacrifice completeness for simplicity and
determinacy, and (c) an interoperability framework that allows modules
written in many such languages to synergistically coexist in a single
application. I will elaborate on these ideas, with two example languages
called Charisma and MSA, and the interoperability framework defined by
the Charm++ runtime system. I will illustrate them with examples drawn
from a number of scientific and engineering applications I have worked
on over the past 15 years.
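The first ingredient, over-decomposition, can be sketched as follows (schematic C++, not actual Charm++ code; the names are illustrative): the program is written as many more migratable objects than processors, and the adaptive runtime maps and re-maps those objects onto cores for load balance.

    #include <vector>

    // One migratable object (cf. a chare): it owns a small share of the
    // problem and communicates with neighbors via messages (not shown).
    struct Chunk {
        std::vector<double> data;
        void step() {}              // local computation for one iteration
    };

    // Decompose into many more chunks than cores; the adaptive runtime
    // (not shown) assigns chunks to cores and migrates them as load shifts.
    std::vector<Chunk> decompose(std::size_t n_elems, std::size_t n_cores) {
        std::size_t n_chunks = 8 * n_cores;   // over-decomposition factor
        std::vector<Chunk> chunks(n_chunks);
        for (std::size_t i = 0; i < n_chunks; ++i)
            chunks[i].data.resize(n_elems / n_chunks);
        return chunks;
    }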
Bio: Professor Laxmikant Kale has been working on various aspects of parallel
computing, with a focus on enhancing performance and productivity via
adaptive runtime systems, and with the belief that only
interdisciplinary research involving multiple CSE and other applications
can bring back well-honed abstractions into Computer Science that will
have a long-term impact on the state of the art. His collaborations include
the widely used Gordon Bell Prize-winning (SC 2002) biomolecular
simulation program NAMD, and other collaborations on computational
cosmology, quantum chemistry, rocket simulation, space-time meshes, and
other unstructured mesh applications. He takes pride in his group's
success in distributing and supporting software embodying his research
ideas, including Charm++, Adaptive MPI and the ParFUM framework.
L. V. Kale received the B.Tech. degree in Electronics Engineering from
Banaras Hindu University, Varanasi, India, in 1977, and an M.E. degree in
Computer Science from Indian Institute of Science in Bangalore, India,
in 1979. He received a Ph.D. in computer science from the State
University of New York, Stony Brook, in 1985.
He worked as a scientist at the Tata Institute of Fundamental Research
from 1979 to 1981. He joined the faculty of the University of Illinois
at Urbana-Champaign as an Assistant Professor in 1985, where he is
currently employed as a Professor.
Uday Khedker
Title: Efficiency, Precision, Simplicity, and Generality in Interprocedural Data Flow Analysis: Resurrecting the Classical Call Strings Method
Abstract: Context sensitive interprocedural data flow analysis requires
incorporating the effect of all possible calling contexts on the data
flow information at a program point. The call strings approach, which
represents context information in the form of a call string, bounds the
contexts by terminating the call string construction using precomputed
length bounds. These bounds are large enough to guarantee a safe and
precise solution, but usually result in a large number of call strings,
thereby rendering the method impractical.
We propose a simple change in the classical call strings method. Unlike
the traditional approach in which call string construction is orthogonal
to the computation of data flow values, our variant uses the equivalence
of data flow values for termination of call string construction. This
allows us to discard call strings where they are redundant and
regenerate them when required. For the cyclic call strings, regeneration
facilitates iterative computation of data flow values without explicitly
constructing most of the call strings. This reduces the number of call
strings, and hence the analysis time, by orders of magnitude as
corroborated by our empirical measurements.
On the theoretical side, our method reduces the worst case call string
length from quadratic in the size of the lattice to linear. Further, unlike
the classical method, this worst case length need not be reached since
termination does not depend on constructing all call strings up to this
length. Our approach retains the precision, generality, and simplicity
of the call strings method and simultaneously reduces the complexity and
increases efficiency significantly without imposing any additional
constraints.
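In generic notation (a formulation for illustration, not the paper's exact statement), write x(σ, u) for the data flow value at program point u qualified by call string σ. The key observation is that a calling context influences the analysis of a procedure p only through the value reaching p's entry:

    \[
    x(\sigma_1,\mathit{entry}(p)) = x(\sigma_2,\mathit{entry}(p))
    \;\Longrightarrow\;
    x(\sigma_1,u) = x(\sigma_2,u) \quad \text{for all points } u \text{ in } p
    \]

Hence σ2 need not be carried while analysing p; it is discarded at the call and regenerated at the matching return, which is what keeps the number of call strings small.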
Bio: Uday Khedker holds a Ph.D. in Computer Science & Engineering from IIT
Bombay. He taught Computer Science at Pune University from 1994 to 2001
and has been with IIT Bombay since then, where he is currently an
Associate Professor of Computer Science & Engineering.
His areas of interest are Programming Languages and Compilers and
he specializes in data flow analysis and its applications to code
optimization. His current research topics include Interprocedural
Data Flow Analysis, Static Analysis of Heap Allocated Data, Static
Inferencing of Flow Sensitive Polymorphic Types, and Compiler
Verification. A recent research thrust involves cleaning up the GNU
Compiler Collection (GCC) to simplify its deployment, retargeting, and
enhancements. Other goals include increasing its trustworthiness as well
as the quality of generated code.
José Martínez
Title: Core Fusion: Accommodating Software Diversity in Chip Multiprocessors
Abstract: Chip multiprocessors (CMPs) hold the prospect of delivering long-term
performance growth by integrating more cores on the die with each new
technology generation. In the short term, on-chip integration of a few
relatively large cores may yield sufficient throughput when running
multiprogrammed workloads. However, harnessing the full potential of CMPs
in the long term makes a broad adoption of parallel programming
inevitable.
We envision a CMP-dominated future where a diverse landscape of software
in different stages of parallelization exists at all times.
Unfortunately, in this future, the inherent rigidity in current proposals
for CMP designs makes it hard to come up with a "universal" CMP that can
accommodate this software diversity.
In this talk I will discuss Core Fusion, a CMP architecture where cores
can "fuse" into larger cores on demand to execute sequential code very
fast, while still retaining the ability to operate independently to run
highly parallel code efficiently. Core Fusion builds upon a substrate of
fundamentally independent cores and conventional memory coherence/
consistency support, and enables the CMP to dynamically morph into
different configurations to adapt to the changing needs of software at
run-time. Core Fusion does not require specialized software support; it
leverages mature micro-architecture technology, and it can interface with
the application through small extensions encapsulated in ordinary
parallelization libraries, macros, or directives.
Bio: José Martínez (Ph.D. '02, Computer Science, UIUC) is an assistant professor of
electrical and computer engineering and graduate field member of computer
science at Cornell University. He leads the M3 Architecture Research
Group at Cornell, whose interests include multicore architectures,
reconfigurable and self-optimizing hardware, and hardware-software
interaction. Martínez's work has been selected for IEEE Micro Top Picks
twice (2003 and 2007). In 2005, he and his students received the Best
Paper Award at HPCA-11 for their work on checkpointed early load
retirement. Martínez is also the recipient of an NSF CAREER Award and,
more recently, an IBM Faculty Award. His teaching responsibilities at
Cornell include computer architecture at both undergraduate and graduate
levels. He also organizes the AMD Computer Engineering Lecture Series.
Mayur Naik
Title: Effective Static Race Detection for Java
Abstract: Concurrent programs are notoriously difficult to write and debug, a problem
poised to become acute with the recent shift in hardware from uniprocessors
to multicore processors. A fundamental and particularly insidious
concurrency bug is a race: a condition in a shared-memory multithreaded
program in which a pair of threads may access the same memory location
without any ordering enforced between the accesses, and at least one of the
accesses is a write. Despite thirty years of research on race detection,
today's concurrent programs are still riddled with harmful races.
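For concreteness, here is a minimal instance of that definition (written in C++ with POSIX threads for illustration, although Chord itself targets Java): two threads access `balance` with no ordering enforced between them, and one of the accesses is a write, so the pair of accesses races.

    #include <pthread.h>

    int balance = 0;                                   // shared

    void* deposit(void*) { balance += 10; return 0; }  // write access
    void* audit(void*)   { int b = balance; (void)b; return 0; }  // read

    int main() {
        pthread_t t1, t2;
        pthread_create(&t1, 0, deposit, 0);
        pthread_create(&t2, 0, audit, 0);  // no lock orders the accesses
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;
    }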
We present an effective approach to static race detection for Java. We
dissect the specification of a race to identify four natural conditions,
each of which is sufficient for proving a given pair of statements
race-free, but all of which are necessary in practice as different pairs of
statements in a given Java program may be race-free because of different
conditions. We present four novel static analyses each of which
conservatively approximates a separate condition while together enabling the
overall algorithm to report a high-quality set of potential races. We
describe the implementation of our approach in a tool called Chord, and report on
our experience applying it to a suite of multithreaded Java programs.
Our approach is sound in that it is guaranteed to report all races, it is
precise in that it misidentifies few non-races as races, and it is scalable
in that it is fully automatic and checks programs comprising hundreds of
thousands of Java bytecodes in a matter of minutes. Finally, our approach
is effective, finding tens to hundreds of previously unknown concurrency
bugs in mature and widely-used Java programs, many of which were fixed
within a week of reporting.
Bio: Mayur Naik is a Ph.D. student in the Computer Science Department at Stanford
University, where he is advised by Professor Alex Aiken. Mayur's research
interests lie at the boundary of programming languages and software
engineering, with a current focus on concurrency. He obtained a B.E. in
Computer Science from BITS, Pilani, India in 1999 and an M.S. in Computer
Science from Purdue University in 2003. He was awarded a Microsoft
Fellowship in 2004-05.
Ramesh Peri
Title: Software Development Tools for Multi-Core/Parallel Programming
Abstract: The new era of multi-core processors is bringing unprecedented computing
power to mainstream desktop applications. In order to fully exploit this
compute power, one has to delve into the world of parallel programming, which
until today has been the exclusive domain of the high-performance computing
community. This talk will focus on the current state of the art in parallel
programming tools applicable to developers of mainstream parallel applications,
with emphasis on software development tools such as compilers, debuggers,
performance analysis tools, and correctness checking tools for parallel
programs. I will share some of the challenges that developers face today in
developing applications for multi-core systems containing a small number of
homogeneous cores (2 to 8), and discuss the situation we will face with the
advent of systems containing many more heterogeneous cores in the next few
years.
Bio: Ramesh Peri is a Principal Engineer at Intel® Corporation in the Performance
and Threading Tools Lab. He manages a multi-geo group located in Russia and
the United States and is responsible for the development of data collectors
for performance analysis and correctness tools such as Intel® VTune™, Intel®
Thread Checker, and Intel® Thread Profiler. Prior to joining Intel, Ramesh
worked in the area of software development tools at Panasonic AVC Labs, Lucent
Technologies, and Hewlett-Packard. Ramesh received his Ph.D. in computer
science from the University of Virginia in 1995.
Keshav Pingali
Title: Exploiting Data Parallelism in Irregular Programs
Abstract: The parallel programming community has a lot of
experience in exploiting data parallelism in regular
programs that deal with structured data such as
arrays and matrices. However, most client-side applications
deal with unstructured data represented using pointer-based
data structures such as trees and graphs. In her Turing Award
lecture, Fran Allen raised an important question about
such programs: do irregular programs have data parallelism,
and if so, how do we exploit it on multicore processors?
In this talk, we argue using concrete examples that irregular
programs have a generalized kind of data parallelism that arises
from the use of iterative algorithms that manipulate worklists
of various sorts. We then describe the approach taken in the Galois
project to exploit this data-parallelism. There are three main
aspects to the Galois system: (1) a small
number of syntactic constructs for packaging optimistic
parallelism as iteration over ordered and unordered sets, (2)
assertions about methods in class libraries, and (3) a runtime
scheme for detecting and recovering from potentially unsafe
accesses to shared memory made by an optimistic computation. We present
experimental results that demonstrate that the Galois approach is
practical, and discuss ongoing work on this system.
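The kind of loop in question looks like the sketch below (generic C++, not Galois' actual constructs): elements are drawn from a worklist in no particular order, processing one element may generate new work, and iterations that touch disjoint parts of the data structure can be run speculatively in parallel, with the runtime rolling back unsafe ones.

    #include <deque>

    struct Node { bool done; };

    // Work on one element; may push newly created work (schematic).
    void process(Node* n, std::deque<Node*>& worklist) {
        n->done = true;     // e.g., refine or relax this element
        (void)worklist;     // real algorithms may add new elements here
    }

    void worklist_loop(std::deque<Node*>& worklist) {
        while (!worklist.empty()) {      // in Galois: iteration over an
            Node* n = worklist.front();  // unordered set; the runtime
            worklist.pop_front();        // speculates across iterations
            process(n, worklist);
        }
    }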
Bio: Keshav Pingali is a professor in the Computer Science department
at the University of Texas, Austin, where he holds the
W.A."Tex" Moncrief Chair of Grid and Distributed Computing.
He received the B.Tech. degree in Electrical Engineering from
IIT Kanpur, India, in 1978, and the S.M., E.E., and Sc.D. degrees
from MIT in 1986. He was on the faculty of the Department of
Computer Science at Cornell University from 1986 to 2006, where
he held the India Chair of Computer Science.
Pingali's research has focused on programming
languages and compiler technology for program understanding, restructuring,
and optimization. His group is known for its contributions
to memory-hierarchy optimization; some of these have been patented.
Algorithms and tools developed by his projects are used in many
commercial products such as Intel's IA-64 compiler, SGI's MIPSPro
compiler, and HP's PA-RISC compiler. In his current research, he is
investigating optimistic parallelization techniques for multicore
processors, and language-based fault tolerance. Among other awards,
Pingali has won the President's Gold Medal at I.I.T. Kanpur (1978),
an IBM Faculty Development Award (1986-88), an NSF Presidential
Young Investigator Award (1989-94), the Ip-Lee Teaching Award of the
College of Engineering at Cornell (1997), and the Russell
teaching award of the College of Arts and Sciences at Cornell (1998).
In 2000, he was a visiting professor at I.I.T. Kanpur, where
he held the Rama Rao Chaired Professorship.
Lawrence Rauchwerger
Title: Automatic Parallelization with Hybrid Analysis
Abstract: Hybrid Analysis (HA) is a compiler technology
that can seamlessly integrate all static and
run-time analysis of memory references into a
single framework capable of generating sufficient
information for most memory related optimizations.
In this talk, we will present Hybrid Analysis as a framework
to perform automatic parallelization of loops.
For the cases when static analysis does not give conclusive results,
we extract sufficient conditions which are then evaluated
dynamically and can (in)validate the parallel execution of loops.
The HA framework has been fully implemented in the Polaris compiler
and has parallelized 22 benchmark codes with 99% coverage
and speedups superior to the Intel Ifort compiler.
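Schematically, the generated code takes the shape below (a sketch with invented names, not actual Polaris output): when static analysis is inconclusive, the compiler emits a sufficient runtime condition and dispatches between parallel and sequential versions of the loop.

    #include <set>

    // Runtime test extracted by the compiler (simplified): the update
    // a[idx[i]] += ... is independent across iterations if no index repeats.
    bool independent(const int* idx, int n) {
        std::set<int> seen;
        for (int i = 0; i < n; ++i)
            if (!seen.insert(idx[i]).second) return false;
        return true;
    }

    void loop_sequential(double* a, const int* idx, int n) {
        for (int i = 0; i < n; ++i) a[idx[i]] += 1.0;
    }

    void loop_parallel(double* a, const int* idx, int n) {
        // stand-in: a real compiler would emit a parallelized version here
        for (int i = 0; i < n; ++i) a[idx[i]] += 1.0;
    }

    void run(double* a, const int* idx, int n) {
        if (independent(idx, n))        // sufficient condition holds:
            loop_parallel(a, idx, n);   // parallel execution is validated
        else
            loop_sequential(a, idx, n); // otherwise stay sequential
    }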
Bio: Lawrence Rauchwerger is a Professor of Computer Science and of Computer
Engineering in the Department of Computer Science, Texas A&M
University. He is also the co-Director of the Parasol Laboratory.
Lawrence Rauchwerger received an Engineer degree from the Polytechnic
Institute of Bucharest, an M.S. in Electrical Engineering from
Stanford University and a Ph.D. in Computer Science
from the University of Illinois at Urbana-Champaign.
Since 1996 he has been on the faculty of the Department of Computer
Science at Texas A&M where he co-founded and co-directs the Parasol Lab.
He has held Visiting Faculty positions at the
University of Illinois at Urbana-Champaign, Bell Labs,
IBM T.J. Watson Research Center, and INRIA FUTURS, Paris.
Rauchwerger's research has targeted the area of high performance
compilers, libraries for parallel and distributed computing, and
adaptive optimizations and their architectural support.
He is known for introducing software thread-level speculative
parallelization (TLS). Subsequently, he introduced architectural
innovations to support speculative parallelization (together with
Josep Torrellas). He is also known for SmartApps (application
centric optimization), a novel approach to application optimization.
His current focus is STAPL, a parallel superset of the ISO C++ STL
library, driven by his goal of improving the productivity of
parallel software development. His approach to parallel code
development and optimization (STAPL and SmartApps) has influenced
industrial products at major corporations. He has also been very
active in the development of parallel applications in the domains
of nuclear engineering and physics.
Vivek Sarkar
Title: Compiler Challenges for Multicore Parallel Systems
Abstract: This decade marks a resurgence for parallel computing with mainstream
and high-end systems moving to multicore processors. Unlike previous
generations of hardware evolution, this shift will have a major impact
on existing software. At the high end, it is widely recognized by
application experts that past approaches based on domain decomposition
will not scale to exploit the parallelism needed by multicore nodes.
In the mainstream, it is acknowledged by hardware vendors that
enablement of software for execution on multiple cores is the major open
problem that needs to be solved in support of this hardware trend. These
software challenges are further compounded by an increased adoption of
high performance computing in new application domains that may not fit
the patterns of parallelism that have been studied by the community thus
far.
In this talk, we outline the software stacks that are being developed
for multicore parallel systems, and summarize the challenges and
opportunities that they pose to compilers. We discuss new opportunities
for compiler research created by recent work on high productivity
parallel languages and on lightweight concurrency in virtual machines
(managed runtimes). Examples will be given from research projects under
way in these areas, including PGAS languages, Java Concurrency
Utilities, and the X10 language. Finally, we outline the new Habanero
research project initiated at Rice University with the goal of producing
portable parallel software that can run efficiently on a wide range of
homogeneous and heterogeneous multicore systems.
Bio: Professor Vivek Sarkar conducts research in programming languages,
program analysis, compiler optimizations and virtual machines for
parallel and high performance computer systems. His past projects
include the X10 programming language, the Jikes Research Virtual Machine
for the Java language, the ASTI optimizer used in IBM's XL Fortran
product compilers, the PTRAN automatic parallelization system, and
profile-directed partitioning and scheduling of Sisal programs. He is in
the process of starting up the Habanero Multicore Software project at
Rice University, which spans the areas of programming languages,
optimizing and parallelizing compilers, virtual machines, and
concurrency libraries for homogeneous and heterogeneous multicore
processors.
Vivek became a member of the IBM Academy of Technology in 1995, an ACM
Distinguished Scientist in 2006, and the E.D. Butcher Professor of
Computer Science at Rice University in 2007. Prior to joining Rice
University in July 2007, Professor Sarkar was Senior Manager of
Programming Technologies at IBM Research. His responsibilities at IBM
included leading IBM's research efforts in Programming Model, Tools, and
Productivity in the PERCS project during 2002-2007 as part of the
DARPA High Productivity Computing System program. Vivek holds a B.Tech.
degree from the Indian Institute of Technology, Kanpur, an M.S. degree
from University of Wisconsin-Madison, and a Ph.D. from Stanford
University. In 1997, he was on sabbatical as a visiting associate
professor at MIT, where he was a founding member of the MIT RAW
multicore project.
Y. N. Srikant
Title: Energy-aware Compiler Optimizations
Abstract: The importance of saving energy in modern times can hardly be overstressed. With the
prolific increase in the usage of processors in embedded systems of all types, there
is a strong requirement to make batteries last even longer than before, so that
the devices which use them can operate longer without changing batteries. The role
of compilers in producing energy-efficient code is an important one. Compilers
can aid the hardware techniques already available for reducing energy consumption.
In this talk, I will describe some of the present-day techniques available to the
compiler writer for minimizing energy consumption without much performance penalty.
These include instruction scheduling to reduce leakage energy consumption and to
reduce energy consumption in the interconnects, apart from dynamic voltage scaling.
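As background, the standard first-order model of dynamic power (a textbook relation, not specific to the talk) explains why voltage scaling is the big lever:

    \[ P_{\mathrm{dyn}} \;=\; \alpha\, C\, V_{dd}^{2}\, f \]

Lowering the clock frequency f alone reduces power but not the energy per operation, since the task simply runs longer; lowering the supply voltage V_dd together with f cuts energy per operation roughly quadratically. Dynamic voltage scaling exploits exactly this trade of execution time for energy, and compiler support consists in deciding where the slowdown is tolerable.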
Bio: Y. N. Srikant received his B.E. in Electronics from Bangalore University, and his M.E. and
Ph.D. in Computer Science from the Department of Computer Science and Automation at the
Indian Institute of Science. His area of interest is compiler design. He is the
co-editor of a handbook on advanced compiler design published by CRC Press in 2002
(currently under revision).
Josep Torrellas
Title: Lessons Learned in Designing Speculative Multithreaded Hardware
Abstract: Perhaps the biggest challenge facing computer architects
today is how to design parallel architectures that make it
easy for programmers to write parallel codes. In this talk,
I will summarize the lessons learned in the past 10 years
as we examined the design of multiprocessors with speculative
multithreading. I will discuss the uses of this technology
for performance (Thread-Level Speculation, Speculative
Synchronization, Cherry, Bulk, and BulkSC), hardware
reliability (Paceline), and software dependability (ReEnact,
ReSlice, and iWatcher).
Bio: Josep Torrellas (http://iacoma.cs.uiuc.edu) is a Professor and
Willett Faculty Scholar at the University of Illinois. Prior to
being at Illinois, Torrellas received a PhD from Stanford University.
He also spent a sabbatical year as Research Staff Member at IBM's
T.J. Watson Research Center. Torrellas's research area is multiprocessor
computer architecture, focusing on speculative multithreading,
multiprocessor organization, integration of processors and memory,
and architectural support for software dependability and hardware
reliability. He has been involved in the Stanford DASH and the
Illinois Cedar multiprocessor projects, and led the Illinois
Aggressive COMA and FlexRAM Intelligent Memory projects. He has
published over 150 papers in computer architecture. Torrellas is
an IEEE Fellow and the Chairman of the IEEE Technical Committee on
Computer Architecture. He received an NSF Young Investigator Award.
David Wood
Title: Performance Pathologies in Hardware Transactional Memory Systems
Abstract: Hardware Transactional Memory (HTM) systems reflect
choices from three key design dimensions: conflict detection, version
management, and conflict resolution. Previously proposed
HTMs represent three points in this design space: lazy conflict
detection, lazy version management, committer wins (LL); eager
conflict detection, lazy version management, requester wins (EL);
and eager conflict detection, eager version management, and
requester stalls with conservative deadlock avoidance (EE).
To isolate the effects of these high-level design decisions, we
develop a common framework that abstracts away differences in
cache write policies, interconnects, and ISA to compare these three
design points. Not surprisingly, the relative performance of these
systems depends on the workload. Under light transactional loads
they perform similarly, but under heavy loads they differ by up to
80%. None of the systems performs best on all of our benchmarks.
We identify seven performance pathologies--interactions between workload and
system that degrade performance--as the root cause of many performance
differences: FriendlyFire, StarvingWriter, SerializedCommit, FutileStall,
StarvingElder, RestartConvoy, and DuelingUpgrades. We discuss when and on
which systems these pathologies can occur, and show that they actually
manifest within TM workloads. The insight provided by these pathologies
motivated four enhanced systems that often significantly reduce transactional
memory overhead. Importantly, by avoiding transaction pathologies, each
enhanced system performs well across our suite of benchmarks.
Bio: Prof. David A. Wood is a Professor and Romnes Fellow in
the Computer Sciences Department at the University of Wisconsin, Madison.
Dr. Wood also holds a courtesy appointment in the Department of
Electrical and Computer Engineering. Dr. Wood received
a B.S. in Electrical Engineering and Computer Science (1981) and a
Ph.D. in Computer Science (1990), both at the
University of California, Berkeley.
He joined the faculty at the University of Wisconsin in 1990.
Dr. Wood was named an ACM Fellow (2005) and IEEE Fellow (2004),
received the University of Wisconsin's H.I. Romnes Faculty Fellowship
(1999), and received
the National Science Foundation's Presidential Young
Investigator award (1991).
Dr. Wood is Area Editor (Computer Systems) of ACM Transactions
on Modeling and Computer Simulation, is Associate Editor of ACM Transactions
on Architecture and Code Optimization, served as Program Committee
Chairman of ASPLOS-X (2002), and has served on numerous program committees.
Dr. Wood is also a member of the IEEE Computer Society.
Dr. Wood has published over 70 technical papers and is
an inventor on eleven U.S. and international patents.
Dr. Wood co-leads the Wisconsin Multifacet project with Prof. Mark Hill
(URL http://www.cs.wisc.edu/multifacet) which is exploring techniques
for improving the availability, designability, programmability, and
performance of commercial multiprocessor and chip multiprocessor servers.