William J. Dally

Last updated July 27, 2010

Bill Dally is the Willard R. and Inez Kerr Bell Professor of Computer Science and Electrical Engineering and former Chairman of the Computer Science Department at Stanford University. He is a member of the Computer Systems Laboratory, leads the Concurrent VLSI Architecture Group, and teaches courses on Computer Architecture, Computer Design, and VLSI Design. He is a Member of the National Academy of Engineering, a Fellow of the American Academy of Arts & Sciences, a Fellow of the IEEE, and a Fellow of the ACM. He received the ACM Maurice Wilkes Award in 2000, the IEEE Seymour Cray Award in 2004, and the ACM Eckert-Mauchly Award in 2010. He has an h-index of 60.

Before coming to Stanford, Bill was a Professor in the Department of Electrical Engineering and Computer Science at MIT.

Current Projects

ELM: The Efficient Low-Power Microprocessor
We are developing an architecture that is easily programmable in a high-level language ("C") and at the same time delivers performance per unit power competitive with hard-wired logic and 20-30x better than conventional embedded RISC processors. These power savings are achieved by using more efficient mechanisms for instruction supply, based on compiler-managed instruction registers, and for data supply, using a deeper register hierarchy and indexable registers.
Enabling Technology for On-Chip Networks
As CMPs and SoCs scale to include large numbers of cores and other modules, the on-chip network or NoC that connects them becomes a critical systems component. We are developing enabling technology for on-chip networks including network topologies, flow control mechanisms, and router organizations. For example, our flattened butterfly topology offers both lower latency and substantially reduced power compared to conventional on-chip mesh or ring networks.
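To make the topology comparison concrete, the following is a minimal sketch that counts router-to-router hops only. It assumes a k x k grid with one router per tile, minimal routing, and a two-dimensional flattened butterfly in which every router has a direct link to every other router in its row and column. The function names are illustrative, and hop count is only a rough proxy: actual latency and power also depend on link length, concentration, and router design.

```python
from itertools import product


def mesh_hops(src, dst):
    """Minimal hop count in a 2D mesh: the Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def flattened_butterfly_hops(src, dst):
    """Minimal hop count when each router links directly to all routers
    in its row and column: one hop per dimension that differs."""
    return int(src[0] != dst[0]) + int(src[1] != dst[1])


def _distinct_pairs(k):
    """All ordered (source, destination) pairs of distinct routers."""
    nodes = list(product(range(k), repeat=2))
    return [(s, d) for s in nodes for d in nodes if s != d]


def average_hops(k, hop_fn):
    """Average minimal hop count over all distinct router pairs."""
    pairs = _distinct_pairs(k)
    return sum(hop_fn(s, d) for s, d in pairs) / len(pairs)


def max_hops(k, hop_fn):
    """Worst-case minimal hop count (network diameter in hops)."""
    return max(hop_fn(s, d) for s, d in _distinct_pairs(k))


if __name__ == "__main__":
    for k in (4, 8):
        print(
            f"k={k}: mesh avg={average_hops(k, mesh_hops):.2f} "
            f"max={max_hops(k, mesh_hops)}; "
            f"flattened butterfly avg="
            f"{average_hops(k, flattened_butterfly_hops):.2f} "
            f"max={max_hops(k, flattened_butterfly_hops)}"
        )
```

Under these assumptions the flattened butterfly never needs more than two hops, while the mesh diameter grows as 2(k-1). The trade-off is that each router needs more ports and some links are longer.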
Sequoia: Programming the Memory Hierarchy
Sequoia is a programming language designed to facilitate the development of memory-hierarchy-aware parallel programs that remain portable across modern machines with different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within the machine. A complete Sequoia programming system has been implemented, including a compiler and runtime systems for both Cell processors and distributed memory clusters, and it delivers efficient performance running Sequoia programs on both of these platforms. An alpha version of this programming system will soon be made public.
Scalable Network Fabrics
We are developing architectures and technologies to enable large, scalable high-performance interconnection networks to be used in parallel computers, network switches and routers, and high-performance I/O systems. Recent results include the development of a hierarchical network topology that makes efficient use of a combination of electrical and optical links, a locality-preserving randomized oblivious routing algorithm, a method for scheduling constrained crossbar switches, new speculative and reservation-based flow control methods, and a method for computing the worst-case traffic pattern for any oblivious routing function.

Recent Projects

Streaming Supercomputer
We are developing a streaming supercomputer (SS) that scales from a single chip to thousands of chips. We estimate that it will achieve an order of magnitude or more improvement in performance per unit cost over conventional cluster-based supercomputers on a wide range of demanding numerical computations. To achieve this goal, the SS combines stream processing with a high-performance network that provides access to a globally shared memory.
Imagine: A High-Performance Image and Signal Processor
Imagine is a programmable signal and image processor that provides the performance and performance density of a special-purpose processor. Imagine achieves a peak performance of 20GFLOPS (single-precision floating point) and 40GOPS (16-bit fixed point) and sustains over 12GFLOPS and 20GOPS on key signal processing benchmarks. Imagine sustains a power efficiency of 3.7GFLOPS/W on these same benchmarks, a factor of 20 better than the most efficient conventional signal processors.
Smart Memories
We are investigating combined processor/memory architectures that are best able to exploit 2009 semiconductor technologies. We envision these architectures being composed of tens to hundreds of processors and memory banks on a single semiconductor chip. Our research addresses the design of the processors and memories, the architecture of the interconnection network that ties them together, and mechanisms to simplify programming of such machines.
High-Speed Signalling
We are developing methods and circuits that stretch the performance bounds of electrical signalling between chips, boards, and cabinets in a digital system. A prototype 0.25um 4Gb/s CMOS transceiver has been developed that dissipates only 130mW, making it amenable to large-scale integration. Future chips include a 20Gb/s 0.13um CMOS transceiver.
The M-Machine
is an experimental parallel computer that demonstrated highly efficient mechanisms for parallelism including two-level multithreading, efficient network interfaces, fast communication and synchronization, and support for efficient shared memory protocols.
The Reliable Router
is a high-performance multicomputer router that demonstrates new technologies ranging from architecture to circuit design. At the architecture level the router uses a novel adaptive routing algorithm, a link-level retry protocol, and a unique token protocol. Together the two protocols greatly reduce the cost of providing reliable, exactly-once end-to-end communication. At the circuit level the router demonstrates the latest version of our simultaneous bidirectional pads and a new method for plesiochronous synchronization.
The J-Machine
is an experimental parallel computer, in operation since July 1991, that demonstrates mechanisms that greatly reduce the overhead involved in inter-processor interaction.


A complete list of publications and citations is available from Google Scholar. Publications can be found at the CVA group publications page.

Some selected publications are included below:


Bill is Chief Scientist at NVIDIA, where he spent 2009 and 2010 on leave from Stanford as Chief Scientist and Senior Vice President of Research.

Bill has played a key role in founding several companies including:

Stream Processors Inc. (2004-2009)
Founded to commercialize stream processors for embedded applications.
Velio Communications (CTO 1999-2003)
Velio pioneered high-speed I/O circuits and applied this technology to integrated TDM and packet switching chips. Velio's I/O technology was acquired by Rambus and Velio itself was acquired by LSI Logic.
Avici Systems, Inc. (1997-present)
Manufactures core Internet routers with industry-leading scalability and reliability.

Bill has also worked with Cray since 1989 on the development of many of their supercomputers including the T3D and T3E.


CVA People

William J. Dally
<dally "at" stanford "dot" edu>
Stanford University
Computer Systems Laboratory
Gates Room 301
Stanford, CA 94305
(650) 725-8945
FAX: (650) 725-6949