CPA-CSA-T:
ATAC: Enhancing Multicore Programmability through All-to-All Computing
PROJECT SUMMARY
The computing world has made a generational shift to multicore as a way of addressing the Moore’s gap, which is the growing disparity between the performance offered by sequential processors and the scaling expectations set by Moore’s law. Two- or four-core multicores are commonplace today, with scaling to thousands of cores expected by the middle of the next decade. Unfortunately, because even two- and four-core multicores (let alone thousand-core multicores) are extremely hard to program, multicore’s widespread acceptance is threatened [60]. The multicore programming challenge is a serious issue and requires us to think about bold new approaches to architecture, programming, and software.
The ATAC project is based on one simple idea: a low-latency, low-energy broadcast mechanism from any core to all other cores can yield a big step forward in multicore programmability. The broadcast mechanism is enabled by CMOS-integrated chip-level optical interconnection using WDM (wavelength-division multiplexing) with multiple add/drop points. The optical interconnect augments a traditional electrical mesh interconnect in a tiled multicore processor. We believe that although point-to-point electrical interconnect is capable of delivering performance that is competitive with on-chip optical interconnect, it does not solve the programmability problem. Thus, in ATAC, the optical broadcast capability does not replace the basic electrical interconnect; it simply augments it for programmability. Previous work by the co-PIs of this proposal has demonstrated that an on-chip optical broadcast network integrated with a standard CMOS process is feasible to build, and that exposing the broadcast mechanism to programmers through a software API can yield significant programmability gains.
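To make the software interface concrete, the sketch below shows what a core-to-all-cores broadcast API might look like to the programmer. It is purely illustrative: the function names and signatures are placeholders invented for exposition, not the actual ATAC interface.

```c
/* Hypothetical sketch of a core-to-all-cores broadcast API.
 * All names and signatures are illustrative placeholders, not the ATAC API. */
#include <stddef.h>

int atac_core_id(void);        /* id of the calling core                      */
int atac_num_cores(void);      /* total number of cores on the chip           */

/* Broadcast 'len' bytes from the calling core to every other core. On the
 * optical WDM network, delivery time would be roughly independent of the
 * number of cores, unlike a software broadcast tree over an electrical mesh. */
int atac_broadcast(const void *buf, size_t len);

/* Receive the next broadcast message; '*src' is set to the sender's id.      */
int atac_bcast_receive(void *buf, size_t max_len, int *src);
```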
Accordingly, to drastically ease programming for multicores, we propose the ATAC computer architecture, which augments an on-chip mesh network with an on-chip optical broadcast network. Such a network enables blazingly fast broadcast communication that will allow programmers to take full advantage of the multicore opportunity, even as multicores scale to thousands of cores. Although this capability has the potential to greatly speed up existing algorithms, its biggest appeal lies in its ability to facilitate new, easy-to-use programming models. An efficient broadcast mechanism also allows for novel, distributed coherent-shared-memory architectures.
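Continuing the illustration, the fragment below shows one way a programmer might use such a primitive: a single core publishes an updated parameter block, and every other core receives it with one call, with no per-destination send loop and no hand-built broadcast tree. All names are again the hypothetical placeholders from the sketch above.

```c
/* Illustrative use of the hypothetical broadcast primitive sketched above
 * (placeholder names, not the actual ATAC API). One core publishes updated
 * parameters; all other cores receive them with a single call.              */
#include <stddef.h>

typedef struct { double step_size; int iteration; } params_t;

extern int atac_broadcast(const void *buf, size_t len);
extern int atac_bcast_receive(void *buf, size_t max_len, int *src);

void publish_params(const params_t *p) {
    /* One broadcast replaces N-1 point-to-point sends.                      */
    atac_broadcast(p, sizeof *p);
}

void await_params(params_t *p) {
    int src;
    atac_bcast_receive(p, sizeof *p, &src);  /* blocks until the update arrives */
}
```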
We have assembled a cross-disciplinary team with expertise in computer architecture, programming languages and compilers, VLSI design, and integrated microphotonics. Our team, led by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT Microphotonics Center (MPC), works in collaboration with the MIT Microsystems Technology Laboratories (MTL), Sandia National Laboratories, and Intel Corp. to create a new computing platform that has the potential to significantly simplify multicore programming.
The ATAC project proposes to design and build a prototype computer system using the ATAC approach. This includes the ATAC computer architecture design, a detailed computer system simulator, a compiler system, a runtime system, a programming model and associated APIs that leverage the broadcast capability of the underlying ATAC multicore, and architectural models and interfaces for the optical interconnect technology. The optical interconnect design and fabrication are supported by DARPA’s EPIC (Electronic and Photonic Integrated Circuits) program. The multicore simulator will be developed in collaboration with Intel. The simulator is based on the Pin dynamic instrumentation infrastructure and will enable us to simulate thousands of cores on our existing compute server farm of over 100 processors.
The proposed research will make five fundamental contributions. The first contribution is increased programmability for multicores due to ATAC’s seamless integration of the efficient broadcast facility enabled by an on-chip optical network along with the electrical mesh network. The second contribution is the development of Application Programming Interfaces (APIs) and programming languages that allow algorithms to best take advantage of ATAC’s on-chip communication infrastructure while minimizing the burden on the programmer. The third contribution is the development of appropriate metrics that assess programmer productivity when developing parallel applications on multicore architectures. Fourth, this research will implement a multicore simulator that can simulate thousands of cores, and can run on current multicore hardware. Finally, this research will assess how well multicores scale to thousands of cores, especially in light of performance and energy scalability.
The broader impacts of this work will include easing the multicore programming burden for future multicore systems, thereby removing a fundamental barrier to the mass adoption of multicore. It will also create the “killer app” for on-chip optical interconnect, thereby bringing this technology into the mainstream. The project will also train undergraduate and graduate students in system building and in multicore software and hardware technologies. As with our previous Raw and Alewife projects, we will involve several undergraduate students in our research, thereby helping to train the next generation of multicore researchers and programmers.
Results from Prior NSF Support
Project ATAC’s team includes Profs. Anant Agarwal (Computer Science and Artificial Intelligence Lab, MIT), Saman Amarasinghe (Computer Science and Artificial Intelligence Lab, MIT), and Lionel Kimerling (Microphotonics Center, MIT).
Anant Agarwal Prof. Agarwal’s research group focuses on computer architecture, compilers, VLSI and applications. We refer to current and previous NSF grants EIA-9810173 and EIA-0071841, “Baring It All to Software: The MIT Raw Machine”, as the Raw project, and MIP-9012773, “Automatic Management of Locality in a Scalable Cache-Coherent Multiprocessor: The MIT Alewife Machine,” as the Alewife project.
Figure 1: Die photo of the Raw processor (left), and photo of the Raw prototype computer system (right).
The Raw project [2,4,5,6,10] designed and implemented [7] the Raw processor (a single-chip tiled multicore), a distributed ILP and stream compiler [9], runtime and operating systems, tools, and other system software. The Raw processor, implemented in the 0.18-micron IBM SA-27E ASIC process, occupies an 18.2x18.2mm die and runs at 420MHz. The Raw chip was operational in October 2002. The prototype system is fully operational and has been used by many researchers from MIT as well as other institutions such as USC/ISI, Lincoln Labs, and Lockheed Martin. We also built a Raw Fabric multiprocessor system consisting of 4 Raw chips (for a total of 64 tiles).
The Raw effort pioneered the tiled architecture concept and conceptualized the notion of a scalar operand network, or SON [8]. It also developed the notion of the on-chip distributed direct cache. The project created a characterization of on-chip networks called Astro [8], which facilitates comparison of different tiled and conventional superscalar processors such as TRIPS, Scale, ILDP, and others.
The effort also led to many fundamental discoveries in compiler techniques for orchestrating ILP and streams, including algorithms for distributed ILP (DILP), control localization, space-time instruction scheduling [9], software-serial ordering, modulo unrolling, and equivalence class unification. The effort also produced early work on transactional software methods such as SUDS (Software Undo System) [12]. For more details, please refer to the Raw publications site at www.cag.csail.mit.edu/raw, or to Agarwal’s retrospective on Raw and tiled processors given as a keynote at the 2007 MICRO-40 conference (the talk is available from the conference web site).
The Raw effort developed several applications for multicores. It also created a new metric for the versatility of processors and distributed an associated benchmark suite called VersaBench (http://cag.csail.mit.edu/versabench/ or google versabench).
The Raw project impacted the community in several ways. First, the project produced several dozen research papers. The project also graduated several postdocs and PhD, MS, and BS students, many of whom went on to become professors at other universities (e.g., Matt Frank at UIUC, Rajeev Barua at the University of Maryland, Andras Moritz at UMass Amherst, and Michael Taylor at UC San Diego).
The tiled multicore technology developed by the Raw project was also commercialized by a venture-funded startup called Tilera. Tilera has made commercially available a 64-tile multicore chip called the Tile Processor. The Tile Processor was also chosen by the NRO for US space-based applications.
The Alewife project [14] was also funded by NSF and was conducted at MIT in the early 1990s. The goal of the Alewife experiment was to discover and evaluate mechanisms for automatic locality management in scalable multiprocessors. The MIT Alewife machine became operational on May 7, 1994. Like the Raw project, Alewife involved a major system-building effort. A 32-node machine was in regular use until 1998. The machines and simulators have also been used in graduate-level courses and in a summer course for industry participants at MIT.
Alewife pioneered the integration of message passing and shared memory into a single coherent interface, and a flexible, software-extended, shared-memory system called LimitLESS directories. Alewife’s Sparcle processor [15] was an early demonstration of a multithreaded microprocessor. It created the concept of coarse-grain multithreading (CGMT). Several Sparcle mechanisms — including trap vector spreading for fast exception handling, rapid context switching, and user-level address space identifiers for fast, user-level messages — influenced SPARC V9.
The Alewife project produced dozens of publications. Alewife and Virtual Wires papers and pictures are available through the web sites www.cag.csail.mit.edu/alewife, www.cag.csail.mit.edu/multiscale, and www.cag.csail.mit.edu/vwires.
Saman Amarasinghe Professor Amarasinghe’s current work on the StreamIt project was supported by an NSF NGS award (0305453) “StreamIt: A Language and a Compiler for Streaming Applications” and by an NSF ITR award (0325297) “A Language, Compilers and Tools for the Streaming Application Domain”.
StreamIt [34,35,36,37,38,39,33,13] aims to ease the burden of programming multicore architectures by developing high-level programming idioms, compiler technologies, and runtime systems. The StreamIt project has two goals: to improve programmer productivity for the streaming class of applications and to obtain high performance, portability, and scalability for StreamIt programs.
Improving Programmer Productivity In StreamIt, the programmer builds an application by connecting components together into a stream graph, where nodes represent filters that carry out the computation, and edges represent FIFO communication channels between filters. As a result, the parallelism and communication topology of the application are exposed, empowering the compiler to perform many stream-aware optimizations that elude other languages.
In StreamIt, all of the processing is encapsulated hierarchically into single-input, single-output streams with well-defined modular interfaces. This facilitates development and boosts programmer productivity, as components can be debugged and verified in isolation.
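As a rough illustration of this structure (written in C rather than StreamIt syntax, with invented names), the sketch below connects three filters through explicit FIFO channels. The point is only that the communication topology is visible to the compiler, which is what enables stream-aware optimization and mapping of filters to cores.

```c
/* Minimal C sketch of the stream-graph idea: filters connected by FIFO
 * channels (not StreamIt syntax; names are illustrative only).             */
#include <stdio.h>

#define CAP 16
typedef struct { int buf[CAP]; int head, tail; } fifo_t;   /* FIFO channel  */

static void push(fifo_t *f, int v)  { f->buf[f->tail++ % CAP] = v; }
static int  pop (fifo_t *f)         { return f->buf[f->head++ % CAP]; }
static int  count(const fifo_t *f)  { return f->tail - f->head; }

/* Each "filter" reads items from its input channel and writes to its output. */
static void source(fifo_t *out, int n)      { for (int i = 0; i < n; i++) push(out, i); }
static void scale (fifo_t *in, fifo_t *out) { while (count(in)) push(out, 2 * pop(in)); }
static void sink  (fifo_t *in)              { while (count(in)) printf("%d\n", pop(in)); }

int main(void) {
    fifo_t a = {0}, b = {0};
    /* Pipeline: source -> scale -> sink. Because the graph and its channels
     * are explicit, a compiler can see the parallelism and map filters to
     * separate cores.                                                       */
    source(&a, 8);
    scale(&a, &b);
    sink(&b);
    return 0;
}
```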
High performance, portability, and scalability StreamIt attempts to expose the properties common to all multicore architectures in the language while abstracting away the differences between architectures. Common properties, such as multiple flows of control and multiple local memories, are exposed directly in the language, making the abstraction boundary between the language and the architecture that the compiler has to bridge as narrow as possible. Thus, unlike existing imperative languages, whose compilers must perform heroic and often impossible analyses, the StreamIt compiler is able to achieve respectable performance with relative ease.
StreamIt also hides processor-specific properties such as the number of cores, the communication primitives and topology, and the computational strength and memory layout of the individual cores. In the StreamIt compiler, we are developing the algorithms needed to take effective advantage of these properties without losing portability or performance.
Lionel Kimerling Prof. Kimerling’s research [16,17,18,19,22,23,24,25,26,27,28,29] has focused on microphotonic integration on the silicon platform for over 20 years. Among the group’s achievements are: 1) a monolithically integrated MOSFET driver and Si:Er LED emitting at 1550nm (operating at room temperature); 2) the first low-loss silicon channel waveguides; 3) the first omnidirectional dielectric stack reflector; 4) the first Ge-on-Si photodetector; 5) the first silicon disk, ring, and racetrack resonators and integrated silicon bus add/drop filters; 6) discovery of a strong Franz-Keldysh effect in Ge and its application to integrated SiGe optical modulators; 7) the smallest waveguide-integrated Ge-on-Si photodetectors, exhibiting 100% quantum efficiency; 8) record low-loss silicon channel waveguides (0.35 dB/cm); 9) demonstration of a full CMOS process flow for a monolithically integrated Si/Ge-on-Si waveguide/modulator/ring resonator/photodetector circuit; 10) the first monolithically integrated silicon optical RF channelizer.
In terms of NSF-sponsored research, Kimerling established and led the Microphotonics IRG in the MIT MRSEC from 1997 to 2002. The team studied the silicon platform for HIC photonic materials and devices. They created the first photonic crystal device to operate at a wavelength of 1550nm: a waveguide-integrated photonic crystal add/drop filter. That structure continues to hold the record for the smallest modal volume of a photonic device.
Strong Atom-Photon Interaction for Microphotonic Devices (2000-2002) We observed the first enhancement of 1550nm emission from Er2O3 in a Si/SiO2 microcavity; we observed the first evidence of THz Rabi splitting from a matched cavity structure of the same materials; and we created continuously tunable (1200-1600nm with a bias of less than 12 V) MEMS microcavity devices using double resonant structures.
Agglomeration of Ultra-thin Silicon-On-Insulator Films: Understanding Dewetting in Crystalline Thin Films (2004-2006) We developed a 5-step surface-energy-driven dewetting model for SOI agglomeration based on capillary film-edge stability and the generalized Rayleigh instability. Our surface-energy-driven model explains all of the key experimental observations in the existing literature as well as our own new experimental results. For the first time, we observed highly anisotropic dewetting behavior that was very sensitive to the edge orientation of a patterned mesa. We also demonstrated the effectiveness of a dielectric edge-coverage technique for stabilizing patterned SOI structures against dewetting.
Introduction
The trend in computer architecture for the foreseeable future is clear: microprocessor designers are using copious silicon resources to integrate more and more processor cores onto a chip. In fact, within the next ten years, general-purpose multicore processors will likely contain 1,000 cores or more. While this path toward ever-increasing parallelism can theoretically deliver massive performance, it is unlikely that application developers will be able to harness this potential unless drastic improvements to programming are made [60]. Current approaches to multicore programming are barely manageable for multicores with two or four cores, and they certainly will not scale to massive amounts of parallelism. A new architectural mechanism, fast on-chip broadcast enabled by novel optical technology, will revolutionize the programmability of future multicore processors.
While parallel programming used to be something of a black art reserved for the handful of rocket scientists who programmed supercomputers and clusters, multicore’s imminent dominance will require most programmers to implement parallel applications. However, by incorporating powerful hardware and architectural mechanisms, such as a fast broadcast, and by empowering programmers with the right interfaces to the underlying architecture via APIs and language facilities, all programmers will be able to efficiently construct programs that exploit multicore’s power.