1005EF3-Nvidia.doc

Keywords: simulation, hardware acceleration, verification

Editorial Feature Tabs: Digital – Hardware Simulation

@head:Build an Optimized Chip-Verification Environment for Hardware Simulation

@deck:At the beginning of the project, construct a system-level testbench that runs in both a software and hardware simulation environment.

@text:Today’s chip designers must overcome the challenge of verifying complex chips. Existing verification environments are being stretched to their limits by current designs, and long simulation times and capacity issues force many projects to use hardware acceleration. Typically, verification frameworks that aren’t designed with hardware acceleration in mind don’t approach the maximum simulation speed of which the hardware accelerator is capable. Instead, they approach zero cycles per second (CPS). In fact, register-transfer-level (RTL) verification environments often cannot be compiled with a cycle-based hardware-accelerator compiler unless the RTL is modified.

To achieve high-performance simulation, coverage tools and assertion checking need to be run at speed while test suites are executing on the hardware accelerator. For example, static-assertion checking can be coded exclusively in Verilog. Temporal-assertion checking may need to be coded in a third-party tool, such as Vera or Verisity, and executed on the host.

Cycle Versus Event Simulation

Verilog compilers effectively convert RTL constructs into cones of logic. Hardware compilers convert those cones into the target hardware’s parallel instruction-set architecture or ISA (i.e., partitioning, placement, routing, and code generation). Or they pass the logic cones to a synthesis tool, which generates bit streams to load into an array of FPGAs. Commercial cycle-based and event-driven hardware accelerators (hardware simulators) have been built using custom ISAs and arrays of FPGAs. The choice of simulation algorithm (cycle-based or event-driven) is therefore independent of how the accelerator hardware is implemented.

A logic function has a set of inputs and outputs. In an event-driven simulator, the entire logic function is re-evaluated each time one of the inputs changes value. Re-evaluation increases the simulation time and reduces the event-driven simulator’s performance. Cycle-based simulators evaluate each logic function once per cycle. Cycle-based simulators effectively use flattened cones of logic. Each cone is evaluated once during a simulation cycle. Event-driven simulators use a timing wheel that increments each time the outputs of the logic cones have been evaluated. Every time a change occurs on the inputs to a logic cone, the output values are recalculated. An event-driven simulator can therefore evaluate the same logic more than once during a simulation cycle.
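The difference in work can be illustrated with a minimal Python sketch. The three-gate netlist, the signal names, and both functions are hypothetical and are not taken from any real simulator. The event-driven model is deliberately simplified: it re-evaluates a gate once for every input event, with no timing wheel or event coalescing.

```python
from collections import deque

# Hypothetical netlist, listed in levelized (inputs-to-outputs) order:
# gate name -> (logic function, names of the signals feeding it).
NETLIST = {
    "n1": (lambda a, b: a & b, ("a", "b")),
    "n2": (lambda b, c: b | c, ("b", "c")),
    "out": (lambda n1, n2: n1 ^ n2, ("n1", "n2")),
}

def cycle_based(inputs):
    """Evaluate every gate exactly once per cycle, in levelized order."""
    values = dict(inputs)
    evals = 0
    for gate, (fn, ins) in NETLIST.items():
        values[gate] = fn(*(values[i] for i in ins))
        evals += 1
    return values, evals

def event_driven(old_values, new_inputs):
    """Re-evaluate a gate every time one of its inputs changes value."""
    values = dict(old_values)
    evals = 0
    events = deque()
    for sig, val in new_inputs.items():
        if values[sig] != val:
            values[sig] = val
            events.append(sig)
    while events:
        sig = events.popleft()
        for gate, (fn, ins) in NETLIST.items():
            if sig in ins:
                new = fn(*(values[i] for i in ins))
                evals += 1
                if new != values[gate]:
                    values[gate] = new
                    events.append(gate)  # propagate the change to the fanout
    return values, evals

if __name__ == "__main__":
    settled = {"a": 0, "b": 0, "c": 0, "n1": 0, "n2": 0, "out": 0}
    stimulus = {"a": 1, "b": 1, "c": 0}
    print("cycle-based evaluations:", cycle_based(stimulus)[1])
    print("event-driven evaluations:", event_driven(settled, stimulus)[1])
```

Both functions settle on the same signal values for this stimulus, but the event-driven version evaluates some gates more than once because several of their inputs change within the same cycle.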

Cycle-based simulators have a higher CPS than event-driven simulators because they do less work. Verilog must be coded using a restricted subset in order to compile and execute correctly using a cycle-based simulator. Lint tools can enforce the subset of Verilog code that’s used for cycle-based simulation. Examples of code that won’t compile on a cycle-based compiler include combinational loops and clock logic with glitches. Cycle-based simulators need to perform clock-tree and sequential-element analysis to determine whether the design can be compiled for cycle-based simulation.

Let’s consider the different types of testbench reference models. One typical approach is to use a C++ behavioral simulator to generate vectors which can be used to stimulate inputs and check the outputs of the DUT in the testbench. The host and the hardware accelerator communicate over the simulation backplane (see Figure 1). The testbench is not executing during most types of communication.

If the project wants to use a high-level-language reference model for part of its verification flow, a testbench can be constructed that can select the architectural state or transactions (see Figure 2) from a source internal or external to the testbench. There are two alternatives offered here:

  • A C/C++ reference model connected to PLI in the testbench with a protocol converter via a socket
  • An RTL reference model instantiated in the testbench (see Figure 3)

Testbenches And Abstract Models

Testbenches can use abstract models when they’re constructing a verification environment to verify a DUT. The abstract models can be written in C/C++ or Verilog. C/C++ abstract models run on the host and communicate with the testbench on the hardware accelerator via a socket. Verilog models can run on the hardware accelerator. Some examples of components that can be replaced by abstract models include CPU or GPU, sparse-memory, and bus-functional models.

What if there are C/C++ abstract models? The testbench that’s running on the hardware accelerator needs to communicate over the simulation backplane with the C/C++ abstract models running on the host. This task reduces the simulation performance. As a result, it is better to use synthesizable Verilog models in the testbench when using the hardware accelerator.

What happens when the C/C++ or Vera verification components used in the high-level reference model (HLRM) slow down a cycle-based simulation? The verification environment will simulate at a slower rate if the testbench makes PLI calls to the host during simulation. PLI call examples include the following:

  • Communication with reference models (e.g., C/C++ simulators)
  • Communication with behavioral models (e.g., C/C++ sparse-memory models)
  • Updating implementation state shadowed in C/C++ high-level reference model
  • Synchronizing the events between the HLRM and Verilog simulation on the hardware accelerator
  • Communicating with state/ordering checkers
  • Communicating with coverage tools
  • High-speed clocks generated in a testbench that’s running outside of the hardware accelerator

To address these cycle-based simulation testbench incompatibilities, one solution is to use an ideal testbench and reference-model design partitioning. Hardware accelerators run the fastest when the simulation is wholly contained in the hardware accelerator and there are no simulation-backplane events (i.e., communication between the host and the hardware accelerator). Eliminating such communication requires that the testbench include all reference model(s), shadow state, simulation control, and checkers. The verification testbench that executes at the highest frequency on a hardware accelerator is one that doesn’t use any PLI calls. To maximize the simulation frequency, the testbench that’s loaded into the hardware accelerator must contain all of the verification-environment components and DUT drivers.

System-Level Simulation

Several system-level issues affect simulation frequencies. The first of these challenges is the lack of capacity on the hardware accelerator. Even with a relatively fast simulator, it’s difficult for most projects to achieve adequate system-level simulation. Due to low simulation cycles per second, it may not be possible to thoroughly test at the system level in a software or hardware simulator. Because of a lack of capacity on a hardware accelerator, it also may not be possible to load the binary executable into the hardware accelerator. The hardware accelerator’s compiler may not be able to compile the system-level testbench to a hardware-accelerator executable.

Another system-level challenge occurs during compilation to a hardware-accelerator target. It’s feasible to support system-level simulation on hardware accelerators during the pre-tape-out timeframe. To do so, however, the DUT and testbench must be compiled by the hardware-accelerator compiler. They also must be able to execute correctly on the hardware accelerator. These goals are possible if the DUT, testbench, and verification environment are designed and implemented from the beginning of the project to compile on a hardware accelerator. If no previous effort was made to compile and simulate using a hardware accelerator, it’s not advisable to compile designs using the hardware-accelerator compiler at the end of the project. Such an effort will probably result in code branching and less than compatible simulation and tape-out code branches.

Sequential versus parallel-processing issues also can stymie the test engineer. A simulation that is running on a hardware accelerator and communicates with an HLRM running on a host becomes a sequential process. After all, the simulation on the hardware accelerator blocks when it makes a PLI call to the host. A simulation that runs only in the hardware accelerator and doesn’t make any external calls to the host is a parallel process.

Sometimes, RTL coding styles aren’t compatible with cycle-based simulators. RTL simulators can be event-driven or cycle-based. Event-driven simulators typically require more real compute cycles than a cycle-based simulator to complete the same simulation. Cycle-based simulation reduces the number of compute cycles and the simulation time and increases the cycles per second. It is therefore preferable to use RTL that can be compiled by a cycle-based compiler, as long as the cycle-based coding style doesn’t impact the project’s schedule. Some projects have found that a cycle-based coding style produces code that has fewer synthesis bugs related to glitch-free clock trees and sequential elements. Higher-quality code is produced in the coding/testing stage, which results in shorter development times.

RTL code that is incompatible with the cycle-based compiler will not compile on a hardware accelerator’s cycle-based compiler. As a result, it cannot be simulated on a cycle-based hardware accelerator.

Chip-design projects require the interoperability of certain types of EDA tools and the verification environment (e.g., coverage tools). Verification environments that are written in Verilog for hardware acceleration will require additional features that can be enabled for interoperability with these EDA tools during software simulation. This effort will result in lower hardware-acceleration CPS unless the EDA tools, such as coverage tools, are executed in a hardware accelerator.

Hardware-Simulation Performance

The simulation CPS can be improved by making changes to the verification environment that will reduce the simulation-backplane events. These changes include:

  • Remove PLI calls from the testbench when possible.
  • Use vectors that can be compressed and paged in and out of the hardware accelerator.
  • Replace C/C++ abstract models with Verilog models that can be executed in the hardware accelerator.

There are different ways to run the testbenches with either C/C++ abstract reference models or the reference model as outlined previously. The debug functionality isn’t a function of the programming languages that are used in the verification environment. Rather, it is a function of the verification-environment architecture.

Testbenches use stimulus and expected-result vectors, which are generated by C/C++ simulators, to verify the DUT. During a project cycle, the same test is rerun on the same C/C++ simulator. On each run, the same vectors are regenerated and then communicated via PLI to the testbench (or from the testbench to the C/C++ simulator) for comparison. This approach is an unnecessary waste of compute cycles. Instead, the C/C++ vectors can be saved in a compressed format. The compressed vectors can then be loaded into the hardware accelerator. These vectors can be uncompressed and read into the testbench, reducing the amount of communication between the host and the hardware accelerator.

This approach will reduce the amount of absolute CPU cycles by eliminating the C/C++ simulation cycles while the hardware accelerator is executing the Verilog simulation. The C/C++ simulator only reruns and regenerates the test vectors when one or more of the following dependencies change: the C/C++ simulator, the verification environment, or the test(s). These dependencies can be calculated. The vectors can be regenerated when any of the dependencies are out of date with respect to the target vectors.
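The dependency check can be sketched as a make-style timestamp comparison. The following Python sketch is only an illustration under stated assumptions: the function names, the gzip file format, and the one-vector-per-line layout are hypothetical, and `regenerate` stands in for whatever reruns the C/C++ simulator.

```python
import gzip
import os

def vectors_stale(vector_path, dependency_paths):
    """True if the vector file is missing or older than any dependency."""
    if not os.path.exists(vector_path):
        return True
    built = os.path.getmtime(vector_path)
    return any(os.path.getmtime(dep) > built for dep in dependency_paths)

def save_vectors(vector_path, lines):
    """Store the vectors compressed, one vector per text line (assumed format)."""
    with gzip.open(vector_path, "wt", encoding="ascii") as f:
        f.write("\n".join(lines) + "\n")

def load_vectors(vector_path):
    """Uncompress the stored vectors and return them as a list of lines."""
    with gzip.open(vector_path, "rt", encoding="ascii") as f:
        return f.read().splitlines()

def get_vectors(vector_path, dependency_paths, regenerate):
    """Rerun the (hypothetical) reference simulator only when vectors are stale."""
    if vectors_stale(vector_path, dependency_paths):
        save_vectors(vector_path, regenerate())
    return load_vectors(vector_path)
```

Here `dependency_paths` would list the C/C++ simulator binary, the verification-environment sources, and the test itself, which are the three dependencies named above. A build tool such as make could perform the same check; the point is only that regeneration is skipped when nothing has changed.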

To quantify the benefit, both software-simulator and hardware-simulator performance must be measured in CPS. A verification-environment solution running exclusively on a hardware accelerator should offer a speedup on the order of 100X to 300X over a PLI-based verification solution running on a single-processor hardware-simulation environment. The measurement will be based on the performance differences between the software-simulator CPS running on the host and the hardware-accelerator-simulation CPS. This estimate is based on our experience with simulation on a hardware accelerator using testbenches with and without PLI calls.

Amdahl’s Law

In a typical hardware-acceleration environment, the testbench runs on a host computer. That computer “drives” the simulation of the design under test on a hardware accelerator. The process executing on the host computer is considered a sequential process (SP). The design being simulated on the hardware accelerator is considered a parallel process (PP). Consider “Amdahl’s Law for Parallel speedup” in this context:

Speedup = 1 / (Shost + (1 - Shost) / P)    (1)

Where:

  • Shost is the fraction of the total simulation runtime executed on the host’s processor(s), which are external to the hardware accelerator (i.e., testbench-runtime overhead)
  • (1 - Shost) is the fraction of the runtime needed to simulate the design in the hardware accelerator
  • P = PPFreq/SPFreq is the ratio of the hardware-accelerator simulation frequency (parallel-process frequency, PPFreq) to the software-simulator simulation frequency (sequential-process frequency, SPFreq). Assume P is large and approximates the simulation frequency of the parallel process.

Low-overhead testbenches are defined as having a simulation-backplane/host-processing overhead ranging from 0.025 to 0.03 when running on a hardware accelerator. Simulation-backplane/host-processing overhead in that range results in a speedup of 33X to 40X over a software simulator. An ATPG testbench is an example of a testbench that has low overhead compared to other testbenches.
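The quoted range can be checked with a short Python sketch of the standard form of Amdahl’s Law, Speedup = 1 / (Shost + (1 - Shost) / P). The function name and the value P = 10,000 are assumptions chosen for illustration; any large P gives nearly the same result, because the host fraction dominates.

```python
def amdahl_speedup(s_host, p):
    """Amdahl's Law: s_host is the sequential (host) fraction of runtime,
    p is the accelerator-to-software frequency ratio."""
    if not 0.0 <= s_host <= 1.0:
        raise ValueError("s_host must be a fraction between 0 and 1")
    if p <= 0:
        raise ValueError("p must be positive")
    return 1.0 / (s_host + (1.0 - s_host) / p)

if __name__ == "__main__":
    P = 10_000  # assumed ratio, for illustration only
    for s_host in (0.03, 0.025):
        print(f"Shost = {s_host}: speedup = {amdahl_speedup(s_host, P):.1f}X")
```

With these inputs, the computed speedups are roughly 33X and 40X, which matches the low-overhead range above. In the limit of very large P, the speedup approaches 1/Shost.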

Synthesizable testbenches can be compiled by the hardware compiler without any changes to the testbench code. In this case, “simulation-backplane” communication overhead is minimized. The hardware accelerator won’t be interrupted by simulation-backplane events. It can therefore run at its maximum speed.

Amdahl’s Law of Parallel Speedup can be transformed to one that measures hardware-accelerator speedup as a function of simulation-backplane events:

Shost = 1 - Shwa    (2)

Speedup = 1 / ((1 - Shwa) + Shwa / P)    (3)

where Shwa is the fraction of the simulation runtime spent in the accelerator and P is the ratio of hardware-accelerator simulation frequency to software-simulator simulation frequency (P = PPFreq/SPFreq). If the accelerator is 10,000X faster than the software simulation (P = 10,000) but the testbench on the hardware accelerator interacts with the host after every simulation cycle, Shost will approach 1 and Shwa will approach 0. This effectively reduces the accelerator’s CPS to the same order of magnitude as the software-simulation CPS. Even if Shwa = 0.8 with P = 10,000 (an 80% simulation utilization rate, with 20% overhead for accelerator and host interaction and processing), the speedup is a mere 5X.

Speedup = 1 / ((1 - 0.8) + 0.8 / 10,000) ≈ 5    (4)

Making effective use of a hardware accelerator requires that interactions between the host and the hardware accelerator be either “free” (zero latency) or virtually eliminated. Note that it is latency and not throughput that determines the bottleneck in the network. Latency is a limiting factor on the system performance (see Figure 4).
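One way to see why latency dominates is a simple cost model, sketched below in Python. This model is an assumption made for illustration and is not taken from any accelerator’s documentation: each simulated cycle costs 1/accel_cps seconds in the accelerator, and each blocking host interaction adds one full round-trip latency. Link bandwidth does not appear in the model at all.

```python
def effective_cps(accel_cps, host_calls_per_cycle, round_trip_s):
    """Effective cycles per second when the accelerator blocks on host calls.

    accel_cps            -- accelerator speed with no host interaction
    host_calls_per_cycle -- average blocking host interactions per cycle
    round_trip_s         -- latency of one host round trip, in seconds
    """
    seconds_per_cycle = 1.0 / accel_cps + host_calls_per_cycle * round_trip_s
    return 1.0 / seconds_per_cycle

if __name__ == "__main__":
    ACCEL_CPS = 1e6      # assumed accelerator speed
    ROUND_TRIP = 100e-6  # assumed 100-microsecond host round trip
    for calls in (1.0, 0.01, 0.0):
        cps = effective_cps(ACCEL_CPS, calls, ROUND_TRIP)
        print(f"{calls} host calls per cycle: {cps:,.0f} CPS")
```

Under these assumed numbers, a host call on every cycle holds the simulation to about one hundredth of the accelerator’s native speed, no matter how much data each call could carry.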

Phases Of Verification

There are different phases of verification in a project. A testbench should be flexible so that it can be used in the different verification phases. The features required for the current phase should be selectable through a command-line argument or other runtime configuration mechanism.
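As an illustration, a host-side launch script might select the testbench features for each phase as in the following Python sketch. The phase names, feature flags, and default settings are hypothetical. They loosely mirror the three phases described in this section and are not taken from any particular tool.

```python
import argparse

# Hypothetical per-phase defaults; a real project would choose its own.
PHASE_DEFAULTS = {
    "early": {"dump_waves": True, "coverage": False, "target": "software"},
    "regression": {"dump_waves": False, "coverage": True, "target": "software"},
    "final": {"dump_waves": False, "coverage": False, "target": "accelerator"},
}

def parse_config(argv=None):
    """Build the testbench configuration from command-line arguments."""
    parser = argparse.ArgumentParser(description="Select testbench features")
    parser.add_argument("--phase", choices=sorted(PHASE_DEFAULTS),
                        default="early", help="verification phase")
    parser.add_argument("--target", choices=("software", "accelerator"),
                        default=None, help="override the phase's default target")
    args = parser.parse_args(argv)

    config = dict(PHASE_DEFAULTS[args.phase])  # copy so defaults stay intact
    if args.target is not None:
        config["target"] = args.target
    config["phase"] = args.phase
    return config

if __name__ == "__main__":
    print(parse_config())
```

For example, passing `--phase regression --target accelerator` would keep coverage enabled while moving the same testbench onto the hardware accelerator.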

In the early phase, projects are limited by engineering resource constraints rather than hardware-performance limits. Designers have to code and debug the blocks, interfaces, and interaction with other blocks using a Verilog compiler, simulator and waveform viewer. They need to dump out waveform files and step through the design.

During the intermediate verification phase, regression suites are tested on large simulation ranches. Coverage tools are run to measure the test-suite coverage of the DUT. The testbench should run as fast as possible to minimize the cost of running the regressions on the simulation ranch or hardware accelerator.

Finally, the DUT undergoes final verification prior to tape-out. Large simulations, such as booting operating systems on a system-level testbench, are used to exercise the DUT (e.g., a new CPU). They exercise it to find corner cases that weren’t revealed using the normal regression suite. Moreover, simulations that take a long time to execute should be run on hardware accelerators. The same environment that was used for normal regression runs should be used on the hardware accelerator. In a project’s later stages, developing new testbenches normally isn’t an option due to schedule and resource constraints.

A verification environment can be designed to be reusable if it contains components that are parameterized. The idea behind a reusable testbench follows: If there are abstraction layers for interfaces and there are parameterized components like memories in the testbench, a new project would be able to leverage the testbench by plugging in the DUT. Tests could then be run right away.

Reconfigurable verification environments reduce the amount of time and effort required to bring up a new verification environment. Huge cost savings result from reusing verification-environment components. In addition, the verification environments that utilize reusable verification components are up and running sooner and with fewer bugs. Software-component maintenance costs go down over time as more and more bugs are found and fixed. A cost reduction also is associated with reusing software components, such as verification environments and test suites. Incremental bring-up costs on a new project are purely a function of the porting/configuration cost along with any new features that are implemented.

It’s possible to build a testbench as described in this article. Experience tells us that removing PLI calls from the testbench results in cycles-per-second improvements on the order of 100X to 1000X. The amount of improvement depends on the hardware accelerator being used for simulation.