Guidelines for DSP Development
by Robert Oshana
Developing DSP applications involves a number of challenges, ranging from choosing an architecture to selecting tools that will meet your needs. Early analysis is key to meeting these challenges, as this article explains.
Digital signal processing is the method of processing signals and data to enhance or modify those signals, or to analyze them to determine specific information content. A typical DSP system, shown in Figure 1, consists of a processor and other hardware used to convert outside analog signals to digital form and possibly back to analog (continuous) form. There is nothing mystical about DSPs. Think of them as application-specific microprocessors, built to handle digital signal algorithms that, because of their complexity, aren't usually run on general-purpose processors. The software development issues associated with these devices, however, are similar to those of other general-purpose processors. DSPs are becoming common in all product areas, including military and aerospace systems, embedded applications, and PC-based systems.
Figure 1: A typical DSP system
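To make the processing model concrete, the following minimal C sketch shows the classic sample-by-sample loop at the heart of many DSP systems. The read_adc(), write_dac(), and process_sample() routines are hypothetical placeholders for the converter drivers and the application's algorithm; real systems often work on blocks of samples delivered by DMA, as discussed later.

    #include <stdint.h>

    int16_t read_adc(void);          /* hypothetical: read one sample from the A/D converter */
    void    write_dac(int16_t x);    /* hypothetical: write one sample to the D/A converter  */

    static int16_t process_sample(int16_t x)
    {
        /* placeholder for the application's algorithm (filter, gain control, and so on) */
        return x;
    }

    void dsp_main_loop(void)
    {
        for (;;) {
            int16_t in  = read_adc();           /* analog signal, converted to digital form */
            int16_t out = process_sample(in);   /* the digital signal processing            */
            write_dac(out);                     /* back to analog (continuous) form         */
        }
    }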
Is your application right for a DSP?
Digital signal processors were developed to solve a particular class of problems. Digital signal processing represents signals as ordered sequences of numbers and provides techniques for processing those sequences. Some important reasons for signal processing include elimination or reduction of unwanted interference, estimation of signal characteristics, and transformation of signals to produce more useful information.1 Some common applications of DSP include:
• Radar and sonar
• Communications systems
• Process control
• Image processing
• Audio applications
Each DSP application is different. One of a system designer’s first tasks is to determine how much processor is required to perform the job. Figure 2 shows typical performance ranges for some DSP applications. Simple control-based applications don’t need high-performance DSPs, whereas higher-performance ranges are required for applications such as radar and sonar.
Choosing a processor
Digital signal processors are in many ways similar to general-purpose processors. One of the main differences between the two is that a DSP is typically optimized to perform certain signal-processing functions, such as filtering and fast Fourier transforms (FFTs), very quickly. Many steps in these functions execute as single-cycle instructions on a digital signal processor. DSPs are also designed with scalability in mind. Many real-time signal-processing functions decompose their processing across many processors to gain a significant improvement in processing time. Decomposition can be done either temporally or spatially, as shown in Figure 3. Processors that host these functions need fast inter-processor communication, which a DSP usually provides through high-speed I/O ports and shared memory.
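As an illustration, the inner loop of a finite impulse response (FIR) filter is dominated by multiply-accumulate operations; on most DSPs each iteration of a loop like the generic C sketch below maps onto a single-cycle MAC instruction, which is exactly the kind of work the architecture is built for (fir_sample() and its arguments are illustrative, not tied to any particular device or library).

    /* FIR filter kernel: y[n] = sum over k of coeff[k] * x[n - k], a MAC-dominated loop */
    float fir_sample(const float *coeff, const float *history, int num_taps)
    {
        float acc = 0.0f;
        for (int k = 0; k < num_taps; k++) {
            acc += coeff[k] * history[k];   /* one multiply-accumulate per filter tap */
        }
        return acc;
    }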
A growing number of alternatives to a general-purpose DSP are available to system designers. Options include:
• Application-specific integrated circuits (ASICs); these devices can be used as DSP co-processors, but they are not very flexible for general-purpose signal processing.
• Reduced instruction set computing (RISC) processors; the extremely fast clock speeds for these devices allow them to perform well in certain DSP applications. Scalability and other real-time (i.e., predictability) issues still remain for these devices. DSPs are made to handle real-time deterministic applications.
• Field programmable gate arrays (FPGAs); these devices are fast and can perform certain DSP functions very quickly. But they are also more difficult to develop for than a DSP, on which a simple program can implement the same function.
• Host signal processing; this approach is becoming more popular in the DSP arena. Host signal processing generally refers to executing DSP algorithms on a PC (referred to as the host). Many lower-end multimedia applications are handled this way, but PC-based DSP still lags behind a true DSP solution for high-performance applications.
The complexity of today’s signal-processing applications and the need to upgrade often make a programmable device such as a DSP a more attractive alternative than a customized hardware solution.
You should consider several factors when making a processor selection. Some of them are: 2
• Cost
• Scalability
• Programming requirements
• Algorithm complexity and type
• Tools support
• Time to market
• Performance
• Power consumption
• Memory usage
• I/O requirements
Until recently, the application dictated which processor to choose. Complex signal-processing algorithms required DSPs because their built-in signal-processing architecture made performance for these classes of algorithms much better. Control-based or finite-state-machine software applications called for a general-purpose processor. But now the speed of general-purpose processors has increased to the point where many signal-processing applications previously unable to run on a general-purpose processor can now execute with excellent performance. In addition, general-purpose processor manufacturers have been adding reduced instruction set computing (RISC) instructions and capabilities to their processors to perform specialized DSP operations. DSP manufacturers, likewise, are adding complex instruction set computing (CISC) instructions to their processors to provide full system solutions (control software as well as signal-processing software) on a single chip. Some recent studies show that general-purpose processors perform better than their DSP counterparts on some algorithm benchmarks.3 However, a general-purpose processor usually takes many more instruction cycles to implement a signal-processing algorithm than a DSP does.
One thing to keep in mind is the programming complexity of the processor you choose. An advanced superscalar or very long instruction word (VLIW) processor can be much harder to program at the assembly level than a single pipelined processor. Chip designers and vendors are beginning to provide more sophisticated development tools to alleviate some of these problems. However, if performance and throughput are important factors in the application and assembly language is the programming choice, development time could go up significantly, depending on the processor architecture (see Table 1).
Another factor in processor choice is the memory an application requires. The memory required to run an application is typically higher on RISC processors because it takes more RISC instructions to execute a particular algorithm than with a conventional CISC processor. In some cases the increase can be dramatic (Hakkarainen, 1997). Research, benchmarking, and prototyping should all be used to determine the memory requirements for a particular application. The possibility of trading performance for memory always exists, and memory can be optimized at the cost of application performance. Because DSP algorithms consist of tight loops of operations performed over many data points, optimizing just these small kernels yields substantial performance improvements, as the sketch below illustrates.
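As a simple illustration of the memory-versus-performance trade, the unrolled version of the dot-product kernel below exchanges a larger code footprint for less loop overhead per data point. Whether the exchange pays off depends on the processor and should be confirmed by benchmarking; the function names and the Q15 data format are illustrative assumptions.

    /* Straightforward kernel: smallest code size, one loop test per data point */
    long dot_q15(const short *a, const short *b, int n)
    {
        long acc = 0;
        for (int i = 0; i < n; i++)
            acc += (long)a[i] * b[i];
        return acc;
    }

    /* Unrolled by four: more program memory, less loop overhead per data point */
    long dot_q15_unrolled(const short *a, const short *b, int n)   /* assumes n is a multiple of 4 */
    {
        long acc = 0;
        for (int i = 0; i < n; i += 4) {
            acc += (long)a[i]     * b[i];
            acc += (long)a[i + 1] * b[i + 1];
            acc += (long)a[i + 2] * b[i + 2];
            acc += (long)a[i + 3] * b[i + 3];
        }
        return acc;
    }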
How much is enough?
Many processor manufacturers advertise the speed of their processors in terms of how many operations or instructions can be performed per unit of time (usually a second). Although these figures may be true for an ideal case, actual performance is often much lower. What matters more is how fast your application and algorithms will run on the device within the rest of the system. If you decide to use advertised benchmarks, try to choose benchmarks that are similar to the algorithms you'll be using in your design. This strategy makes a big difference when trying to determine how much processor you need, and is especially helpful when using some of the higher-performance DSPs with optimizing compilers. Subtle differences in algorithm structure can determine whether the compiler optimizes a particular piece of code, and the resulting performance measurements can differ by orders of magnitude.
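One practical way to ground the decision is to time a representative kernel of your own on candidate hardware or on a cycle-accurate simulator rather than relying on advertised figures. The sketch below assumes a hypothetical read_cycle_counter() hook standing in for whatever cycle counter the target processor or its tool chain exposes; kernel_under_test() is the algorithm you care about.

    extern unsigned long read_cycle_counter(void);   /* hypothetical: target's cycle counter  */
    extern void kernel_under_test(void);             /* the representative algorithm to time  */

    unsigned long benchmark_kernel(void)
    {
        unsigned long start = read_cycle_counter();
        kernel_under_test();
        unsigned long stop = read_cycle_counter();
        return stop - start;   /* cycles spent in the kernel (counter overhead not removed) */
    }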
DSPs come in many varieties—some that are good at computing FFTs, others that are good at I/O, and so forth. Determine the most critical aspects of your design and match them to a processor, if possible. Try to match the following DSP features to your application:
• CPU—CPUs can be fixed point or floating point (floating point suits more scientific applications), may be optimized for FFT computation, and may have a variety of other features
• Direct memory access—DMA is used in applications demanding high data rates and I/O throughput. A DSP designed for high-rate data transfer will have one or more DMA controllers that can transfer data without the intervention of the CPU (a double-buffering sketch follows this list)
• Memory access—The basic types of processor architecture are von Neumann and Harvard. Von Neumann architectures are the traditional design, using one interface for both data and program space. The Harvard architecture, more common in DSP processors, uses two or more memory buses to allow simultaneous access to multiple banks of memory for fast data access. This architecture allows certain DSP operations, such as a multiply-accumulate (MAC), to complete in a single cycle
• On-chip memory—Internal memory is a valuable resource for DSPs. This memory is used to store intermediate variables and is much faster to access than external memory. Effective management and use of on-chip memory may result in significant performance improvements
• I/O port—DSPs designed to support high data throughput have one or more communication ports to allow fast transfer of data in and out of the processor. I/O ports are generally controlled by an associated DMA controller, allowing data to be streamed in and out of the processor while the CPU is busy crunching data. In real-time military applications, support for multiprocessing configurations and high-bandwidth I/O is important.
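The classic use of a DMA controller, mentioned in the list above, is double buffering: the DMA fills one buffer with incoming samples while the CPU processes the other, and the two then swap. The sketch below is generic; dma_start_receive(), dma_wait_complete(), and process_block() are hypothetical stand-ins for whatever interface the target's DMA controller and application provide.

    #define BLOCK 256
    static short buf_a[BLOCK], buf_b[BLOCK];

    extern void dma_start_receive(short *dst, int n);   /* hypothetical: begin a DMA transfer   */
    extern void dma_wait_complete(void);                /* hypothetical: block until it's done  */
    extern void process_block(const short *in, int n);  /* the CPU-side signal processing       */

    void double_buffer_loop(void)
    {
        short *working = buf_a, *filling = buf_b;

        dma_start_receive(filling, BLOCK);
        for (;;) {
            dma_wait_complete();                 /* the filling buffer now holds fresh samples */
            short *tmp = working;                /* swap the roles of the two buffers          */
            working = filling;
            filling = tmp;
            dma_start_receive(filling, BLOCK);   /* DMA refills one buffer while...            */
            process_block(working, BLOCK);       /* ...the CPU crunches the other              */
        }
    }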
Real-time operating systems
One of the key elements driving DSP solutions to higher and higher levels of performance has been the evolution of real-time operating systems (RTOSes). In fact, some would argue that operating systems have evolved to the point where developing code for multiprocessor DSP applications is a trivial extension of programming a single processor. Purchasing a commercial off-the-shelf (COTS) RTOS instead of developing an OS in-house is now becoming advantageous, and RTOSes are now being built specifically for DSPs. The main features of these operating systems include the following (a small tasking sketch follows the list):
• Preemptive priority-based real-time multitasking
• Deterministic critical times
• Timeout parameters on blocking primitives
• Memory management
• Synchronization mechanisms
• Inter-process communication mechanisms
• Special memory allocation for DSPs (on-chip)
• Low interrupt latency
• Asynchronous, device-independent, low-overhead I/O
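As a small illustration of preemptive, priority-based multitasking, the sketch below creates a high-priority processing task using POSIX calls. A DSP RTOS exposes the same concepts (task priorities, preemption, blocking with timeouts) through its own API, so treat this as a generic example rather than any particular RTOS's interface; the priority value is arbitrary and system-specific.

    #include <pthread.h>
    #include <sched.h>

    static void *signal_task(void *arg)
    {
        (void)arg;
        for (;;) {
            /* block on a semaphore or I/O completion, then process one block of samples */
        }
        return 0;
    }

    int start_signal_task(void)
    {
        pthread_t tid;
        pthread_attr_t attr;
        struct sched_param prio = { .sched_priority = 50 };   /* system-specific priority level */

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);       /* preemptive, priority-based     */
        pthread_attr_setschedparam(&attr, &prio);
        return pthread_create(&tid, &attr, signal_task, 0);
    }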
Tools
DSP processors, like many general-purpose processors, come with a standard set of tools provided by the chip manufacturer. Third-party vendors supply enhanced tool suites, which are generally the standard tool suite with an interactive GUI wrapped around it. Other tools are also useful for developing DSP-based systems, including simulators and emulators.
Simulators
Software simulators are available for many common DSPs. These tools let the engineer begin development and integration of software without the DSP and associated hardware being available. Simulators are more common in DSP applications because the algorithms that typically run on a DSP are complex and mathematically oriented. This complexity leads to many opportunities to make errors in design and implementation. Simulators allow the engineer to examine the device operation easily and without having to buy the device ahead of time.
Software simulators for DSPs generally consist of a high-level language debugger and the actual DSP simulation engine, which is a software model of the DSP device. Simulators are useful in the early phases of software development. Because the DSP device is implemented entirely in software, these tools are relatively slow, so you would not want to simulate a large application in its entirety. But for prototyping and proof of concept, simulators are helpful. The typical instruction-level simulator is used for high-level functional verification. These simulators, although relatively fast in terms of execution rate, should not be used for performance analysis. These tools provide the following capabilities to the software designer:
• Analysis of software functionality
• Code tracing capability
• Analysis and porting of operating systems
Another tool that is useful to the DSP developer is the cycle-accurate simulator, such as a VHDL simulator. These tools aren't always available to the designer. Whereas most simulators are instruction-accurate implementations of the device, a VHDL-level simulation is usually a cycle-accurate implementation of the device and possibly some of its peripherals. These tools model delays in memory accesses, pipeline stalls, and all the other hardware-related behavior that instruction-level simulators ignore. Therefore, to obtain accurate execution estimates, a cycle-accurate simulator is the preferred tool; it allows precise measurement of timing and system behavior. Although slower because of the processing required to simulate every cycle of the processor, cycle-accurate simulators offer a couple of big advantages:
• Modeling of all aspects of the target processor (pipeline, cache, memory access, and so on)
• Capability to attach external peripherals to the simulator
In the past, in-circuit emulators were the only tools available for assessing system performance, and they required the design to be committed to silicon. With the power of today's PCs and low-end workstations, simulation is now possible at a relatively low price, and much of the functionality can be simulated and verified before committing to the design. Because simulators are becoming more accurate, many programmers today use simulation to verify much of their design.
Most of today’s simulators allow simulation of entire systems. This includes but is not limited to:
• The processor
• On-chip peripherals
• System-level peripherals
• Other peripheral hardware devices
• The operating system
• Application software
Simulation can be done at various levels of abstraction, depending on the phase of a program. An accuracy vs. performance trade-off exists when modeling at different levels of abstraction. Regardless of which level of abstraction you use, keep in mind that simulators cannot model everything and should not be a replacement for running on the real hardware.
Emulators
Another useful tool for DSP developers is the emulator. The purpose of an emulator is to provide the engineer access to the DSP(s) and peripherals in a non-intrusive way, to aid in debugging operations and hardware/software integration. Emulators allow engineers easy access to hardware registers and memory, allowing reading and writing to these locations. Emulators also support other common functions, such as breakpoints, single stepping, and benchmarking. Most emulators are non-intrusive, both spatially and temporally. “Spatially non-intrusive” means the emulator doesn’t require any additional hardware or software in the target environment. “Temporally non-intrusive” means the emulator doesn’t prevent the processor or system from executing at its full speed. These two requirements are important when performing hardware/software integration.
Because of the shrinking transistor sizes in DSP processors (as well as other chips), manufacturers are now starting to put emulation logic in the chip itself. A common chip/emulator interconnect standard in use today is the Joint Test Action Group (JTAG) interface. This interface provides the ability to perform board-level testing and requires some on-chip logic to implement (Figure 4).
Parallel-processing applications can also be supported using emulation tools; this is one area where DSP emulation tools provide a big advantage over general-purpose processors. In parallel-processing systems, the scan interconnection is daisy-chained among the various processors, and an emulator controls each of the DSPs. Under the host's multitasking operating system, each DSP is controlled in a separate window, and the number of windows can become a little cumbersome as the number of devices being emulated grows (Figure 5).
There are many documented cases of software developers screaming that the hardware is broken because their software, which ran fine on the simulator, isn’t working on the emulator. In many cases, it was actually the software that was broken. Emulators catch many timing-related problems with software that can’t be found on a simulator. This is because simulators are only instruction-level accurate—running on the real hardware is a completely different story. The development environment for DSP applications can either be a PC or a workstation. Tools exist for either of these platforms.
Programming issues
Historically, DSP programming has been performed at the assembly language level in order to achieve the desired level of performance. DSP developers should avoid programming in assembly language for several reasons: it isn’t very portable; it increases time to market; it’s harder to maintain; and it’s harder to write.
In some industries, there are even requirements that limit assembly language to a certain percentage of the total code. These restrictions exist mainly for maintainability and life-cycle cost reasons. DSP manufacturers are beginning to recognize the limitations in this area and are designing and bringing to market more sophisticated tools for their DSPs, as well as supporting more portable languages; C/C++ compilers for DSPs, as well as Java environments, are becoming more common. Parallel devices and VLIW architectures make the job of efficient assembly programming an order of magnitude harder. Manually pipelining and optimizing an algorithm on a VLIW device without a cycle-accurate simulator is almost impossible and requires a lot of trial and error.
To alleviate these problems, tools are being developed to shield the developer from having to worry about the explicit parallelism of the chip architecture. For example, an assembly language optimizer developed for the TMS320C6xx VLIW device lets a programmer write a serial assembly language implementation and have the tool parallelize it to run efficiently on the device. The programmer doesn't need to allocate specific registers (a virtual register set is used). Although this tool eliminates much of the complexity of assembly language programming for these devices, it still can't totally replace manual programming.
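To make the contrast concrete, a kernel like the block FIR below can be left in portable C and handed to the compiler, which then takes care of software pipelining, register allocation, and functional-unit assignment on a VLIW target. This is a generic sketch, not device-specific code; the restrict qualifiers are standard C99 hints telling the compiler that the buffers do not overlap, which gives it more freedom to schedule the inner loop.

    /* Block FIR written in portable C; the compiler, not the programmer, schedules the
       inner MAC loop on a VLIW device. x must hold n + taps - 1 input samples. */
    void fir_filter_block(const float * restrict x, const float * restrict h,
                          float * restrict y, int n, int taps)
    {
        for (int i = 0; i < n; i++) {
            float acc = 0.0f;
            for (int k = 0; k < taps; k++)
                acc += h[k] * x[i + k];
            y[i] = acc;
        }
    }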