
1998-09-14
Technical Backgrounder from Philips Semiconductors

Philips Semiconductors' R.E.A.L. DSP Core for Low-Cost Low-Power Telecommunication and Consumer Applications

Why consumer oriented DSP is different

Without the advent of digital signal processing, many of the latest advances in personal communications and multimedia entertainment would not have been possible. Digitising analog signals and processing them in the digital domain not only eliminates the temperature drift and ageing problems associated with analog circuitry, but also means that adaptation to changing standards and the inclusion of advanced new performance features can be accomplished with software changes rather than changes to the hardware. Digital signal processing has also had an impact on the digital world, dramatically increasing the storage density of hard/optical disk drives and the speed with which data can be transmitted.

The latest generation of consumer product DSP applications (for example, baseband signal processing in digital mobile/cordless telephones and multi-channel audio decoding in digital TV transmissions) requires considerable processing power - comparable in terms of MOPS (Million Operations Per Second) to that of a desktop PC. In addition, most of these applications are 'real-time', putting severe constraints on the time available to calculate the required DSP algorithms.

With the ready availability of both DSP and RISC processors that can be clocked at several hundred MHz, it is tempting to solve these real-time processing problems simply by increasing the clock speed. However, the penalty you pay for doing so is greater power consumption, which increases linearly with clock frequency. In applications such as mobile and cordless telephones, where extended standby and talk times are key selling features, it is simply unacceptable to adopt a solution that increases the drain on the battery. Because consumer applications are also very price sensitive, the challenge for designers is to find a solution that combines high functional performance (as opposed to raw MIPS performance), low-power operation and low cost in a single implementation.

Many consumer applications also call for a highly integrated implementation, in which the DSP processor is part of a complete system-on-silicon solution. If circuit miniaturisation is important, such single-chip systems-on-silicon have obvious advantages, and if the volume is high enough they will also be the lowest cost option. Because of the absence of glue logic and peripheral components, single-chip solutions often also achieve the lowest power consumption.

However, in addition to the technical advantages of adopting a system-on-silicon approach, there is also a strong business advantage in doing so. In a world that is more and more dominated by standards (such as GSM and DECT for wireless telephony and MPEG for digital video/audio) the only way for equipment manufacturers to differentiate themselves from their competitors is by designing performance improvements or extra features into their products. The expertise within a company that allows it to achieve this - often referred to as a company's 'intellectual property' (IP) - needs careful guarding so that it cannot easily be copied. In this respect, system-on-silicon solutions have the IP protection advantage of being almost impossible to reverse engineer.

The value of intellectual property in the arena of DSP is particularly significant for several other reasons as well. The algorithms involved require in-depth knowledge of complex arithmetic operations that are highly sensitive to bit inaccuracies. In addition, real-time DSP applications involve a very wide design space in which hardware and software co-design is the only way to arrive at optimal solutions. As a result, DSP expertise is in short supply and therefore expensive to acquire.

Possible approaches to providing the necessary DSP performance
Because power consumption considerations rule out simply increasing the clock frequency to meet the required MOPS performance, most consumer product applications utilise DSP processors that feature a high degree of parallelism. This parallelism can be achieved in a variety of ways, each having its own advantages and disadvantages.

Deep Pipelining. Although not strictly a form of parallelism, deeper pipelining allows more instructions to be in flight at the same time, thereby increasing the instruction throughput rate. However, deep pipelines can suffer from high latency under program jump, branch and task-switching conditions, because the pipeline needs to be purged and refilled with a new section of program code. Mechanisms to overcome this latency (such as branch prediction, speculative calculation and pipeline bypassing) have been employed in some advanced RISC processors, but they all come at a price in terms of additional silicon area and higher power consumption. Because they disturb the temporal predictability of program execution, these techniques would in any case prove problematic in real-time DSP processors.

Instruction-Level Parallelism. This is a true parallel processing technique in which extra execution units (such as a second multiplier/accumulator) are added to the DSP so that two or more data paths can be driven in parallel by a single instruction sequencer. Given the inherent parallelism that exists in most DSP algorithms, instruction-level parallelism is an ideal technique for increasing the speed of algorithm execution. For example, the addition of a second multiplier/accumulator can speed up algorithms such as block-based FIR filters and FFTs by a factor of more than two compared with single multiplier/accumulator architectures, without any need to increase the clock frequency.

Data-Level Parallelism. Many of the multiply/accumulate algorithms found in typical DSP applications involve the multiplication of a large number of data values by a set of fixed coefficients. Where this high degree of regularity in the data structure exists, advantage can be gained by packing two or more data values into a single data word that can be fetched from memory in a single cycle. This technique can be used to best advantage when it is combined with instruction-level parallelism so that all data path elements are kept busy during each instruction cycle.
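As a behavioural illustration of this data packing (a sketch only - the word layout, type and function names below are assumptions for illustration, not the R.E.A.L. DSP memory format), two 16-bit samples can be carried in one 32-bit word so that a single fetch feeds both data paths:

#include <stdint.h>

/* Behavioural sketch of data-level parallelism (hypothetical layout):
   two 16-bit samples packed into one 32-bit word, so that a single
   memory fetch delivers an operand for each of two parallel data paths. */

typedef uint32_t packed_pair_t;              /* one packed data word */

static inline packed_pair_t pack_pair(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

static inline int16_t unpack_hi(packed_pair_t w) { return (int16_t)(w >> 16); }
static inline int16_t unpack_lo(packed_pair_t w) { return (int16_t)(w & 0xFFFFu); }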

Task-Level Parallelism. Where no single DSP processor is capable of executing the algorithms fast enough, the task can be shared between two similar processors (if necessary on the same chip, as in Philips Semiconductors' DUET AC3/MPEG multi-channel audio decoder). Such architectures, in which both processors are essentially similar, are referred to as homogeneous. A multi-processor approach can also be used to advantage where different algorithms, or different parts of the same algorithm, require fundamentally different DSP architectures in order to execute efficiently. In this case the architecture is referred to as heterogeneous.

The R.E.A.L. DSP Architecture
Philips Semiconductors' R.E.A.L. DSP is based largely on techniques for instruction-level parallelism that were first proven in the design of its previous KISS and EPICS families of DSP cores, but R.E.A.L. also includes elements of both data-level and task-level parallelism. R.E.A.L. DSP features an advanced dual Harvard architecture with two 16-bit data bus pathways to its Data Computation Unit (DCU) - see Figure 1. The DCU comprises two 16x16-bit signed multipliers and two 40-bit Arithmetic Logic Units (ALUs) each providing 32 data bits and 8 overflow bits. The data values generated by these ALUs are stored in four 40-bit accumulators.

[Figure 1]
Figure 1 - R.E.A.L. DSP's dual Harvard architecture and dual multiplier/accumulator more than double the execution speed of block-based FIR and FFT type algorithms

This dual multiplier/accumulator architecture, first developed by Philips Semiconductors some five years ago, is an example of how the company's extensive applications knowledge - its 'intellectual property' in DSP - has resulted in a core that is optimised for the calculation of sum-of-products based algorithms such as those used extensively for the filtering of baseband signals in wireless telephony. Provided that both multiplier/accumulators can be kept fully occupied, this architecture can halve the time taken to execute these algorithms, without having to increase the clock speed.

Fetching two data samples and two coefficients from memory in every instruction cycle (one data sample and one coefficient for each multiplier/accumulator) would normally be a problem. R.E.A.L. DSP avoids it by exploiting the fact that practical FIR filter algorithms process the digitised signal samples in blocks. For this block-based processing, R.E.A.L. DSP's dual multiplier/accumulator architecture calculates two output samples simultaneously, both using the same coefficient data, so a single fetched coefficient can be used by both multiplier/accumulators at once. In addition, because each input sample is used in the calculation of both output samples, separated in time by a single clock cycle, it can be fetched from memory once and held for one clock cycle in one of the input registers shown in Figure 1.
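The following C fragment is a behavioural sketch of this scheme, not R.E.A.L. DSP assembly, and all names in it are hypothetical. It shows a block FIR organised so that two output samples are computed in the same inner loop, with each coefficient fetched once and shared by both 'multiplier/accumulators':

#include <stdint.h>

/* Behavioural sketch (not R.E.A.L. DSP code): block FIR computing two
   output samples per pass, so each coefficient c[k] is fetched once and
   used by both accumulations. 64-bit C accumulators stand in for the
   40-bit hardware accumulators (32 data bits plus 8 overflow bits). */
void block_fir_dual_mac(const int16_t *x,    /* input samples; history back to x[1 - taps] must be valid */
                        const int16_t *c,    /* filter coefficients */
                        int16_t *y,          /* output block */
                        int taps, int n)     /* n assumed even in this sketch */
{
    for (int i = 0; i < n; i += 2) {
        int64_t acc0 = 0, acc1 = 0;                    /* two accumulations in parallel */
        for (int k = 0; k < taps; k++) {
            int16_t coeff = c[k];                      /* one coefficient fetch ...     */
            acc0 += (int32_t)coeff * x[i - k];         /* ... used for output sample i  */
            acc1 += (int32_t)coeff * x[i + 1 - k];     /* ... and for output sample i+1 */
        }
        y[i]     = (int16_t)(acc0 >> 15);              /* Q15 scaling; saturation omitted */
        y[i + 1] = (int16_t)(acc1 >> 15);
    }
}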

Optimum use of the data path through R.E.A.L. DSP's dual multiplier/accumulator architecture is achieved by two independent Address Computation Units (ACUs), both of which can be active within the same instruction cycle. Each ACU features eight address pointers (organised in four banks of two) with automatic context switching of pointers during interrupts and capabilities for modulo-protected pointer updates and bit-reversed addressing.
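In behavioural terms, these two addressing capabilities work roughly as sketched below; on the real core the updates are performed by the ACU hardware in parallel with the data path, and the function names here are purely illustrative.

#include <stdint.h>

/* Illustrative sketches of the two ACU addressing modes (names hypothetical). */

/* Modulo-protected update: advance an index inside a circular buffer of
   length len, as used for FIR delay lines (assumes |step| < len). */
static inline unsigned modulo_update(unsigned idx, int step, unsigned len)
{
    return (unsigned)((int)idx + step + (int)len) % len;
}

/* Bit-reversed addressing: reverse the lowest 'bits' bits of an index,
   giving the access order required by radix-2 FFTs (e.g. bits = 6 for
   a 64-point transform). */
static inline unsigned bit_reverse(unsigned idx, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (idx & 1u);
        idx >>= 1;
    }
    return r;
}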

The R.E.A.L. DSP dual multiplier/accumulator architecture accelerates the calculation of the vast majority of block processing multiply/accumulate type algorithms without the penalties (in terms of silicon area and memory complexity) of having to use 32-bit wide data buses. For example, a 64-tap FIR filter with a block size of N takes 8 + 37 x N cycles to execute, compared to a minimum of 64 x N cycles for a single multiplier/accumulator architecture.
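To put these figures into perspective with a purely illustrative block size of N = 40, the dual multiplier/accumulator core needs 8 + 37 x 40 = 1,488 cycles, against at least 64 x 40 = 2,560 cycles for the single multiplier/accumulator case - a saving of more than 40%.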

Application Specific Instructions Drive Massive Parallelism
However, not all DSP algorithms are as regular as FIR filters. Other applications, such as MPEG decoding for set-top boxes or speech coding in GSM phones, utilise algorithms that contain complex sequences of arithmetic operations. To cope with this type of algorithm, R.E.A.L. DSP's multiplier/accumulator structure can be operated as two multipliers, four independent 16-bit ALUs and eight independent 16-bit registers, together with several shifters and saturation units - see Figure 2. Each of these elements can be individually controlled from within user-defined 96-bit Application Specific Instructions (ASIs), giving programmers access to a massive amount of parallelism. ASIs can also be used to control the address computation units (ACUs) and data transfers within the DSP core.

[Figure 2]
Figure 2 - Driven by user-defined 96-bit Application Specific Instructions, the R.E.A.L. DSP core adapts to the execution of irregular DSP algorithms

In order to call these instructions from R.E.A.L. DSP's 16-bit program memory, selected ASIs are stored in a look-up table and pointed to by a special class of 16-bit opcode. They are typically only needed to speed up the tight program loops that form the speed-critical parts of algorithm processing, but because these are also the loops that execute 90% of the time, the ability of ASIs to customise R.E.A.L. DSP's instruction set to the target application leads to dramatic improvements in performance. As many as 256 different ASIs can be selected, which has already proved more than enough for complete target applications in mobile telephony and audio processing.
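Conceptually, the look-up mechanism behaves along the lines of the C sketch below; the 256-entry table size comes from the text above, but the opcode field layout and all names are assumptions made purely for illustration.

#include <stdint.h>

/* Conceptual sketch of ASI dispatch (field layout and names are assumed):
   a special class of 16-bit opcode carries an index into a table of up to
   256 user-defined 96-bit Application Specific Instructions. */

typedef struct {
    uint8_t word[12];               /* one 96-bit ASI stored as 12 bytes */
} asi_t;

static asi_t asi_table[256];        /* the ASI look-up table (ROM or RAM on silicon) */

static const asi_t *fetch_asi(uint16_t opcode)
{
    return &asi_table[opcode & 0xFFu];   /* low 8 bits assumed to select the ASI */
}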

Because ASIs are fully accounted for in R.E.A.L. DSP's assembler, linker and instruction set simulator, the effect of different ASIs on the execution speed of algorithms can be evaluated from within the programming environment. If the look-up table is implemented in RAM on silicon, different sets of ASIs can also be downloaded to the chip to evaluate their performance, or even to customise the DSP core on-the-fly in order to process radically different algorithms.

Hardware Acceleration Solves Speed-Critical Computations
Even if a suitable set of ASIs cannot be found to increase algorithm processing to the necessary speed, Philips Semiconductors' R.E.A.L. DSP has yet another means of achieving the required performance. The VHDL synthesis model of the R.E.A.L. DSP core allows designers to add Application Specific Execution Units (AXUs) at specified points in the data path or in the address computation units. By executing parts of the algorithm or address computation in dedicated hardware, these AXUs represent the fastest possible processing solution for a given clock frequency. Philips Semiconductors already has a library of AXUs, such as dedicated shifters, but designers can also create their own AXUs and fit them into the R.E.A.L. DSP architecture.

Because R.E.A.L. DSP is fully described at VHDL register-transfer level, it is highly process-independent. The existing RD160xx R.E.A.L. DSP cores are implemented in 0.35 micron to 0.25 micron CMOS, operating at clock speeds of up to 85 MHz under worst-case operating conditions (2.2 V supply and +85 degrees C) and achieving a sustained processing performance well in excess of 850 MOPS. Under typical conditions they can achieve clock speeds of up to 125 MHz, increasing their MOPS rating to 1,250.

The dual multiplier/accumulator version is the RD16020, and there is also a cut-down single multiplier/accumulator version, the RD16010, for applications with lower cost and performance requirements. New generations are already under development for implementation in sub-0.25 micron CMOS, offering increased processing performance and extending the data pathways from 16 bits to 24 bits to handle the resolution required in complex, high-end audio applications.

Design Tools
With the current state of embedded DSP technology, hand-crafted assembly language programming is still needed to create the time-critical inner loops that make up DSP algorithms, not only to create code that will execute fast enough but also to generate code that is compact enough to fit within realistically sized embedded memories. Programs written in assembler, however, suffer the disadvantages of being time consuming to generate, difficult to maintain and, above all, difficult to re-use. In addition, assembler programs are often poorly documented and only the original programmer can read them efficiently and understand how they work.

In consumer product applications where the re-use of intellectual property (IP) and short time-to-market are critical factors in business success, assembly language programming is not a realistic way forward. The solution is to provide users with efficient high-level language compilers that can handle the complexities of DSP architectures. Because of its widespread use and standardisation, ANSI-C is the obvious choice of high-level language to use.

However, C was not written with DSP processors in mind. It was written for standard von Neumann architectures (common data/program memory) with a single contiguous memory space and no fixed register structures. DSP processors, on the other hand, typically have Harvard architectures (separate data and program memory) with multiple address spaces, dedicated register sets and a high degree of parallelism, all of which affect program compilation. In addition, ANSI-C doesn't support the fixed-point arithmetic employed by most DSPs.
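For example, the single-cycle fractional (Q15) multiply of a fixed-point DSP has to be spelled out explicitly in ANSI-C, along the lines of the sketch below (an illustration only, not code generated by the R.E.A.L. DSP compiler):

#include <stdint.h>

/* Illustrative sketch: a rounding, saturating Q15 fractional multiply
   written in portable integer C. A fixed-point DSP does this in one
   instruction; ANSI-C has no fractional type, so it must be built from
   integer arithmetic, shifts and explicit saturation. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;          /* 16 x 16 -> 32-bit product (Q30) */
    p = (p + (1 << 14)) >> 15;           /* rescale to Q15 with rounding
                                            (arithmetic right shift assumed) */
    if (p >  32767) p =  32767;          /* saturate to the 16-bit range */
    if (p < -32768) p = -32768;
    return (int16_t)p;
}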

As well as supporting R.E.A.L. with an assembler and linker to meet current programming expectations, Philips Semiconductors is also working on C compilers for R.E.A.L. DSP. Using the CoSy compiler development platform from ACE Associated Compiler Experts b.v. (Amsterdam, The Netherlands), one of the world's leading compiler technology companies, Philips Semiconductors has already produced a fully validated ANSI-C compiler that is capable of handling the control code that surrounds DSP algorithms. Work is now underway, in collaboration with ACE, to add DSP-specific extensions to the compiler so that it can also generate the cycle-count optimised DSP code which, although it makes up only 10% of the program, runs 90% of the time. The flexible framework of CoSy, with its rigorously defined intermediate representation, opens up the possibility of making powerful optimisation strategies available to programmers. The existing development platform is being complemented by a state-of-the-art Code Development Framework (CDF), which, in addition to providing programmers with an integrated development environment and a common graphical user interface, will also include (among other features) configuration management and Real-Time Operating System (RTOS) support.

Whichever way code is generated, the R.E.A.L. DSP development platform includes powerful tools for software and hardware simulation and debugging. An instruction set simulator allows designers to experiment with different sets of ASIs, or even to modify R.E.A.L. DSP's hardware architecture by adding AXUs, on the basis of the simulation results. Emulation chips, capable of executing up to 50 million instructions per second, are available for real-time evaluation of program performance, and the program can also be simulated on a VHDL model of the final core for interconnect verification. Simulated and verified designs are supplied in the form of compiled VHDL, a synthesised netlist or placed-and-routed blocks, allowing R.E.A.L. DSP cores to be integrated seamlessly into ASIC design flows to create complete systems-on-silicon.

Creating these systems-on-silicon requires IP contributions not only in the area of DSP, but also in areas such as embedded microcontrollers, peripheral devices, memory, A/D and D/A conversion, I/O interfacing and mixed-signal circuitry. More importantly, it requires IP input in the areas of system-level simulation tools, multi-processor architectures and high-bandwidth on-chip bus structures. Philips Semiconductors possesses a high level of IP in all these areas, which, together with its in-depth knowledge of real applications, puts it in a strong position to create the low-cost, low-power systems-on-silicon that will power the next generation of high-technology consumer products.

Philips Semiconductors, a division of Koninklijke Philips Electronics NV, headquartered in Eindhoven, The Netherlands, is the world's ninth largest semiconductor supplier and its third largest supplier of discretes. Philips Semiconductors' innovations in digital audio, video and mobile technology position the company as a leader in the consumer, multimedia and wireless communications markets. Sales offices are located in all major markets around the world and are supported by systems labs.

