Adaptive Computing Systems Program

Dr. Antonio, seated, poses with part of the research team and hardware. The team members are, from left, Jack M. West, Research Scholar and PhD student; Jeffrey T. Muehring, Research Scholar and PhD student; and Brian F. Veale, Research Assistant and Master's student. photo:JAMS,RANGER

Configuring Combined GPP/DSP/FPGA Systems for Minimal Size, Weight, and Power (SWAP)

Sponsored by

DARPA Tactical Technology Office Adaptive Computing Systems Program

Co-Principal Investigators

John K. Antonio and Sudarshan K. Dhall, School of Computer Science, University of Oklahoma

Research Scholars

Jeffrey T. Muehring and Jack M. West

Graduate Research Assistants (Current)

Hongping Li, Sirirut Vanichayobon, Seok-Hyun Ko, and Manoj Anan Suresh Kumar

Graduate Research Assistants (Past)

Nik Gupta, Timothy A. Osmulski, and Brian F. Veale

OBJECTIVE

The objective of this effort is to investigate the advantages of combining reconfigurable hardware technology (i.e., FPGA-based boards) with embeddable multiprocessor systems technology. The goal is to demonstrate that, for a given computational load associated with instances of various embedded applications, the total size, weight, and power (SWAP) can be reduced by 50% to 90% by integrating FPGA-based boards into the embedded computational platform. It will be demonstrated, both theoretically and through the implementation of a prototype system, that a significant portion of the computations is performed more efficiently (in terms of SWAP) on a heterogeneous FPGA/multiprocessor-based platform. Reductions in size and weight of at least 50% will be demonstrated. Reductions in power consumption will be application dependent; savings of up to 90% will be shown for some cases.

APPROACH

The overall approach for the effort is divided into three phases. During the first phase, systematic techniques based on simulation and mathematical programming have been developed to determine optimal configurations for the proposed FPGA/multiprocessor-based platform for given application domains and scenarios. These techniques are built upon previously developed approaches for optimally configuring multiprocessor systems. In order to ensure that these simulation and analysis techniques are realistic, a prototype system - constructed with cooperation from Mercury Computer Systems, Inc. and Annapolis Micro Systems, Inc. - is also being developed. In addition, close contact with various defense-related organizations will ensure that the models and approaches are realistic with respect to military end-users.

During the second phase of the effort, focus is on the practical use of the constructed FPGA/multiprocessor-based system. Also, the design methodology for developing integrated systems consisting of both FPGA and multiprocessor technologies is being explored. Thus, the prototype system will be the centerpiece on which: (1) the optimal configuration techniques developed in the first phase are tested and (2) design methodologies for the practical use of FPGA/multiprocessor-based systems are developed.

A full demonstration will be developed during the third phase, which will illustrate the advantages of combining both FPGA and multiprocessor technologies into an integrated system. The demonstration will be based on two radar processing applications: SAR (synthetic aperture radar) and STAP (space-time adaptive processing). In particular, it will illustrate optimal configuration, programming, and execution of the prototype system for use in a "mixed-mode" setting in which the same hardware is optimally configured for instances within these two application domains.

RECENT ACCOMPLISHMENTS

Design, Implementation, And Trade-Off Study Of FIR Filters On FPGAs
Filtering operations represent a significant amount of the computational load associated with both SAR and STAP. We have implemented and tested two distinct approaches. One implements a serial multiply and a parallel (reduction) add circuit. The other implements a parallel multiplying circuit with a serial adder. The advantages of the first include the ease and efficiency with which it is placed and routed; a disadvantage is its lack of scalability and modularity due to the structure of the reduction tree used in the adder circuitry. The second design is highly scalable, requiring little effort to add more filter taps; a disadvantage, however, is the presence of a high degree of signal fan-out at the place and route level, which causes inefficiencies in power and space. We are investigating hybrid approaches that attempt to capture the advantages of both designs for large implementations spanning several chips and boards.
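The structural difference between a reduction-tree adder and a serial adder chain can be illustrated in software. The sketch below is not the circuitry described above; it is a behavioral model under the assumption that the tree-based variant resembles a direct-form FIR filter and the chained variant resembles a transposed-form FIR filter. The function names (tree_sum, fir_direct_tree, fir_transposed) are hypothetical.

```python
def tree_sum(values):
    """Add values pairwise, level by level, as a reduction tree would."""
    vals = list(values)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # odd element passes through to the next level
            nxt.append(vals[-1])
        vals = nxt
    return vals[0] if vals else 0


def fir_direct_tree(x, h):
    """Direct form: a delay line feeds the taps; products are summed by a tree.

    Adding a tap changes the shape of the tree, which is one way to picture
    the scalability concern raised in the text.
    """
    delay = [0] * len(h)
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]
        out.append(tree_sum(d * c for d, c in zip(delay, h)))
    return out


def fir_transposed(x, h):
    """Transposed form: the input is broadcast to every multiplier (fan-out),
    and products move through a chain of adders and registers.

    Adding a tap only appends one more multiplier/adder stage.
    """
    n = len(h)
    regs = [0] * n                 # regs[n-1] stays 0 and terminates the chain
    out = []
    for sample in x:
        products = [sample * c for c in h]   # same sample drives all taps
        out.append(products[0] + regs[0])
        regs = [products[k + 1] + regs[k + 1] for k in range(n - 1)] + [0]
    return out
```

Both functions compute the same filter output for the same input and coefficients; only the arrangement of the adders differs, which is the property that matters for placement, routing, and fan-out in hardware.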

Testing and Calibration of a Network Simulator For Embedded Multiprocessor Systems
This simulator, which was implemented last year in Java, is generic in that the parameters of the various network objects and their interconnections can be modified to accurately model different available systems (from different manufacturers) as well as possible future network designs. The simulator is designed to model "phased" communication patterns, i.e., communication patterns in which a group of messages begin entering the network at about the same time. Phased communication requirements, which can result in performance bottlenecks if not properly mapped and scheduled on the multiprocessor system, are common to many embedded applications, including STAP. The simulator will be used to aid in optimally mapping and scheduling required communications, thereby improving overall system efficiency. Currently, the simulator is being applied to predict the communication time associated with execution of the RT_STAP benchmark from MITRE on a Mercury System computing platform.

Calibration Of A Probabilistic Power Prediction Tool For FPGAs
The current version of this tool, which is implemented in Java, is for Xilinx 4028 and 4036 FPGAs. The tool requires the following two inputs: (1) the configuration file for the FPGA for a given circuit design and (2) a probabilistic characterization of the input signals to the FPGA chip. The tool then computes the activity (i.e., relative frequency) of every internal signal of the FPGA design. The signals are partitioned based on their physical length on the FPGA chip. Each signal of a given length is assumed to be driving a capacitance value that depends on signal length. Calibrating the simulator involves determination of these capacitance values based on actual power measurements taken from the FPGA. Unfortunately, our attempts to measure power have not been repeatable, and thus reliable calibration has not yet been accomplished.
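The prediction step described above can be sketched as follows. This is a minimal illustration, not the tool itself: it assumes the standard CMOS dynamic power relation (one half of capacitance times supply voltage squared per transition), treats activity as expected transitions per clock cycle, and uses hypothetical names (predicted_power, cap_by_length) and placeholder capacitance values that would in practice come from calibration.

```python
def predicted_power(signals, cap_by_length, vdd, freq_hz):
    """Estimate dynamic power in watts.

    signals       -- list of (length_class, activity) pairs, one per internal
                     signal; activity is expected transitions per clock cycle
    cap_by_length -- dict mapping each length class to a capacitance in farads
                     (assumed values; these are what calibration must determine)
    vdd           -- supply voltage in volts
    freq_hz       -- clock frequency in hertz
    """
    switched_cap = sum(cap_by_length[length] * activity
                       for length, activity in signals)
    return 0.5 * vdd ** 2 * freq_hz * switched_cap


# Hypothetical example: two signals, placeholder capacitances.
example_signals = [("short", 0.5), ("long", 0.25)]
example_caps = {"short": 1e-12, "long": 4e-12}
example_power = predicted_power(example_signals, example_caps,
                                vdd=5.0, freq_hz=20e6)
```

Because the estimate is linear in the per-class capacitances, calibration amounts to choosing those values so that predictions match measured power across several designs, which is why repeatable measurements are essential.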

Optimal Mapping And Scheduling Techniques For STAP
STAP involves three phases of processing on a 3-dimensional data cube. At each phase of processing, vectors of data along one dimension of the data cube are processed. The manner in which the vectors are mapped to processors, for each phase of processing, affects the communication pattern required between computational phases. Furthermore, the orders (i.e., schedules) in which each processor sends its queue of messages to their destinations impact network performance. We have formulated a two-phase optimization approach for mapping and scheduling for STAP. The first phase of the optimization involves solving the mapping problem by attempting to map the vectors so that the resulting communications utilize a minimal amount of the interconnection network hardware. The second phase applies a Genetic Algorithm (GA) approach to optimally schedule the message queues at each processor. The network simulator, described earlier, is used to estimate the communication time associated with each schedule considered by the GA.
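The scheduling phase can be pictured with a small genetic algorithm over per-processor message orders. The sketch below is illustrative only. The cost function comm_time is a toy stand-in for the network simulator (each step takes as long as the busiest destination), the all-to-all traffic pattern is an assumption, and all names and parameters (ga_schedule, pop_size, generations) are hypothetical. It requires at least three processors so that each queue has two or more messages.

```python
import random


def comm_time(schedule):
    """Toy cost: at step t every processor sends its t-th message; a step
    lasts as long as the number of messages aimed at the busiest destination."""
    total = 0
    for t in range(len(schedule[0])):
        load = {}
        for queue in schedule:
            load[queue[t]] = load.get(queue[t], 0) + 1
        total += max(load.values())
    return total


def naive_schedule(n):
    """All-to-all traffic; every processor sends in ascending destination order."""
    return [[d for d in range(n) if d != p] for p in range(n)]


def mutate(schedule, rng):
    """Swap two messages in one randomly chosen processor's queue."""
    child = [q[:] for q in schedule]
    q = rng.choice(child)
    i, j = rng.sample(range(len(q)), 2)
    q[i], q[j] = q[j], q[i]
    return child


def crossover(a, b, rng):
    """Inherit each processor's whole queue from one parent, so every queue
    remains a valid ordering of that processor's messages."""
    return [(qa if rng.random() < 0.5 else qb)[:] for qa, qb in zip(a, b)]


def ga_schedule(n, pop_size=30, generations=200, seed=0):
    rng = random.Random(seed)
    pop = [naive_schedule(n)]
    while len(pop) < pop_size:
        pop.append(mutate(pop[-1], rng))
    for _ in range(generations):
        pop.sort(key=comm_time)
        survivors = pop[:pop_size // 2]        # elitism: best half survives
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pop = survivors + children
    return min(pop, key=comm_time)
```

Because the best half of each generation is carried forward unchanged, the returned schedule is never worse than the naive one under this cost model. In the approach described above, the call to comm_time would be replaced by a run of the network simulator.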

Implementations And Evaluation Of FPGA Inner Product Co-Processor Designs
Inner product calculations are core calculations associated with both direct and indirect techniques for solving adaptive STAP weights. Two different inner product circuits were designed and implemented, each for a single FPGA. One design implements a "multiply-and-add" architecture and the other implements a "multiply-and-accumulate" architecture. The first design inputs four operands per cycle and has two multipliers and one adder circuit. The second design inputs two operands per cycle and has one multiplier and one accumulator. Both designs are heavily pipelined to increase the speed at which they can operate. Each of the two basic designs was implemented for two different data types: 16-bit floating point and 16-bit integer. We have discovered that although the first design contains more hardware (two multipliers rather than one), its internal signal lengths are much shorter than those of the second design. This is probably because there is naturally one direction of data flow in the first design, whereas the second design involves feedback signals associated with the accumulator. Assuming a linear model for capacitance as a function of signal length, our power prediction simulator indicates that the second design does indeed consume more power, for a given data set, than does the first.
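The data flow of the two architectures can be contrasted with a short behavioral model. This is a sketch, not the pipelined circuits themselves: the function names are hypothetical, and how the feed-forward design's per-cycle outputs are finally combined is an assumption (here they are simply summed afterwards).

```python
def multiply_and_add(a, b):
    """Feed-forward design: each cycle takes four operands (two from each
    vector), uses two multipliers and one adder, and emits one partial sum.
    No result is fed back into the datapath."""
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("vectors must have equal, even length")
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]


def multiply_and_accumulate(a, b):
    """Feedback design: each cycle takes two operands, uses one multiplier,
    and adds the product to a running total held in the accumulator."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    acc = 0
    for x, y in zip(a, b):
        acc += x * y               # accumulator output loops back as an input
    return acc


# Hypothetical data; both routes yield the same inner product.
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
partial_sums = multiply_and_add(a, b)
inner_product = multiply_and_accumulate(a, b)
```

The loop-carried variable acc is the software analogue of the feedback signal mentioned above, while multiply_and_add has no such dependence between cycles.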

CURRENT PLAN

Multiple Chip And Multiple Board Implementations For FIR filters
Realistic systems may involve filters with hundreds of taps. Thus, designs must be able to span multiple chips and boards of FPGAs.

Utilization Of Network Simulator For Optimal Mapping And Scheduling For STAP
The core of the scheduling component of the STAP optimization requires accurate prediction of communication times. Thus, after being thoroughly tested and calibrated, the simulator will be used for this purpose.

Calibration Of The Probabilistic Power Prediction Tool For FPGAs
We will continue in our efforts to obtain repeatable power measurements for actual FPGAs. This is necessary in order to properly calibrate our power prediction tool.

Implementation Of Optimal Mapping And Scheduling Techniques For STAP
Although the two-phase optimization technique has been designed, we next need to implement it and test the results. This will involve integrating the network simulator to evaluate the quality of each schedule considered.

Extensions Of Current Inner-Product Designs For Complex Data
Our current implementations of inner products on FPGAs are for real integer and floating point data. We will extend these designs to complex integer and floating point data formats.

Implementation Of The Hybrid FPGA/Multiprocessor-Based System
We have already designed the architecture of the hybrid FPGA/multiprocessor-based system. Now that the hardware has arrived, we will begin interfacing the FPGA-based boards with the Mercury multiprocessor system and develop the prototype system.

University of Oklahoma, Engineering Dean's Office © 2005 | updated 11-feb-04