Configuring Combined GPP/DSP/FPGA Systems for Minimal
Size, Weight, and Power (SWAP)
Sponsored by: DARPA Tactical Technology Office, Adaptive Computing Systems Program
Co-Principal Investigators: John K. Antonio and Sudarshan K. Dhall, School of Computer Science, University of Oklahoma
Research Scholars: Jeffrey T. Muehring and Jack M. West
Graduate Research Assistants (Current): Hongping Li, Sirirut Vanichayobon, Seok-Hyun Ko, and Manoj Anan Suresh Kumar
Graduate Research Assistants (Past): Nik Gupta, Timothy A. Osmulski, and Brian F. Veale
OBJECTIVE
The objective of this effort is to investigate the
advantages of combining reconfigurable hardware technology
(i.e., FPGA-based boards) and embeddable multiprocessor
systems technology. The goal is to demonstrate that
for a given computational load - associated with
instances of various embedded applications - the
total size, weight, and power (SWAP) can be reduced
by 50% to 90% by integrating FPGA-based boards
into the embedded computational platform. It will
be demonstrated, both theoretically and through the
implementation of a prototype system, that a significant
portion of the computations are more efficiently
performed (in terms of SWAP) on a heterogeneous FPGA/multiprocessor-based
platform. Reductions in size and weight of at least
50% will be demonstrated. Reductions in power consumption
will be application dependent; savings of up to 90%
will be shown for some cases.
APPROACH
The overall approach for the effort is divided into
three phases. During the first phase, systematic
techniques based on simulation and mathematical programming
have been developed to determine optimal configurations
for the proposed FPGA/multiprocessor-based platform
for given application domains and scenarios. These
techniques are built upon previously developed approaches
for optimally configuring multiprocessor systems.
In order to ensure that these simulation and analysis
techniques are realistic, a prototype system - constructed
with cooperation from Mercury Computer Systems, Inc.
and Annapolis Micro Systems, Inc. - is also being
developed. In addition, close contact with various
defense-related organizations will ensure that the
models and approaches are realistic with respect
to military end-users.
During the second phase of the effort, focus is
on the practical use of the constructed FPGA/multiprocessor-based
system. Also, the design methodology for developing
integrated systems consisting of both FPGA and multiprocessor
technologies is being explored. Thus, the prototype
system will be the centerpiece on which: (1) the
optimal configuration techniques developed in the
first phase are tested and (2) design methodologies
for the practical use of FPGA/multiprocessor-based
systems are developed.
A full demonstration will be developed during the third
phase, which will illustrate the advantages of combining
FPGA and multiprocessor technologies into an integrated
system. The demonstration will
be based on two radar processing applications: SAR
(synthetic aperture radar) and STAP (space-time adaptive
processing). In particular, it will illustrate optimal
configuration, programming, and execution of the
prototype system for use in a "mixed-mode" setting
in which the same hardware is optimally configured
for instances within these two application domains.
RECENT ACCOMPLISHMENTS
Design, Implementation, And Trade-Off Study
Of FIR Filters On FPGAs
Filtering operations represent a significant amount
of the computational load associated with both SAR
and STAP. We have implemented and tested two distinct
approaches. One implements a serial multiply and
a parallel (reduction) add circuit. The other implements
a parallel multiplying circuit with a serial adder.
The advantages of the first include the ease and efficiency
with which it is placed and routed; a disadvantage
is its lack of scalability and modularity due to the structure
of the reduction tree used in the adder circuitry.
The second design is highly scalable, requiring little
effort to add more filter taps; a disadvantage, however,
is the presence of a high degree of signal fan-out
at the place and route level, thus causing inefficiencies
in power and space. We are investigating hybrid approaches
that attempt to capture the advantages of both designs
for large implementations spanning several chips
and boards.
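The structural difference between the two filter designs can be sketched in software. The following is only an illustration, not the FPGA circuits themselves: it assumes the first design resembles a direct-form filter whose products are summed by a pairwise reduction tree, and the second resembles a transposed-form filter in which each input sample fans out to every multiplier and the additions are chained. The function names are hypothetical, and the sketch models only the adder structure and input fan-out, not multiplier timing or placement.

```python
def fir_direct_tree(samples, taps):
    """Direct form: products are summed by a pairwise reduction tree."""
    history = [0] * len(taps)              # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]
        products = [h * c for h, c in zip(history, taps)]
        # Each pass halves the list; adding a tap can change the tree shape.
        while len(products) > 1:
            if len(products) % 2:
                products.append(0)
            products = [products[i] + products[i + 1]
                        for i in range(0, len(products), 2)]
        out.append(products[0])
    return out


def fir_transposed_chain(samples, taps):
    """Transposed form: x fans out to all multipliers; adders form a chain."""
    state = [0] * len(taps)                # registered partial sum per stage
    out = []
    for x in samples:
        # The same sample x drives every multiplier (high fan-out).
        sums = [c * x + s for c, s in zip(taps, state)]
        out.append(sums[0])
        # Shift partial sums one stage toward the output.
        state = sums[1:] + [0]
    return out


if __name__ == "__main__":
    print(fir_direct_tree([1, 2, 3, 4], [1, 2, 3]))
    print(fir_transposed_chain([1, 2, 3, 4], [1, 2, 3]))
```

Both functions compute the same filter output. In the chained version, adding a tap only appends one more stage, whereas in the tree version the reduction structure depends on the total tap count.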
Testing and Calibration of a Network Simulator
For Embedded Multiprocessor Systems
This simulator, which was implemented last year in
Java, is generic in that the parameters of the various
network objects and their interconnections can be
modified to accurately model different available
systems (from different manufacturers) as well as
possible future network designs. The simulator is
designed to model "phased" communication
patterns, i.e., communication patterns in which a
group of messages begins entering the network at about
the same time. Phased communication requirements,
which can result in performance bottlenecks if not
properly mapped and scheduled on the multiprocessor
system, are common to many embedded applications,
including STAP. The simulator will be used to aid
in optimally mapping and scheduling required communications,
thereby improving overall system efficiency. Currently,
the simulator is being applied to predict the communication
time associated with execution of the RT_STAP benchmark
from MITRE on a Mercury System computing platform.
Calibration Of A Probabilistic Power Prediction
Tool For FPGAs
The current version of this tool, which is implemented
in Java, is for Xilinx 4028 and 4036 FPGAs. The tool
requires the following two inputs: (1) the configuration
file for the FPGA for a given circuit design and
(2) a probabilistic characterization of the input
signals to the FPGA chip. The tool then computes
the activity (i.e., relative frequency) of every
internal signal of the FPGA design. The signals are
partitioned based on their physical length on the
FPGA chip. Each signal of a given length is assumed
to be driving a capacitance value that depends on
signal length. Calibrating the simulator involves
determination of these capacitance values based on
actual power measurements taken from the FPGA. Unfortunately,
our attempts to measure power have not been repeatable,
and thus reliable calibration has not yet been accomplished.
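The power model described above can be sketched as follows. This is a minimal illustration, not the tool itself: it assumes the standard dynamic-power relation P = 0.5 * activity * C * Vdd^2 * f summed over all internal signals, with activity interpreted as transitions per clock cycle. The function name, the length classes, and every numeric value (capacitances, supply voltage, clock rate) are hypothetical placeholders rather than measured or calibrated data.

```python
def dynamic_power(signals, cap_by_length, vdd, clock_hz):
    """Sum 0.5 * activity * C * Vdd^2 * f over all internal signals.

    signals       -- list of (length_class, activity) pairs
    cap_by_length -- capacitance in farads assumed for each length class
    """
    total = 0.0
    for length_class, activity in signals:
        cap = cap_by_length[length_class]
        total += 0.5 * activity * cap * vdd ** 2 * clock_hz
    return total


if __name__ == "__main__":
    # Hypothetical capacitances per length class (uncalibrated).
    caps = {"short": 1e-12, "long": 4e-12}
    # Hypothetical activities, e.g. as computed from the input statistics.
    sigs = [("short", 0.5), ("long", 0.25)]
    print(dynamic_power(sigs, caps, vdd=5.0, clock_hz=10e6))
```

Because the predicted power is linear in the per-class capacitance values, calibration amounts to fitting those values to measured power, which is why repeatable measurements are essential.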
Optimal Mapping And Scheduling Techniques
For STAP
STAP involves three phases of processing on a
3-dimensional data cube. At each phase of
processing, vectors of data along one dimension
of the data cube are processed. The manner in which
the vectors are mapped to processors, for each
phase of processing, affects the communication
pattern required between computational phases.
Furthermore, the orders (i.e., schedules) in which
each processor sends its queue of messages to their
destinations impact network performance.
We have formulated a two-phase optimization approach
for mapping and scheduling for STAP. The first
phase of the optimization involves solving the
mapping problem by attempting to map the vectors
so that the resulting communications utilize a
minimal amount of the interconnection network hardware.
The second phase applies a Genetic Algorithm (GA)
approach to optimally schedule the message queues
at each processor. The network simulator, described
earlier, is used to estimate the communication
time associated with each schedule considered
by the GA.
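The scheduling phase can be illustrated with a small sketch. Everything here is an assumption made for illustration: comm_time is a toy stand-in for the network simulator (it charges each step the worst contention at any one destination), the queues are assumed non-empty, and the GA operators (whole-queue crossover, swap mutation, elitist selection) are generic choices, not the operators used in this effort. A GA of this kind is a heuristic and does not guarantee an optimal schedule.

```python
import random


def comm_time(schedule):
    """Toy cost: per step, the largest number of messages sharing a destination."""
    total = 0
    for t in range(max(len(q) for q in schedule)):
        dests = [q[t] for q in schedule if t < len(q)]
        total += max(dests.count(d) for d in set(dests))
    return total


def crossover(a, b, rng):
    # Take each processor's whole queue from one parent so it stays a valid order.
    return [list(rng.choice((qa, qb))) for qa, qb in zip(a, b)]


def mutate(schedule, rng):
    # Swap two messages within one randomly chosen processor queue.
    child = [list(q) for q in schedule]
    q = rng.choice(child)
    i, j = rng.randrange(len(q)), rng.randrange(len(q))
    q[i], q[j] = q[j], q[i]
    return child


def ga_schedule(queues, generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.sample(q, len(q)) for q in queues]
                  for _ in range(pop_size - 1)]
    population.append([list(q) for q in queues])   # keep the given order too
    for _ in range(generations):
        population.sort(key=comm_time)
        parents = population[: pop_size // 2]      # elitist: best half survives
        children = [mutate(crossover(rng.choice(parents),
                                     rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=comm_time)


if __name__ == "__main__":
    # Four processors, each sending one message to every other processor.
    queues = [[d for d in range(4) if d != p] for p in range(4)]
    best = ga_schedule(queues)
    print(comm_time(queues), comm_time(best))
```

In an actual implementation, the call to comm_time would be replaced by a run of the network simulator for the candidate schedule.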
Implementations And Evaluation Of FPGA Inner
Product Co-Processor Designs
Inner product calculations are core calculations
associated with both direct and indirect techniques
for solving adaptive STAP weights. Two different
inner product circuits were designed and implemented,
each for a single FPGA. One design implements a "multiply-and-add" architecture
and the other implements a "multiply-and-accumulate" architecture.
The first design inputs four operands per cycle and
has two multipliers and one adder circuit. The second
design inputs two operands per cycle and has one
multiplier and one accumulator. Both designs are
heavily pipelined to increase the speed at which
they can operate. Each of the two basic designs was
implemented for two different data types: 16-bit
floating point and 16-bit integer. We have discovered
that although the first design contains more hardware
(two multipliers rather than one), its internal signal
lengths are much shorter than those of the second
design. This is probably because there is naturally
one direction of data flow in the first design, whereas
the second design involves feedback signals associated
with the accumulator. Assuming a linear model for
capacitance as a function of signal length, our power
prediction simulator indicates that the second design
does indeed consume more power, for a given data set,
than does the first.
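The data flow of the two architectures can be contrasted with a short behavioral sketch. It is not a model of the FPGA circuits: pipeline latency and data formats are ignored, the function names are hypothetical, and it assumes that the multiply-and-add design emits one pair sum per cycle with the final summation performed outside the circuit, since the source describes that design as having no feedback.

```python
def multiply_and_add(a, b):
    """Feed-forward: each cycle takes (a0, b0, a1, b1) and emits a0*b0 + a1*b1."""
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("expects equal-length vectors of even length")
    # One list element per cycle; no value is fed back into the circuit.
    return [a[i] * b[i] + a[i + 1] * b[i + 1] for i in range(0, len(a), 2)]


def multiply_and_accumulate(a, b):
    """Feedback: each cycle takes (a, b) and adds a*b to a running accumulator."""
    acc = 0
    cycles = 0
    for x, y in zip(a, b):
        acc += x * y          # the accumulator output feeds back into the adder
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [5, 6, 7, 8]
    pair_sums = multiply_and_add(a, b)
    print(sum(pair_sums), len(pair_sums))      # result and cycles, design one
    print(multiply_and_accumulate(a, b))       # result and cycles, design two
```

For vectors of length n, the first design consumes its operands in n/2 cycles and the second in n cycles, which reflects the four-versus-two operands per cycle described above.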
CURRENT PLAN
Multiple Chip And Multiple Board Implementations
For FIR Filters
Realistic systems may involve filters with hundreds
of taps. Thus, designs must be able to span multiple
FPGA chips and boards.
Utilization Of Network Simulator For Optimal
Mapping And Scheduling For STAP
The core of the scheduling component of the STAP
optimization requires accurate prediction
of communication times. Thus, after being thoroughly
tested and calibrated, the simulator will be used
for this purpose.
Calibration Of The Probabilistic Power Prediction
Tool For FPGAs
We will continue in our efforts to obtain repeatable
power measurements for actual FPGAs. This is necessary
in order to properly calibrate our power prediction
tool.
Implementation Of Optimal Mapping And Scheduling
Techniques For STAP
Although the two-phase optimization technique has
been designed, the next step is to implement it and
test the results. This will involve integration of
the network simulator for evaluation of the quality
of each schedule considered.
Extensions Of Current Inner-Product Designs
For Complex Data
Our current implementations of inner products on
FPGAs are for real integer and floating point data.
We will extend these designs to complex integer
and floating point data formats.
Implementation Of The Hybrid FPGA / Multiprocessor-Based
System
We have already designed
the hybrid FPGA/multiprocessor-based system. Now
that the hardware has arrived, we will begin interfacing
the FPGA-based boards with the Mercury multiprocessor
system and developing the prototype system.