# CprE / ComS 583 Reconfigurable Computing

Prof. Joseph Zambreno, Department of Electrical and Computer Engineering, Iowa State University

Lecture #22 - Multi-Context FPGAs



#### Recap – Communication and Control

- Need to signal between CPU and accelerator
  - Data ready
  - Complete
- Implementations:
  - Shared memory
  - FIFO
  - Handshake
- If computation time is very predictable, a simpler communication scheme may be possible
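As a rough illustration of the handshake option, the sketch below models the CPU and the accelerator as two Python threads that signal "data ready" and "complete" through `threading.Event` objects. The names `Mailbox`, `accelerator`, and `run_on_accelerator`, and the squaring stand-in computation, are assumptions made for this example and do not come from the lecture.

```python
import threading


class Mailbox:
    """Shared state between the CPU and a simulated accelerator (hypothetical)."""

    def __init__(self):
        self.data = None
        self.result = None
        self.data_ready = threading.Event()  # CPU -> accelerator signal
        self.complete = threading.Event()    # accelerator -> CPU signal


def accelerator(mailbox):
    mailbox.data_ready.wait()                          # block until "data ready"
    mailbox.result = [x * x for x in mailbox.data]     # stand-in computation
    mailbox.complete.set()                             # signal "complete"


def run_on_accelerator(data):
    mailbox = Mailbox()
    worker = threading.Thread(target=accelerator, args=(mailbox,))
    worker.start()
    mailbox.data = data          # write the operands first...
    mailbox.data_ready.set()     # ...then raise "data ready"
    mailbox.complete.wait()      # block until the accelerator reports "complete"
    worker.join()
    return mailbox.result


print(run_on_accelerator([1, 2, 3]))  # [1, 4, 9]
```

If the computation time were known exactly, the `complete` signal could in principle be replaced by a fixed wait, which is the simpler scheme the last bullet alludes to.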

November 7, 2006 CprE 583 – Reconfigurable Computing Lect-22.3



#### Outline

- Recap
- Multicontext
  - Motivation
  - Cost analysis
  - Hardware support
  - Examples


#### Single Context

- When we have
  - Cycles and no data parallelism
  - Low throughput, unstructured tasks
  - Dissimilar data-dependent tasks
- Active resources sit idle most of the time
  - Waste of resources
- Why?












#### Resource Reuse

- Resources must be directed to do different things at different times through instructions
- Different local configurations can be thought of as instructions
- Minimizing the number and size of instructions is key to achieving an efficient design
- What are the implications for the hardware?
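To make the "configurations as instructions" idea concrete, here is a minimal sketch of a single physical LUT that stores several truth tables and selects one per cycle with a context index. The class name `MultiContextLUT` and the two-context half-adder example are illustrative assumptions, not a description of any particular device.

```python
class MultiContextLUT:
    """One physical k-input LUT holding several configurations (hypothetical model)."""

    def __init__(self, truth_tables):
        # truth_tables[c] lists the 2**k output bits used when context c is active.
        self.truth_tables = truth_tables

    def evaluate(self, context, inputs):
        # Pack the input bits into a table index; inputs[0] is the least significant bit.
        index = 0
        for position, bit in enumerate(inputs):
            index |= (bit & 1) << position
        return self.truth_tables[context][index]


AND2 = [0, 0, 0, 1]  # context 0: carry bit of a half adder
XOR2 = [0, 1, 1, 0]  # context 1: sum bit of a half adder
lut = MultiContextLUT([AND2, XOR2])

carry = lut.evaluate(0, [1, 1])  # cycle 0 uses context 0
total = lut.evaluate(1, [1, 1])  # cycle 1 reuses the same LUT with context 1
print(carry, total)  # 1 0
```

Each extra context costs configuration memory rather than another active LUT, which is the hardware trade-off the last bullet raises.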















#### Scheduling Limitations

- N<sub>A</sub> (active)
  - Size of largest stage
- Precedence:
  - Can evaluate a LUT only after predecessors have been evaluated
  - Cannot always completely equalize stage requirements
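The relationship between precedence and N<sub>A</sub> can be sketched as follows, assuming the design is given as a dictionary `preds` that maps each LUT to the LUTs feeding it (primary inputs omitted). The functions `asap_levels` and `active_luts` and the seven-LUT netlist are made up for illustration. Each LUT is placed in the earliest stage its predecessors allow, and N<sub>A</sub> is the size of the largest resulting stage.

```python
from collections import Counter


def asap_levels(preds):
    """Assign every LUT the earliest stage allowed by its predecessors."""
    levels = {}

    def level(node):
        if node not in levels:
            levels[node] = 1 + max((level(p) for p in preds[node]), default=0)
        return levels[node]

    for node in preds:
        level(node)
    return levels


def active_luts(levels):
    """N_A: the number of LUTs in the largest stage."""
    return max(Counter(levels.values()).values(), default=0)


# Hypothetical netlist: four input-level LUTs feeding a three-LUT chain.
preds = {
    "a": [], "b": [], "c": [], "d": [],
    "e": ["a", "b"],
    "f": ["e", "c"],
    "g": ["f", "d"],
}
levels = asap_levels(preds)
print(levels)               # {'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 2, 'f': 3, 'g': 4}
print(active_luts(levels))  # 4
```

Seven LUTs over four stages would ideally need only two active LUTs, but the earliest-possible placement crowds four into stage 1. In a pure chain, no stage can ever hold more than one LUT, however many are available.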










#### Multicontext Scheduling

- "Retiming" for multicontext
  - Goal: minimize peak resource requirements
- NP-complete
- List schedule, anneal
  - How do we accommodate intermediate data?
  - Effects?
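A minimal sketch of the list-scheduling option is shown below. It assumes the same kind of `preds` dictionary (LUT name to list of predecessor LUTs) and a fixed number of contexts. The greedy rule used here, filling each context with the ready LUTs that have the least slack, is only one simple heuristic and is not guaranteed to find the true minimum, since the problem is NP-complete. All function names are hypothetical, and annealing is not shown.

```python
import math


def alap_levels(preds, num_contexts):
    """Latest context each LUT may occupy without delaying its successors."""
    succs = {n: [] for n in preds}
    for node, ps in preds.items():
        for p in ps:
            succs[p].append(node)
    alap = {}

    def latest(node):
        if node not in alap:
            alap[node] = min((latest(s) - 1 for s in succs[node]), default=num_contexts)
        return alap[node]

    for node in preds:
        latest(node)
    return alap


def list_schedule(preds, num_contexts, capacity):
    """Greedy list schedule; returns None if `capacity` LUTs per context is too few."""
    alap = alap_levels(preds, num_contexts)
    stage_of = {}
    for stage in range(1, num_contexts + 1):
        # A LUT is ready once all of its predecessors sit in earlier contexts.
        ready = [n for n in preds if n not in stage_of
                 and all(stage_of.get(p, stage) < stage for p in preds[n])]
        ready.sort(key=lambda n: (alap[n], n))  # least slack first
        for node in ready[:capacity]:
            stage_of[node] = stage
        if any(alap[n] <= stage for n in preds if n not in stage_of):
            return None  # some LUT missed its latest feasible context
    return stage_of


def minimize_peak(preds, num_contexts):
    """Try increasing capacities until the greedy schedule fits."""
    for capacity in range(math.ceil(len(preds) / num_contexts), len(preds) + 1):
        schedule = list_schedule(preds, num_contexts, capacity)
        if schedule is not None:
            return capacity, schedule
    return None


preds = {"a": [], "b": [], "c": [], "d": [],
         "e": ["a", "b"], "f": ["e", "c"], "g": ["f", "d"]}
peak, schedule = minimize_peak(preds, num_contexts=4)
print(peak)      # 2
print(schedule)  # {'a': 1, 'b': 1, 'c': 2, 'e': 2, 'd': 3, 'f': 3, 'g': 4}
```

In this made-up netlist, delaying `c` and `d` until just before they are consumed halves the peak compared with placing every LUT as early as possible.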


#### Signal Retiming

- Non-pipelined
  - Hold the value on the LUT output (wire)
    - From production through consumption
  - Wastes wires and switches by occupying them
    - For the entire critical path delay L
    - Not just for the 1/L-th of the cycle it takes to cross the wire segment
  - How will it show up in multicontext?
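One plausible way to see the multicontext counterpart is to count how many LUT outputs must be held across each context boundary, since a value produced in one context and consumed several contexts later has to be stored somewhere in between. The sketch below assumes a `preds` dictionary and a `stage_of` schedule like the hypothetical ones used for illustration here. The function name and both schedules are assumptions, not material from the lecture.

```python
def live_values_per_boundary(preds, stage_of, num_contexts):
    """Count LUT outputs held across each boundary between context b and b + 1."""
    last_use = dict(stage_of)  # a value with no consumers dies in its own context
    for node, ps in preds.items():
        for p in ps:
            last_use[p] = max(last_use[p], stage_of[node])
    return [
        sum(1 for n in stage_of if stage_of[n] <= boundary < last_use[n])
        for boundary in range(1, num_contexts)
    ]


preds = {"a": [], "b": [], "c": [], "d": [],
         "e": ["a", "b"], "f": ["e", "c"], "g": ["f", "d"]}

# Every LUT as early as possible: c and d wait a long time for their consumers.
early = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2, "f": 3, "g": 4}
# c and d delayed until just before they are needed.
balanced = {"a": 1, "b": 1, "c": 2, "e": 2, "d": 3, "f": 3, "g": 4}

print(live_values_per_boundary(preds, early, 4))     # [4, 3, 2]
print(live_values_per_boundary(preds, balanced, 4))  # [2, 2, 2]
```

Under this simple model, values that are produced long before they are consumed show up as extra storage at every boundary they cross, much as a held wire is occupied for the whole critical path in the non-pipelined case.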












#### General Throughput Mapping

- If only limited throughput is needed
- Target: produce a new result every t cycles
- Spatially pipeline every t stages
  - cycle = t
- Retime to minimize register requirements
- Multicontext evaluation within a spatial stage
  - Retime (list schedule) to minimize resource usage
- Map for depth (i) and contexts (c)
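The arithmetic behind "spatially pipeline every t stages" can be sketched as below, under the assumption that a design with a LUT path length of `path_length` is cut into spatial pipeline stages of at most `t` LUT levels, and that each spatial stage is then evaluated over `t` contexts. The function name `throughput_mapping` is hypothetical.

```python
import math


def throughput_mapping(path_length, t):
    """Return (spatial pipeline stages, contexts per stage) for one result every t cycles."""
    if t < 1 or path_length < 1:
        raise ValueError("path_length and t must be positive")
    spatial_stages = math.ceil(path_length / t)  # one pipeline cut every t LUT levels
    contexts = min(t, path_length)               # never more contexts than LUT levels
    return spatial_stages, contexts


# Path length 13 matches the `des` circuit in the benchmark set.
print(throughput_mapping(13, 1))   # (13, 1): fully pipelined, single context
print(throughput_mapping(13, 4))   # (4, 4)
print(throughput_mapping(13, 13))  # (1, 13): fully time-multiplexed
```

The two extremes correspond to a conventional fully pipelined single-context mapping and to a single spatial stage that steps through the whole design one level per context.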


#### Benchmark Set

- 23 MCNC circuits
- Area mapped with SIS and Chortle

| Circuit | Mapped LUTs | Path Length | Circuit | Mapped LUTs | Path Length |
|---------|-------------|-------------|---------|-------------|-------------|
| 5xp1    | 46          | 10          | des     | 1267        | 13          |
| 9sym    | 123         | 7           | e64     | 230         | 9           |
| 9symm   | 108         | 8           | f51m    | 45          | 17          |
| C499    | 85          | 10          | misex1  | 20          | 6           |
| C880    | 176         | 21          | misex2  | 38          | 8           |
| alu2    | 169         | 19          | rd73    | 105         | 10          |
| apex6   | 248         | 9           | rd84    | 150         | 9           |
| apex7   | 77          | 7           | rot     | 293         | 16          |
| b9      | 46          | 7           | sao2    | 73          | 9           |
| clip    | 121         | 9           | vg2     | 60          | 9           |
| cordic  | 367         | 13          | z4ml    | 8           | 7           |
| count   | 46          | 16          |         |             |             |





#### Reconfiguration for Fault Tolerance

- Embedded systems require high reliability in the presence of transient or permanent faults
- FPGAs contain substantial redundancy
- Possible to dynamically "configure around" problem areas
- Numerous on-line and off-line solutions


#### Column Based Reconfiguration

- Huang and McCluskey
- Assume that each FPGA column is equivalent in terms of logic and routing
  - Preserve empty columns for future use
  - Somewhat wasteful
- Precompile and compress differences in bitstreams


#### Column Based Reconfiguration

- Create multiple copies of the same design with different unused columns
- Only requires different inter-block connections
- Can lead to an unreasonable configuration count





Figure 4: The overlapping precompiled configuration. (a) Base configuration. (b) Alternative configuration when column 3 is intentionally unused.


#### Column Based Reconfiguration

- Determining differences and compressing the results leads to "reasonable" overhead
- Scalability and fault diagnosis are issues
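The idea of precompiling alternatives and storing only their differences can be sketched with a heavily simplified model. Each FPGA column is treated as an opaque configuration value, the base configuration keeps its last column spare, and an alternative is built by shifting logic one column to the right past the column to be avoided. The function names, the zero-based column indices, and the letter-valued "configurations" are assumptions for illustration. Real bitstreams and the inter-block routing changes are not modeled.

```python
def alternative_configuration(base, unused_column):
    """Build a configuration that leaves `unused_column` (zero-based) empty."""
    if base[-1] is not None:
        raise ValueError("base configuration needs a spare (None) last column")
    used = base[:-1]
    # Columns left of the unused one stay put; the rest shift right by one.
    return used[:unused_column] + [None] + used[unused_column:]


def column_differences(base, alternative):
    """Store only the columns whose contents differ from the base configuration."""
    return {i: new for i, (old, new) in enumerate(zip(base, alternative)) if old != new}


base = ["A", "B", "C", "D", None]  # hypothetical 5-column device, last column spare
alt = alternative_configuration(base, unused_column=2)
print(alt)                            # ['A', 'B', None, 'C', 'D']
print(column_differences(base, alt))  # {2: None, 3: 'C', 4: 'D'}
```

In this model, covering any single unusable column takes one alternative per column, and the stored difference grows the further left the avoided column sits. This gives a feel for why configuration count and scalability are listed as concerns.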




#### Summary

- In many cases, logic cannot be profitably reused at the device cycle rate
  - Cycles, no data parallelism
  - Low throughput, unstructured
  - Dissimilar data-dependent computations
- These cases benefit from having more than one instruction/operation per active element
- Economical retiming becomes important here to achieve active LUT reduction
- For c=[4,8], l=[4,6], automatically mapped designs are 1/2 to 1/3 the size of single-context designs
