







## Overview of Our Approach: Customized Computing with Accelerator-Rich Architectures

- Extensive use of dedicated and composable accelerators
  - Most computations are carried out on accelerators, not on processors!
- A fundamental departure from von Neumann architecture
- Why now?
  - Previous architectures are device/transistor limited
  - Von Neumann architecture allows maximum device reuse
    - One pipeline serves all functions, fully utilized
- Future architectures
  - Plenty of transistors, but power/energy limited (dark silicon)
  - Customization and specialization for maximum energy efficiency
- A story of specialization

## Lessons from Nature: Human Brain and Advance of Civilization

- High power efficiency (20W) of the human brain comes from specialization
  - Different regions are responsible for different functions
- Remarkable advancement of civilization also comes from specialization
  - More advanced societies have a higher degree of specialization



## Intel's \$16.7B Acquisition of Altera

June 1, 2015

**Intel to Acquire Altera**

- Enables New Classes of Products in High-Growth Data Center and Internet of Things Market Segments
- Combination Harnesses the Power of Moore's Law to Accelerate Altera's Existing Businesses
- Expected to be Accretive to Non-GAAP EPS and Free Cash Flow in First Year After Close

SANTA CLARA, Calif. & SAN JOSE, Calif.-(BUSINESS WIRE)- Intel Corporation (NASDAQ: INTC) and Altera Corporation (NASDAQ: ALTR) today announced a definitive agreement under which Intel would acquire Altera for \$54 per share in an all-cash transaction valued at approximately \$16.7 billion.

> Intel CEO Brian Krzanich noted, "The acquisition will couple Intel's leading-edge products and manufacturing process with Altera's leading field-programmable gate array (or FPGA) technology." He further stated, "The combination is expected to enable new classes of products that meet customer needs in the data center and Internet of Things market segments."

FALCON CONFIDENTIAL









| Task                    | 2010              | 2013                        | 2015 (Today)                            |
|-------------------------|-------------------|-----------------------------|-----------------------------------------|
| CT image reconstruction | 18 hours          | 20 minutes                  | 6 minutes                               |
|                         | Single-thread CPU | FPGA acceleration on Convey | 4 Virtex-6 FPGAs on Convey w/data reuse |
| Denoising               | 5 minutes         | 15 seconds                  | 3 seconds                               |
|                         | Single-thread CPU | NVIDIA GPU                  | Core i7 Haswell, OpenMP, stencils       |
| Registration            | 10 minutes        | 2 minutes                   | 30 seconds                              |
|                         | Single-thread CPU | NVIDIA GPU                  | Core i7 Haswell, OpenMP, stencils       |
| Segmentation            | 20 minutes        | 4 minutes                   | 1 minute                                |
|                         | Single-thread CPU | Multithreaded CPU           | Core i7 Haswell, OpenMP, stencils       |
| Analysis                | 45 minutes        | 18 minutes                  | 5 minutes*                              |
|                         | Single-thread CPU | Multithreaded CPU           | Core i7 Haswell, OpenMP                 |
|                         |                   |                             | accuracy                                |
|                         | Workstation       | CPU, GPU,                   | FPGA, CPU                               |
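The improvement across the three columns is easier to compare as speedup factors. The following is a minimal sketch that converts the times in the table above to seconds and divides them. The names `STAGE_SECONDS`, `speedup`, and `total_seconds` are illustrative, and summing the stages assumes they run back to back, which the slide does not state.

```python
# Stage times copied from the table above, converted to seconds.
STAGE_SECONDS = {
    "CT image reconstruction": {"2010": 18 * 3600, "2013": 20 * 60, "2015": 6 * 60},
    "Denoising": {"2010": 5 * 60, "2013": 15, "2015": 3},
    "Registration": {"2010": 10 * 60, "2013": 2 * 60, "2015": 30},
    "Segmentation": {"2010": 20 * 60, "2013": 4 * 60, "2015": 60},
    "Analysis": {"2010": 45 * 60, "2013": 18 * 60, "2015": 5 * 60},
}


def speedup(stage, baseline="2010", target="2015"):
    """Return how many times faster `target` is than `baseline` for one stage."""
    times = STAGE_SECONDS[stage]
    return times[baseline] / times[target]


def total_seconds(year):
    """Sum all stage times for one year.

    Assumption: the stages execute sequentially with no overlap.
    """
    return sum(times[year] for times in STAGE_SECONDS.values())


if __name__ == "__main__":
    for stage in STAGE_SECONDS:
        print(f"{stage}: {speedup(stage):.0f}x faster in 2015 than in 2010")
    overall = total_seconds("2010") / total_seconds("2015")
    print(f"All stages, run back to back: {overall:.1f}x")
```

Under these assumptions the per-stage gains are uneven: reconstruction improves far more than analysis, so the slowest remaining stages dominate the 2015 total.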









## Extensive Efforts on Improving Datacenter Energy Efficiency

- Understand the scale-out workloads
  - ISCA'10, ASPLOS'12
  - Mismatch between workloads and processor designs
  - Modern processors are over-provisioned
- Trade-off of big-core vs. small-core
  - ISCA'10: Web search on small cores achieves better energy efficiency


Baidu taps Marvell for ARM storage server SoC



## Focus of Our Study

- Evaluation of different integration options of heterogeneous technologies in datacenters
- Efficient programming support for heterogeneous datacenters

















Normalized performance and energy efficiency (performance/W) relative to big-core solutions, on machine learning workloads:

| Configuration   | Performance  | Energy Efficiency |
|-----------------|--------------|-------------------|
| Big-Core+FPGA   | Best (2.5)   | Best (2.6)        |
| Small-Core+FPGA | Better (1.2) | Best (1.9)        |
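Because energy efficiency is reported as performance/W, the two normalized figures for each configuration also imply a relative power draw. The sketch below works that out. It assumes both figures are normalized to the same big-core baseline and that the rounded slide values are accurate enough to divide; `relative_power` and `CONFIGS` are illustrative names.

```python
def relative_power(norm_performance, norm_perf_per_watt):
    """Power relative to the baseline implied by two normalized metrics.

    With perf/W = performance / power, normalizing both sides to the same
    baseline gives: power_ratio = performance_ratio / (perf/W ratio).
    """
    return norm_performance / norm_perf_per_watt


# (normalized performance, normalized performance/W) from the table above.
CONFIGS = {
    "Big-Core+FPGA": (2.5, 2.6),
    "Small-Core+FPGA": (1.2, 1.9),
}

if __name__ == "__main__":
    for name, (perf, efficiency) in CONFIGS.items():
        power = relative_power(perf, efficiency)
        print(f"{name}: about {power:.2f}x the baseline power")
```

Read this way, the big-core configuration gains its efficiency mostly through higher performance at roughly the same power, while the small-core configuration gains it mostly through lower power.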




















































| Design       | Merlin Compiler | Initial OpenCL | Manually Optimized OpenCL |
|--------------|-----------------|----------------|---------------------------|
| Blackschole  | 0.34ms          | 11ms           | N/A                       |
| Denoise      | 0.08s           | 3.8s           | N/A                       |
| LogisticRegr | 94ms            | 3.7s           | 94ms                      |
| MatMult      | 0.8ms           | 1.9ms          | 0.8ms                     |
| NAMD         | 26ms            | 51ms           | 26ms                      |
| Normal       | 4ms             | 52ms           | 10ms                      |
| TwoNN        | 1.23s           | 1.70s          | N/A                       |
| Average      | 1x              | 21x            | 1.3x                      |
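The per-design gap between the three columns can be expressed as a ratio to the Merlin Compiler time. The following is a minimal sketch using the values from the table above, converted to milliseconds. The slide does not say how its Average row is aggregated, so the sketch computes only per-design ratios and does not try to reproduce that row; `TIMES_MS` and `ratio_to_merlin` are illustrative names.

```python
# (Merlin Compiler, initial OpenCL, manually optimized OpenCL) in milliseconds.
# None marks an entry listed as NA in the table.
TIMES_MS = {
    "Blackschole": (0.34, 11.0, None),
    "Denoise": (80.0, 3800.0, None),
    "LogisticRegr": (94.0, 3700.0, 94.0),
    "MatMult": (0.8, 1.9, 0.8),
    "NAMD": (26.0, 51.0, 26.0),
    "Normal": (4.0, 52.0, 10.0),
    "TwoNN": (1230.0, 1700.0, None),
}


def ratio_to_merlin(design, column):
    """Return the time of `column` divided by the Merlin time.

    `column` is "initial" or "manual"; returns None when the entry is NA.
    """
    merlin, initial, manual = TIMES_MS[design]
    value = {"initial": initial, "manual": manual}[column]
    return None if value is None else value / merlin


if __name__ == "__main__":
    for design in TIMES_MS:
        initial = ratio_to_merlin(design, "initial")
        manual = ratio_to_merlin(design, "manual")
        manual_text = "NA" if manual is None else f"{manual:.1f}x"
        print(f"{design}: initial {initial:.1f}x, manual {manual_text}")
```

The ratios vary widely by design: the initial OpenCL version is tens of times slower for some kernels and less than twice as slow for others.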







