A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors

This post is part of my effort to document the research papers I have been reviewing related to Computer Architecture. Read more about this here.

Introduction

Non-monotonic cores are optimized for different level of ISA – same ISA but not easy to categorize performance broadly. Different phases of programs work best on different cores.

Problem:

  • Which cores to use in a design?
  • How to steer workload from core to core depending on workload type ?

Proposed answers:

  • On design time select the types of non-monotonic cores to implement. Use an ‘average core’ and rest of the cores are optimized for relieving particular bottlenecks.
  • On runtime, diagnose for bottlenecks on current core and move to a better core if needed.

Definitions

    • A multi-core architecture with multiple core types, which are functionally-equivalent but microarchitecturally diverse, is called a single-ISA heterogeneous chip multiprocessor (HCMP).
    • monotonic core types: they could be clearly ranked from high-performance/high-power to low-performance/low-power
    • non-monotonic core types: each core type is performance-optimized to different instruction-level behavior and hence cannot be ranked.
    • The performance metric is billions-of-instructions-per second (BIPS).
    • For power efficiency of steering algorithm, use BIPS3 /watt (voltage-independent metric).
    • Algorithm levels example
      • Optimal steering algorithm: The best core for the current 10K segment is the core used for running the next 10K segment of the application.
      • Oracle steering algorithm: For every 10K segment, the best core is used for running the application (this is not implementable for run time).

Types of Core Differentiation

These differences also correspond to the kinds of bottlenecks imposed by different cores.

  • Structure sizes (issue queue, load and store queues, physical register file / reorder buffer, caches, and predictors)
  • Pipeline stage widths
  • Clock Frequency

More Instruction Level Parallelism (ILP) requires circuitry results in lower clock frequency – basic tradeoff.

Cores selection

Requires design space exploration – different algorithms can be employed for search (genetic algo used in this work).

  1. Generating core design space (how to populate the design space to be explored): use FabScalar work – which provided capability for core generation, given various parameters (issue width, pipeline stages and types and so on). Moreover, achievable clock frequency, throughput and power estimation is also performed using the utilities provided in FabScalar toolset.
  2. Selecting ranges of parameters to be supplied to FabScalar. This effectively populates the design space with over 14000 points.
  3. Run genetic algo to search over the design space – not clear about the parameters specified for the genetic algo.
    1. Run 39 SimPoint phases benchmark on each generated core to characterize core performance.
    2. Take a set of these characterized cores (say 4)
    3. Take the best performance value for each phase amongst the 4 selected cores (this assumes perfect workload steering).
    4. Harmonic mean of these best performance values is the performance value of this set.

Notes on kinds of cores selected:

  • Average Core
  • LW = Large Window and wide pipeline
  • N = Narrow
  • L = Large Window
  • W = Wide Pipeline

Combinations of these core types, with their specific parameters are provided in the paper. Constraining Power gave similar results because frequency imposes similar constraints as power (less complex circuitry), hence power is already considered, by extension, in design space exploration.

Application Steering

The following algorithm and implementation method is proposed:

  • Performance counter are used to monitor bottleneck hits.
  • After an interval, the counters are checked if a bottleneck has passed a threshold.
  • An already saved bottleneck profile (positive and negative) of each available core is accessible.
  • Move to another core if the bottleneck is solved in that core. Make sure no other bottleneck will be introduced by the target core.

TABLE VI provides a list of counters used (very relevant). The method for qualitatively deriving bottleneck profile of a core is also given. It is pretty straightforward, linking the counter metrics to the core features which affect the counters. Moreover, consideration for OS interaction are provided; OS can keep scheduling control of some critical portion and enable application steering for the remaining threads.

TL;DR

Single-ISA heterogeneous chip multiprocessor (HCMP) implement the same ISA using different microarchitectures, each with its strengths and weaknesses for different scenarios. An extensive design space exploration has been performed to select cores types, given a set of benchmarks modelling various phases in application execution. Interestingly, one “average core”, which marks a general purpose processor, was always selected as one of the core types during design space exploration. Upon selection of the heterogenous cores comprising the microprocessor, a steering algorithm is proposed which used to steer workload to the best core at runtime, attempting to relieve bottlenecks, as measured by performance counters, presented by a particular core. The core type selection and steering algorithms is shown to give good results in power constrained settings as well.

References

Sandeep Navada, Niket K. Choudhary, Salil V. Wadhavkar, and Eric Rotenberg. 2013. A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors. In Proceedings of the 22nd international conference on Parallel architectures and compilation techniques (PACT ’13). IEEE Press, Piscataway, NJ, USA, 133-144.

Link: https://dl.acm.org/citation.cfm?id=2523743

One comment

Leave a Reply

Your email address will not be published. Required fields are marked *