At the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART) last month, a research group led by Martin Schulz of the Leibniz Supercomputing Centre (Munich) presented a position paper arguing that the architecture landscape of high-performance computing (HPC) is undergoing a seismic shift.
“Future architectures,” they claim, “will have to provide a number of specialized architectures that enable a wide range of workloads, all under a strict energy cap. These architectures must be integrated into each node – as already in mobile and embedded systems – to avoid data movement between nodes or, even worse, between system modules when switching between accelerator types.”
That HPC is changing can hardly be denied. The authors (Martin Schulz, Dieter Kranzlmüller, Laura Brandon Schulz, Carsten Trinitis, and Josef Weidendorfer) acknowledge many well-known constraints (the end of Dennard scaling, the slowing of Moore’s law, etc.) and set out four guiding principles for the future of HPC architecture:
- Energy consumption is no longer just a cost factor, but also a severe feasibility limit for systems.
- Specialization is the key to further increasing performance despite stagnating frequencies and within limited energy bands.
- A significant portion of the energy budget is devoted to moving data, and future architectures must be designed to minimize such data movement.
- Large data centers must provide optimal computing resources for increasingly differentiated workloads.
Their paper, On the Inevitability of Integrated HPC Systems and How They Will Change HPC System Operations, addresses each of the four areas. They find that integrated heterogeneous systems (an interesting phrase) “are a promising alternative that integrate several specialized architectures on a single node, while the overall system architecture remains a homogeneous collection of mostly identical nodes. This allows applications to switch between accelerator modules quickly and with fine granularity, while energy cost and performance overhead are minimized, which enables truly heterogeneous applications.”
An essential part of achieving this integrated heterogeneity is the use of chiplets.
“Simple integrated systems with one or two specialized processing elements (e.g., with GPUs or with GPUs and tensor units) are already used in many systems. Research projects such as ExaNoDe are currently investigating integration with promising results. Several commercial chip manufacturers are also expected to go in this direction,” the researchers write. “Currently, and most prominently, the European Processor Initiative (EPI) is investigating an adaptable chip design that combines ARM cores with various accelerator modules (Figure 1). In addition, several groups are experimenting with clusters of GPUs and FPGAs within nodes, either for alternative workloads that target the respective architecture or to solve large parallel problems with algorithms that are mapped to both architectures. Future systems will likely drive this even further, aiming for closer integration and a greater variety of architectures, resulting in systems with more heterogeneity and flexibility in their use.”
This integrated approach is not without its challenges, the researchers agree: “[W]hile it is easy to run a single application across the system – since the same type of node is available everywhere – a single application is unlikely to use all of the specialized computing elements at the same time, resulting in idle processing elements. The choice of the most suitable accelerator mix is therefore an important design criterion in procurement, which can only be achieved through co-design between the data center and its users on the one hand and the system provider on the other. In addition, it is important to schedule and power the respective computing resources dynamically at runtime. With power overprovisioning, i.e., planning for a TDP and maximum node power that is reached with a subset of dynamically selected accelerated processing elements, this can be easily achieved, but requires new software approaches in system and resource management.”
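The power overprovisioning idea in the quote above can be illustrated with a minimal Python sketch. It is not taken from the paper: the element names, wattages, node cap, and the `select_elements` helper are all hypothetical, chosen only to show a node whose installed elements exceed the power cap in total, so that only a dynamically chosen subset can run at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A processing element on an integrated node (hypothetical values)."""
    name: str
    peak_watts: float


def select_elements(requested, installed, node_cap_watts):
    """Power on requested elements in priority order without exceeding the cap.

    Returns the list of activated element names and the watts they reserve.
    Elements that would push the node over its cap are skipped.
    """
    by_name = {e.name: e for e in installed}
    active, used = [], 0.0
    for name in requested:
        element = by_name[name]
        if used + element.peak_watts <= node_cap_watts:
            active.append(name)
            used += element.peak_watts
    return active, used


# Assumed node: 750 W of installed elements under a 500 W cap (overprovisioned).
INSTALLED = [
    Element("cpu", 150.0),
    Element("gpu", 300.0),
    Element("tensor", 200.0),
    Element("fpga", 100.0),
]
NODE_CAP_WATTS = 500.0

if __name__ == "__main__":
    # A GPU-heavy job: the FPGA does not fit once CPU and GPU are powered.
    print(select_elements(["cpu", "gpu", "fpga"], INSTALLED, NODE_CAP_WATTS))
    # A different mix fits entirely within the same cap.
    print(select_elements(["cpu", "tensor", "fpga"], INSTALLED, NODE_CAP_WATTS))
```

A real resource manager would also have to handle priorities, sharing between jobs, and actual power measurement; the greedy selection here only shows why the accelerator mix and the cap have to be planned together.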
They point to the need for programming environments and abstractions that can take advantage of the various on-node accelerators. “For widespread use, such support must be readily available and ideally uniform in a programming environment. OpenMP fits this with its architecture-independent target concept. Domain-specific frameworks, such as are common in AI, ML or HPDA (e.g., TensorFlow, PyTorch or Spark), will hide this heterogeneity further and make integrated platforms accessible to a wide range of users.”
To cope with the diversity of devices within nodes and the inevitable idle time between different devices, the researchers propose developing “a new level of adaptivity in connection with dynamic scheduling of computational and energy resources in order to fully utilize an integrated system”. At the core of this adaptive management approach, they propose, is a feedback loop, as shown in Figure 2.
This adaptive approach is being investigated as part of the EU research project REGALE, which started in spring. REGALE gathers measurements across all system layers and uses them to control the adaptivity of the entire stack:
- Application level. Dynamically change application resources in terms of the number and type of processing elements.
- Node level. Changing node settings, e.g., power/energy consumption using techniques such as DVFS or power capping, as well as partitioning memory, caches, etc. at the node level.
- System level. Adjustment of system operation based on workloads or external inputs, e.g., energy prices or supply levels.
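To make the feedback loop and its three levels more concrete, here is a minimal Python sketch of one control step. It is an illustration under stated assumptions, not REGALE's design: the function names, the 0.75 scaling factor, the 50 W step, and the state dictionary are all hypothetical, and the measured power is simply passed in rather than read from hardware.

```python
def system_budget(base_watts, energy_price, price_threshold, expensive_scale=0.75):
    """System level: shrink the per-node budget when energy is expensive."""
    if energy_price > price_threshold:
        return base_watts * expensive_scale
    return base_watts


def node_power_cap(current_cap, measured_watts, budget_watts, step=50.0):
    """Node level: move the power cap one step toward the budget."""
    if measured_watts > budget_watts:
        return max(current_cap - step, 0.0)
    if measured_watts < budget_watts - step:
        return min(current_cap + step, budget_watts)
    return current_cap


def app_elements(cap_watts, watts_per_element, max_elements):
    """Application level: use as many processing elements as the cap allows."""
    return max(1, min(max_elements, int(cap_watts // watts_per_element)))


def control_step(state, measured_watts, energy_price):
    """One pass of the loop: measure, decide at each level, return the new state."""
    budget = system_budget(state["base_watts"], energy_price, state["price_threshold"])
    cap = node_power_cap(state["cap"], measured_watts, budget)
    elements = app_elements(cap, state["watts_per_element"], state["max_elements"])
    return {**state, "cap": cap, "elements": elements}


# Assumed starting point for a single node (all values hypothetical).
INITIAL_STATE = {
    "base_watts": 500.0,
    "price_threshold": 0.30,
    "cap": 500.0,
    "watts_per_element": 100.0,
    "max_elements": 4,
    "elements": 4,
}

if __name__ == "__main__":
    state = INITIAL_STATE
    for _ in range(5):
        # Simplification: assume the node draws exactly its current cap.
        state = control_step(state, measured_watts=state["cap"], energy_price=0.40)
        print(state["cap"], state["elements"])
```

With the assumed high energy price, the budget drops to 375 W, the cap steps down from 500 W until the measured draw falls within the budget, and the application shrinks from four to three processing elements. In practice the node-level decision would be applied through mechanisms such as DVFS or power capping, which this sketch does not touch.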
The position paper is a quick read and is best read directly. Although the level and type of integration can vary, the researchers argue in their conclusion that such integration must be done on a node or even on a chip in order to: minimize and shorten expensive data transfers; allow fine-grained shifting between different processing elements running within a node; and enable applications to use the entire machine for scale-out experiments instead of just individual modules or sub-clusters of a particular technology.
Only such an approach, they claim, will enable the development and deployment of large computing resources that can provide a diversified computing portfolio on a large scale and with optimal energy efficiency. We will see.
Link to the paper: https://dl.acm.org/doi/10.1145/3468044.3468046