For years, TU Darmstadt and RWTH Aachen University have successfully operated regional Tier-2 computers, and parts of both systems have long been open to academic researchers from all over Germany. The goal is to combine HPC applications, algorithms and methods with the efficient use of HPC hardware. This creates an infrastructure with which scientists can answer questions of central importance to the economy and society, whether in engineering and materials sciences or in engineering-oriented physics, chemistry or medicine.
The high-performance computing system CLAIX – Cluster Aix-la-Chapelle – at RWTH Aachen University is operated by the IT Center and currently consists of three parts: the Tier-2 part from the procurement phases 2016 and 2018 and the Tier-3 part from the procurement phase 2018.
CLAIX-2018 consists of over 1000 compute nodes, each with two Intel Xeon Skylake processors (24 cores per processor) and 192 GB of RAM. In addition, 48 compute nodes of identical architecture are each equipped with two NVIDIA Volta V100 GPUs (incl. NVLink) as accelerators and are available for special applications such as machine learning.
The Tier-3 part consists of over 220 compute nodes with an identical configuration (6 of them with GPUs) and is fully integrated into the overall cluster. CLAIX-2018 started test operation in November 2018, and since January 2019 the system has been available without restriction to computing time projects.
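As a rough illustration of the CLAIX-2018 figures above, the following Python sketch derives per-node and minimum aggregate core counts. It is only a back-of-the-envelope calculation: it assumes that the 192 GB of RAM applies to a whole node, and it treats "over 1000" and "over 220" as lower bounds, so the results are minimums rather than exact totals. All names in the code are hypothetical.

```python
# Per-node figures taken from the description above.
SOCKETS_PER_NODE = 2    # "2x Intel Xeon Skylake processors"
CORES_PER_SOCKET = 24   # "each with 24 cores"
RAM_GB_PER_NODE = 192   # assumption: the 192 GB applies to a whole node
GPUS_PER_GPU_NODE = 2   # "two NVIDIA Volta V100 GPUs"

# Node counts; "over 1000" and "over 220" are treated as lower bounds.
MIN_TIER2_NODES = 1000
MIN_TIER3_NODES = 220
TIER2_GPU_NODES = 48


def cores_per_node() -> int:
    """CPU cores in one dual-socket node."""
    return SOCKETS_PER_NODE * CORES_PER_SOCKET


def min_cores(min_nodes: int) -> int:
    """Lower bound on the CPU cores provided by at least `min_nodes` nodes."""
    return min_nodes * cores_per_node()


if __name__ == "__main__":
    print(f"Cores per node:           {cores_per_node()}")
    print(f"Tier-2 CPU cores (min):   {min_cores(MIN_TIER2_NODES)}")
    print(f"Tier-3 CPU cores (min):   {min_cores(MIN_TIER3_NODES)}")
    print(f"Tier-2 V100 GPUs:         {TIER2_GPU_NODES * GPUS_PER_GPU_NODE}")
```

Because the node counts are only given as "over" a threshold, the actual totals are higher than the values this sketch prints.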
Lichtenberg II at TU Darmstadt
With over 56,000 cores and over 257 TB of main memory in total, the first phase of TU Darmstadt’s Lichtenberg II system has a peak performance of over 3.148 PFLOPS. In June 2020, the cluster was ranked #92 on the TOP500 list of the world’s fastest HPC systems, making it the fastest university cluster in Germany at that time. On the Green500 list of the most energy-efficient systems, it was ranked #48, which made it the most energy-efficient German university cluster. The operational concept uses direct hot-water cooling and reuses waste heat during the heating period (see EnEff:Stadt Campus Lichtwiese).
Lichtenberg II has an MPI section with dual-socket nodes, an MEM section with big-memory nodes in excess of 1 TB of memory, an ACC section with powerful accelerator nodes hosting at least two accelerators, and a test section. An additional ACC section containing Nvidia “DGX A100” nodes to support modern research in Artificial Intelligence is currently in preparation. The whole cluster is equipped with measurement equipment for improving energy efficiency of HPC applications.
The storage and compute systems are procured separately to ensure the best possible continuous storage service. As a first step, a Spectrum Scale system (formerly known as GPFS) with 3 PB of capacity was procured in 2018. The current first compute phase, P1, of Lichtenberg II consists of 586 compute nodes with two Intel Cascade Lake AP processors (96 cores SMP) and 384 GB of main memory each, two big-memory nodes with 1.5 TB of main memory each, and 8 accelerator nodes with two Nvidia “V100” GPUs each. All nodes and the file systems are interconnected via EDR and HDR100 InfiniBand.
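The core count quoted for Lichtenberg II can be cross-checked against the node figures of phase P1. The Python sketch below is a minimal illustration, not an official accounting: it assumes that "96 cores SMP" means 96 cores per node (two processors combined), and it covers only the 586 standard compute nodes, since the text does not state core counts for the big-memory and accelerator nodes. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """A homogeneous group of cluster nodes (hypothetical helper type)."""
    name: str
    nodes: int
    cores_per_node: int


# Figures from the text; 96 cores per node is an assumed reading
# of "96 cores SMP" (two processors per node combined).
P1_COMPUTE = NodeGroup(name="P1 compute nodes", nodes=586, cores_per_node=96)


def total_cores(group: NodeGroup) -> int:
    """Total CPU cores provided by all nodes in the group."""
    return group.nodes * group.cores_per_node


if __name__ == "__main__":
    cores = total_cores(P1_COMPUTE)
    print(f"{P1_COMPUTE.name}: {cores} cores")
    # The text quotes "over 56,000 cores in total".
    print(f"Exceeds 56,000 cores: {cores > 56_000}")
```

Under these assumptions, the standard compute nodes alone provide 586 × 96 = 56,256 cores, which is consistent with the "over 56,000 cores" stated for the first phase.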
The Lichtenberg installations reflect the multi-use concept of TU Darmstadt, which aims to procure each phase as homogeneously as possible, e.g. with comparable CPU, memory and network architectures, including for special-purpose nodes. The goal is to reduce the hardware peculiarities a user has to keep track of, and thus the number of mistakes at the batch level. Special-purpose sections can be used for standard workloads when they are momentarily underutilized. Beyond convenience and flexibility, this principle provides economic savings with respect to rack space, reduced idle time, and system