Slurm User Group '24

11-13 September 2024

SLUG '24. The Slurm event of the year.

Slurm User Group 2024 | September 11 - 13 | University of Oslo

A welcome reception will take place the evening of Wednesday, September 11th on the top floor of Oslo Science Park (Forskningsparken). Address: Gaustadalléen 21. There will be someone in the venue who can point you in the right direction.

Conference presentations will take place on September 12th and 13th in Auditorium 3 of Helga Engs Hus ("Helga Eng's House"). Address: Sem Sælands vei 7. Lunch each day will be in the canteen of that same building.

A dinner will be hosted on Thursday night at Olympen. The restaurant is off campus. Address: Grønlandsleiret 15, 0190 Oslo, Norway.

A map of the essential locations for SLUG (conference hall, dinner, etc.) can be found here.

We look forward to seeing you all soon!

Conference Agenda

WEDNESDAY, SEPTEMBER 11

18:00 – 20:00

Welcome Reception

Top floor: Oslo Science Park (Forskningsparken)

Address: Gaustadalléen 21.

THURSDAY, SEPTEMBER 12

University of Oslo

Helga Engs Hus - Auditorium 3

Address: Sem Sælands vei 7

9:00-10:00

Keynote

University of Oslo

10:00-10:30

Break

10:30-11:00

Site Report and Future Feature Discussion

Oak Ridge National Laboratory

11:00-11:30

Bringing in Robust, Memory-Driven Affinity to Slurm

Lawrence Livermore National Laboratory

11:30-12:00

Step Management

SchedMD

12:00-13:00

Lunch

Canteen in Helga Engs Hus

13:00-13:30

Site Report: Slurm at Jump Trading

Jump

13:30 – 14:00

The Evolution of Slurm at CSCS: From Monolithic Service to Multi-Tenant vService

CSCS

14:00-14:30

Slinky - Overview + Operator

SchedMD

14:30-15:00

Break

15:00-15:30

No-Touch Administration: Managing Slurm at Scale

ETH Zurich

15:30-16:00

TrailblazingTurtle: A Comprehensive Web Portal for Maximizing HPC Resource Utilization

Université Laval

16:00-17:00

Field Notes

SchedMD

18:30

Dinner

Olympen

Grønlandsleiret 15, 0190 Oslo, Norway

FRIDAY, SEPTEMBER 13

University of Oslo

Helga Engs Hus - Auditorium 3

Address: Sem Sælands vei 7

9:00-9:30

Leveraging Slurm and Cloud Architectures for Efficient Scientific Computing

Sandia National Laboratories

9:30-10:00

Gaining More Control Over Node Scheduling with the Topology/Block Plugin

NVIDIA

10:00-10:30

Break

10:30-11:00

Improving Job Throughput in HPC with Adaptive Time Limit Management

University of Basel

11:00-11:30

Site Report: Slurm on SuperMUC-NG at LRZ

Leibniz Supercomputing Centre (LRZ)

11:30-12:00

Slinky - Scheduling Plugin

SchedMD

12:00-13:00

Lunch

Canteen in Helga Engs Hus

13:00-13:30

Slurm Wiki and Tools from DTU

Technical University of Denmark

13:30-14:00

Maximizing HPC Efficiency for Ansys Simulations: Addressing Critical IT Concerns with Slurm

Ansys

14:00-14:30

Magic Castle: Canadian HPC as a Service

Digital Research Alliance of Canada

14:30-15:00

Break

15:00-16:00

Road Map + Open Forum

SchedMD

Getting Around

Attendees can fly into the Oslo Airport (OSL). This is the largest airport in Norway and great for international travelers.

From the airport, we recommend taking a train to Oslo Central Station, and then the subway ("T-bane") to where you want to go in Oslo. The closest subway station to the lecture hall is Blindern.

There is an airport train service to Oslo Central Station. Departures are usually every 10 minutes, and the ride takes about 20 minutes. The regular train service is cheaper and just as fast, but only has 2-3 departures per hour.

There are several airport bus lines going to different parts of Oslo.

Most places in Oslo are reasonably close to a subway station, including the conference venue, welcome reception and conference dinner restaurant. In addition to the subway, there are tram ("trikk") and bus lines. For schedules, prices and how to buy tickets, see here.

--------------------

Accommodations

Below are some accommodation recommendations. All hotels are located downtown. Via subway, attendees can travel to the conference location, on campus, in about 10 minutes.

Citybox Oslo
Very central and close to subway. Low range prices.

Thon Hotel Europa
Downtown, close to the Royal Castle and the subway. Mid range prices.

Scandic Holberg
Downtown, close to the Royal Castle and the subway. Mid range prices.

SLUG '24 Abstracts

Maximizing HPC Efficiency for Ansys Simulations: Addressing Critical IT Concerns with Slurm Resource Management and Scheduling

Ansys

Running Ansys simulation workloads in a High-Performance Computing (HPC) environment with Slurm effectively addresses three key HPC IT concerns: reducing design cycle times, adapting to new architectures, and better utilizing existing HPC infrastructure. Slurm's robust resource allocation and job scheduling capabilities ensure efficient utilization of compute resources, reducing wait times and accelerating design cycles. By supporting heterogeneous systems and optimizing resource usage across nodes, CPUs, and GPUs, Slurm facilitates the adaptation to new architectures and accelerators, which is crucial as traditional hardware improvements plateau. Additionally, Slurm's accounting database (SlurmDBD) provides observability insights that help improve utilization of existing HPC clusters, partitions, and nodes. Slurm combined with Ansys's highly scalable HPC solutions enables faster, more accurate engineering analyses across disciplines such as structural mechanics, fluid dynamics, and thermal and electromagnetic analysis, accelerating customer success.

The Evolution of Slurm at CSCS: From Monolithic Service to Multi-Tenant vService

CSCS Swiss National Supercomputing Centre

The Swiss National Supercomputing Centre (CSCS) is commissioning its future geographically distributed flagship computing and storage infrastructure, codenamed Alps. This infrastructure is based on an HPE Cray EX system featuring the latest NVIDIA GH200 Grace-Hopper architecture.

CSCS is leveraging Cray System Management (CSM), which was originally intended for use within a single organization, to build a multi-tenancy software-defined infrastructure. This infrastructure is designed to serve multiple partners with varying scopes and requirements through versatile software-defined clusters (vClusters).

Slurm plays a crucial role in this organization. Initially provided as a single-cluster environment, Slurm was adapted to run as a multi-tenant service for each vCluster and later evolved using the vService concept.

vServices are designed to perform a specific task or closely related functionality within a vCluster. They are independently deployable and loosely coupled, allowing for independent development, updating, deployment, and maintenance.

The deployment of Slurm via vServices follows the original Cray organization, adopting a hybrid model. In this model, the control daemons are Kubernetes pods that manage bare metal nodes running slurmd. To enhance flexibility and decouple machine-oriented services on the CSM management plane from user-dedicated services, the Slurm control daemons of the vService can operate on any Kubernetes cluster.

By adopting a modern cloud-oriented micro-service approach, CSCS is now equipped with enhanced flexibility to deliver first-class High-Performance Computing (HPC) to a variety of users and partners.

Magic Castle: Canadian HPC as a Service

Digital Research Alliance of Canada

The Digital Research Alliance of Canada conducts over 150 workshops annually, aimed at equipping researchers from various disciplines with essential numerical skills and High-Performance Computing (HPC) techniques utilizing Slurm. To alleviate the resource strain on its production cluster, the Alliance has developed Magic Castle, an open source project based on Terraform that replicates the infrastructure and the software environment of an Alliance cluster in the cloud.

In this presentation, we will provide an overview of Magic Castle's design, emphasizing its capacity to seamlessly support all major commercial and community cloud providers through a unified user interface. Moreover, we will delve into how this design facilitated the implementation of an autoscaling logic that is entirely provider agnostic.

Following the design overview, we will delve deeper into the dynamic cluster configuration and how Magic Castle supports dynamic Multi-Instance GPUs (MIGs).

Finally, we will conclude this presentation by covering some emerging production use cases of Magic Castle, including the implementation of a self-service HPC cluster creation platform and the sequencing of cancer patients’ genomes in a secure cloud environment.

https://github.com/ComputeCanada/magic_castle

No-Touch Administration: Managing Slurm at Scale

ETH Zurich

As the operators of the central HPC cluster of a large research university, we experience a high turnover of users, shareholders, and hardware. To automatically manage the constantly changing user base on an evolving heterogeneous hardware platform, we combine an event- and database-driven configuration of users and accounts with a pure GitOps deployment of Slurm clusters.

To effectively serve over 200 diverse shareholders totalling over 4000 active users, we use a customer database to map people to Slurm accounts based on their affiliation and the types of nodes the shareholders have bought. Without filling out any forms or any action from us, anyone at the university can log in and their Slurm account will be created automatically on the fly with the correct Slurm account associations, so they can start computing without delay.
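As a rough illustration of the affiliation-to-account mapping described above, here is a minimal sketch. It is not ETH Zurich's implementation: the affiliation table, the `public` fallback account, and both function names are hypothetical, and the sketch only builds a `sacctmgr add user` command line rather than running it.

```python
# Hypothetical sketch: map a user's affiliation to a Slurm account and
# build the sacctmgr command that would create the association.
# The table contents and the fallback account are invented for illustration.

def account_for(affiliation, shareholder_accounts, default_account="public"):
    """Return the Slurm account for an affiliation, or a fallback account."""
    return shareholder_accounts.get(affiliation, default_account)


def sacctmgr_add_user(username, account):
    """Build (but do not run) the command that adds the user to the account."""
    return ["sacctmgr", "--immediate", "add", "user",
            f"name={username}", f"account={account}"]


# Example customer database, reduced to a plain dictionary.
shareholders = {"physics": "es_physics", "biology": "es_biology"}

command = sacctmgr_add_user("alice", account_for("physics", shareholders))
print(" ".join(command))
```

In a real deployment the lookup would presumably be triggered by a login or database event and the command executed by a privileged service; those parts are left out here.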

Our pure GitOps deployments of Slurm are defined by Helm charts and cluster-specific values that define the Slurm configuration. Aside from most production compute nodes, everything runs in containers on Kubernetes. This approach allows us to not only automatically test and stage changes to code and configuration, but also to create new or experimental clusters within minutes, either with containerized or real compute nodes, and all the user and account information present.

We will present our approach to user and account configuration, along with GitOps deployment, that allows us to serve our shareholders with no daily administration.

Slurm at Jump Trading

Jump

Jump Trading is a proprietary trading firm committed to world class research. Jump's HPC infrastructure hosts a large diversity of workloads that vary from high-throughput short jobs to long-lasting jobs, and from single core tasks to large parallel applications, all running on multiple clusters containing nodes with widely varying hardware configurations and capabilities.

A versatile scheduler capable of handling this heterogeneity in a uniform and optimal manner is a crucial prerequisite for users to leverage the capabilities and capacities of the hardware infrastructure.

This talk will provide an overview of Jump's HPC environment and Slurm migration, followed by a discussion of the challenges, optimisations, and enhancements surrounding Slurm's high-throughput capabilities. Jump and SchedMD are actively collaborating to close the gap between basic and fully-optimised high-throughput scheduling and highly sophisticated all-features-on scheduling under high-throughput constraints.

Bringing in Robust, Memory-Driven Affinity to Slurm

Lawrence Livermore National Laboratory

Computer systems are becoming increasingly complex and may include many-core technologies, hybrid machines with throughput-optimized cores and latency-optimized cores, and multiple levels of memory. The complexity involved in extracting the performance benefits these systems offer greatly challenges the productivity of computational scientists. A significant part of this challenge involves mapping parallel applications efficiently to the underlying hardware. A poor mapping may result in dramatic performance loss. Furthermore, an application mapping is often machine-dependent, breaking portability to favor performance.

In this talk, I will present mpibind and its Slurm SPANK plugin. mpibind is a memory-driven algorithm that maps parallel hybrid applications to the underlying architecture transparently from the point of view of applications. This library offers a simple interface for computational scientists and produces a full mapping of MPI tasks, threads, and GPU kernels to hardware processing units and memory domains. Scientists do not have to deal with intricate details of the hardware topology, which increases their productivity. With the SPANK plugin, using mpibind on Slurm-managed systems is straightforward and bridges the gap between performance and ease of use across computer architectures.

Site Report: Slurm on SuperMUC-NG at LRZ

Leibniz Supercomputing Centre (LRZ)

The Leibniz Supercomputing Centre (Leibniz-Rechenzentrum, LRZ) of the Bavarian Academy of Sciences and Humanities is the IT service provider for all Munich universities as well as a growing number of research organizations throughout Bavaria. In addition to this regional focus, LRZ also plays an important role as one of the members of the Gauss Centre for Supercomputing (GCS). To deliver top-tier HPC services on the national and European level, LRZ operates a tier-0 supercomputer, Linux clusters and virtual machines for a wide variety of scientists’ requirements, including High Performance Computing, Cloud Computing, and Big Data.

LRZ has been operating world-class supercomputers for decades. The current supercomputer, SuperMUC-NG, ranked no. 8 among the most powerful computers in the world when it was installed at the end of 2018 (today no. 50). It offers a peak performance of 26.9 Petaflops, 719 Terabytes of main memory, 50 Petabytes of external data storage, a high-speed interconnect (Intel Omni-Path), and an innovative warm-water cooling system that makes it one of the most energy-efficient supercomputers worldwide. It provides first-class information technology to its users, researchers in fields such as physics, chemistry, life sciences, geography, climate research, and engineering. Users can access the 6480 compute nodes of SuperMUC-NG, with more than 311,000 cores in total, via the Slurm batch system. LRZ configured several partitions with different limits on size and run time to ensure reasonable throughput and usage of the resources. The two Slurm controller servers are configured in a high-availability setup to ensure constant and fail-safe usability of the system. To control and optimize the energy consumption of SuperMUC-NG, the EAR (Energy Aware Runtime) plugin is used.

Gaining more control over node scheduling with the Topology/Block Plugin

NVIDIA

As multi-node applications evolve in AI and HPC, achieving peak performance requires physical alignment of task placement on the underlying architecture. In addition, new hardware architectures such as the NVIDIA GB200NVL system require more controls over task placement to ensure tasks are placed within the NVLink domains as expected by the application.

The new topology/block plugin, introduced in Slurm 23.11 and extended in Slurm 24.05, provides a new method for node selection by Slurm. The new plugin uses the configured topology as a job scheduling requirement, instead of just a suggestion.
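To make the difference between a requirement and a suggestion concrete, here is a toy sketch. It is not the topology/block plugin's algorithm, which is considerably more involved; the block layout, node names, and the function `pick_nodes_in_one_block` are all hypothetical. The sketch only shows that, once placement inside one block is mandatory, a job can be refused even when enough idle nodes exist cluster-wide.

```python
# Toy illustration only: treat "all nodes in one block" as a hard
# requirement. Block names and node names are invented for the example.

def pick_nodes_in_one_block(blocks, idle_nodes, nodes_needed):
    """Return (block_name, nodes) using idle nodes from a single block,
    or None if no single block has enough idle nodes."""
    for name, members in blocks.items():
        candidates = [node for node in members if node in idle_nodes]
        if len(candidates) >= nodes_needed:
            return name, candidates[:nodes_needed]
    return None


blocks = {
    "block1": ["n1", "n2", "n3", "n4"],
    "block2": ["n5", "n6", "n7", "n8"],
}
idle = {"n1", "n2", "n5", "n6", "n7"}

three_node_job = pick_nodes_in_one_block(blocks, idle, 3)
four_node_job = pick_nodes_in_one_block(blocks, idle, 4)
print(three_node_job)
print(four_node_job)
```

The four-node request is refused although five nodes are idle in total, which is the kind of utilization cost that comes with strict placement.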

Our talk will discuss the benefits of using the topology/block plugin for scheduling with the NVIDIA GB200NVL architecture and existing large-scale rail-optimized InfiniBand-based clusters. In addition, we will review some of the drawbacks of using this method including impacts on system utilization and challenges when nodes are unavailable.

ORNL Site Report and Future Feature Discussion

Oak Ridge National Laboratory

ORNL will give a short presentation on the current status of Slurm running at ORNL. This presentation will be followed by a discussion about some of the features that ORNL has funded. We will delve into the reasoning behind some of the specific requests, like the external step manager and federation improvements, and how they will solve specific scientific use cases to enable a better user experience at ORNL.

Leveraging Slurm and Cloud Architectures for Efficient Scientific Computing

Sandia National Laboratories

This presentation explores the concept of HPC as a Service and how it can be enhanced through the integration of Slurm scheduling and cloud architectures. This work by Sandia National Laboratories (SNL) will describe the integration of Slurm with cloud services via the Slurm RestAPI. This integration allows for the seamless extension of HPC clusters to cloud-native services, enabling sophisticated event-driven architectures that are able to leverage Slurm scheduling and resource management. We will explore how this integration facilitates the adoption of cloud event-driven application architectures, enabling scientific applications to respond to real-time data and events.

Slurm Wiki and tools from DTU

Technical University of Denmark

Based on years of experience at the DTU "Niflheim" site, we have written extensive Slurm Wiki pages [1] documenting our setup. The Wiki supplements the official Slurm documentation and guides, and includes detailed information for RHEL-based systems and links to the documentation and bugs database.

Highlights include the RPM build process, the Slurm upgrade process, an overview of configurations including health checks, power saving and monitoring, plugins, and network setup. We also describe database setup and backup, scheduler configuration, accounting, and administration of user limits.

Monitoring the status of partitions, compute nodes, jobs and user limits may be simplified using additional user-friendly tools. We present several locally developed tools [2], among others "pestat" which displays a simple node and job state summary, and "notifybadjob" which sends notifications to users about badly behaving jobs.

[1] https://wiki.fysik.dtu.dk/Niflheim_system/SLURM/

[2] https://github.com/OleHolmNielsen/Slurm_tools

TrailblazingTurtle: A Comprehensive Web Portal for Maximizing HPC Resource Utilization

Université Laval

Effective utilization of high-performance computing (HPC) resources is crucial for maximizing cluster productivity. However, new users often struggle with inefficient job submissions, wasting valuable computational resources. To address this issue, we have developed TrailblazingTurtle, a comprehensive web portal and monitoring system for HPC clusters using Slurm. Built on open-source technologies such as Django and Prometheus, our solution provides real-time insights into the performance, resource utilization, and environmental impact of each job on the Slurm cluster.

The portal shows metrics collected by slurm-job-exporter and other sources of data and works on shared nodes, and even on shared GPUs using Multi-Instance GPUs (MIGs). Other metrics such as power consumption, Lustre IO, Infiniband metrics and the binaries running in the job are also collected and made available. The utilization metrics of each job are also used to generate automated feedback directly available to users to help them identify and address potential performance issues.

The portal also offers a user-friendly alternative to squeue and sacct to view, filter and search jobs submitted by a user. It reads directly from the MySQL database, which keeps load off the scheduler while still providing real-time data.

For support personnel, we have designed multiple global views, including the top users currently running on the clusters, with their allocated and actual usage, enabling analysts to focus their help where it is needed the most. Node-level views featuring Gantt charts and fault records from Slurm enable swift detection of performance issues if a problem is specific to a node.

By empowering both users and support staff with real-time insights into HPC cluster activity, TrailblazingTurtle aims to significantly improve job submission efficiency, reduce waste, and enhance the overall productivity of HPC resources.

Improving Job Throughput in HPC with Adaptive Time Limit Management

University of Basel

In High Performance Computing (HPC), it is important to maximize job throughput and productive resource utilization, and to minimize wasted resources due to job timeouts. Our goal is to increase the accuracy of time limits and prevent interrupting checkpoints or near-completion jobs due to timeouts. In this work we introduce a proof-of-concept for a feedback loop between the Slurm scheduler and iterative applications to dynamically adjust job time limits of running jobs.

The core concept involves iterative applications reporting their progress, through various metrics such as timesteps, execution or simulated time, or convergence metrics. A dedicated daemon ingests this progress data and estimates the application's completion time, and/or whether it is currently performing a checkpoint. Based on this estimation, the daemon modifies the job time limit in Slurm to align with the application's actual progress and estimated completion time.
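The feedback loop can be sketched roughly as follows. This is not the authors' daemon: the linear extrapolation from completed timesteps, the safety factor, and all function names are assumptions made for illustration. The only real interface referenced is `scontrol update JobId=... TimeLimit=...` (time limit in minutes), and the sketch builds that command without running it. On most sites, raising a running job's time limit requires operator or administrator privileges.

```python
import math

# Hypothetical sketch of progress-based time limit adjustment.
# Assumption: progress is reported as completed timesteps, and the
# remaining steps cost the same on average as the completed ones.

def estimate_remaining_seconds(steps_done, steps_total, elapsed_seconds):
    """Linearly extrapolate the remaining run time from progress so far."""
    if steps_done <= 0:
        raise ValueError("need at least one completed step to extrapolate")
    seconds_per_step = elapsed_seconds / steps_done
    return seconds_per_step * (steps_total - steps_done)


def proposed_time_limit_minutes(steps_done, steps_total, elapsed_seconds,
                                safety_factor=1.2):
    """Elapsed time plus padded remaining time, rounded up to whole minutes."""
    remaining = estimate_remaining_seconds(steps_done, steps_total,
                                           elapsed_seconds)
    total_seconds = elapsed_seconds + remaining * safety_factor
    return math.ceil(total_seconds / 60)


def scontrol_command(job_id, limit_minutes):
    """Build (but do not run) the command that would set the new limit."""
    return ["scontrol", "update", f"JobId={job_id}",
            f"TimeLimit={limit_minutes}"]


# Example: 100 of 400 timesteps finished after 3000 seconds.
limit = proposed_time_limit_minutes(100, 400, 3000, safety_factor=1.5)
print(" ".join(scontrol_command(12345, limit)))
```

A production daemon would also need to handle checkpoint phases, noisy progress reports, and site policy on how far a limit may grow; none of that is modeled here.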

This adaptive approach ensures that timeouts are adjusted to avoid interrupting checkpoints or terminating near-completion jobs. This reduces wasted computational progress and improves overall job throughput.

Dynamically adjusting job time limits based on real-time progress holds promise for enhancing the robustness and efficiency of job scheduling in HPC systems.

This work presents and seeks feedback on the technical aspects of the proposed feedback loop, including the mechanisms for progress reporting, the daemon's computational methods, and the integration with Slurm's scheduling capabilities.

Schedule

Wednesday evening, 11 September

Welcome Reception

Optional; attendance is included with your ticket

Thursday, 12 September

Technical Presentation Day 1

9:00 - 16:30, followed by group dinner

Friday, 13 September

Technical Presentation Day 2

9:00 - 16:00, concludes with roadmap presentation and Q&A

The University of Oslo is hosting Slurm User Group 2024. As the oldest university in Norway, and a leader in European education, the University of Oslo is committed to excellence in education, research, science communication and innovation. This is the perfect backdrop for our users to have educational and innovative conversations about Slurm.

About SchedMD

SchedMD is the core company behind Slurm distribution and maintenance. We are also the sole provider for Slurm support, development, training, and configuration, accelerating Slurm scheduling results with proven best practices.

About Slurm

Slurm is an open-source workload manager designed to satisfy the demanding needs of high-performance computing, high-throughput computing, and AI. Slurm's automation capabilities simplify administration, accelerate job execution, and improve end user productivity, all while reducing cost and error margins.
