Slurm Administration & Systems Architecture (Santa Rosa) Job at Midjourney, Santa Rosa, CA

  • Midjourney
  • Santa Rosa, CA

Job Description

Overview

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI/ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare metal HPC/AI/ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI/CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI/ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.

Job Tags

Part time
