Senior Site Reliability Engineer

7 days ago

Spain Hamilton Barnes Full-time

Senior Site Reliability Engineer - EU Wide - Remote

Join a stealth-mode hyperscale data centre start-up building an AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready for experimentation, full-scale model training or inference. As a Senior Site Reliability Engineer, you'll own the reliability, performance and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Interested in finding out more? Apply today.

Responsibilities
- Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting and auto-healing systems for high-availability GPU workloads.
- Collaborate with ML, networking and platform teams to optimise resource scheduling, GPU utilisation and data flow.
- Implement infrastructure-as-code, CI/CD pipelines and reliability standards across thousands of nodes.
- Diagnose performance bottlenecks and drive continuous improvements in reliability, latency and throughput.

Skills / Must Have
- 7+ years of experience in SRE, DevOps or Infrastructure Engineering roles supporting large-scale compute environments.
- Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
- Deep knowledge of Linux systems, networking and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
- Proficiency in Python, Go or Bash for automation, tooling and performance tuning.
- Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
- Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
- Background in reliability engineering, distributed systems or hardware acceleration environments is a strong plus.

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Location: The Hague, South Holland, Netherlands

Senior Site Reliability Engineer

7 days ago

Spain Trust In Soda Full-time

Senior Site Reliability Engineer | Spain (Hybrid) Various interpersonal skills and experience may be required for the following role. Please make sure to read the description below carefully. An opportunity to join a high-growth, late-stage technology company operating at significant scale. The business supports thousands...
Senior Site Reliability Engineer

1 day ago

Spain RetailNext Full-time

Senior Site Reliability Engineer RetailNext is looking to expand our SRE team. We need people with the skill set of good backend developers to focus on the operation and reliability of our SaaS retail analytics solution. We pull in and process data from thousands of brick-and-mortar stores to help our customers better understand and serve their...
Site Reliability Engineer

4 days ago

Spain Datadope Full-time

At DataDope, we are transforming the way organizations understand, monitor, and manage their systems through observability and intelligent data analysis. We work with leading clients in critical sectors, helping them define and deploy observability offices that deliver real business value. We are looking for...
Site Reliability Engineer

7 days ago

Spain Okta for Developers Full-time

Get to know Okta Okta is The World’s Identity Company. We free everyone to safely use any technology, anywhere, on any device or app. Our flexible and neutral products, Okta Platform and Auth0 Platform, provide secure access, authentication, and automation, placing identity at the core of business security and growth. Okta is The World’s Identity...
Site Reliability Engineer

2 weeks ago

Spain Capitole Full-time

Overview Capitole keeps growing and we want to do it with you. We are looking for a Site Reliability Engineer (SRE) to join our international team and ensure the stability, scalability, and continuous improvement of our technology platforms. The role requires a highly technical profile with a cross-functional outlook. Responsibilities Databases: SQL,...
Senior Reliability Engineer

2 weeks ago

Spain Uplift People Consulting Full-time

Our client, a global leader in predictive maintenance and industrial reliability solutions, is looking for an experienced Senior Reliability Engineer to support major industrial customers across Europe. In this role, you will lead reliability improvement initiatives for rotating equipment, deploy advanced condition monitoring technologies, and help...
Site Reliability Engineer

1 week ago

Spain Nextiva Inc. Full-time

Redefine the future of customer experiences. One conversation at a time. We’re changing the game with a first-of-its-kind, conversation-centric platform that unifies team collaboration and customer experience in one place. Powered by AI, built by amazing humans. Our culture is forward-thinking, customer-obsessed and built on an unwavering belief that...
Senior Site Reliability Engineer

1 week ago

Spain Affirm Full-time

Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest. Affirm is looking for a Senior Software Engineer for our Cloud Compute team to play a pivotal role in ensuring the robust and scalable foundation of our entire platform. As a fully remote...
Site Engineer

1 week ago

Spain Utopia Design Full-time

About The Company Utopia is a forward-thinking global real estate development group dedicated to creating a distinctive portfolio of luxurious and high-end hospitality assets. Founded in 2023, Utopia Design, the company's in-house architectural firm, drives bold vision and innovative thinking in every project. About The Role Local Site Manager...
Site Reliability Engineer

1 week ago

Spain Exoticca Full-time

Overview What is Exoticca? Exoticca is a pioneering online travel agency that has revolutionized the conception, production, and e-commerce of long-distance dream trips. At the core of Exoticca's brand equity is the commitment to "creating life milestones." We believe in delivering best-value trips, exploring unique destinations, curating extraordinary...

Senior Site Reliability Engineer