Senior Site Reliability Engineer

2 weeks ago


Spain Hamilton Barnes Full-time

Senior Site Reliability Engineer - EU Wide - Remote

Join a stealth‑mode hyperscale data centre start‑up building an AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready for experimentation, full‑scale model training or inference. As a Senior Site Reliability Engineer, you’ll own the reliability, performance and automation of this GPU‑powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Interested in finding out more – Apply today

Responsibilities
- Design, deploy, and maintain large‑scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting and auto‑healing systems for high‑availability GPU workloads.
- Collaborate with ML, networking and platform teams to optimise resource scheduling, GPU utilisation and data flow.
- Implement infrastructure‑as‑code, CI/CD pipelines and reliability standards across thousands of nodes.
- Diagnose performance bottlenecks and drive continuous improvements in reliability, latency and throughput.

Skills / Must Have
- 7+ years of experience in SRE, DevOps or Infrastructure Engineering roles supporting large‑scale compute environments.
- Strong hands‑on experience with Kubernetes and Slurm for cluster orchestration and workload management.
- Deep knowledge of Linux systems, networking and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
- Proficiency in Python, Go or Bash for automation, tooling and performance tuning.
- Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
- Familiarity with high‑performance computing (HPC) or AI/ML training infrastructure at scale.
- Background in reliability engineering, distributed systems or hardware acceleration environments is a strong plus.

Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Information Technology
Location: The Hague, South Holland, Netherlands



  • Spain ICEO - Venture Builder Full-time

    Senior Site Reliability Engineer - Fintech. As a Senior Site Reliability Engineer, you'll have a direct and influential role in shaping our organisation's reliability strategy and infrastructure. You'll proactively create robust solutions, implement best practices,...


  • Spain ICEO - Venture Builder Full-time

    Senior Site Reliability Engineer / DevOps As a Senior Site Reliability Engineer, you'll have a direct and influential role in shaping our organisation's reliability strategy and infrastructure. You'll proactively create robust solutions, implement best practices, and drive infrastructure excellence across all teams. Join us remotely; you can be located...


  • Spain ICEO - Venture Builder Full-time

    Senior Site Reliability Engineer - Fintech. As a Senior Site Reliability Engineer, you'll have a direct and influential role in shaping our organisation's reliability strategy and infrastructure. Join us remotely; you can be located anywhere in Europe within the...


  • Spain RetailNext Full-time

    Senior Site Reliability Engineer. RetailNext is looking to expand our SRE team. We need people who have the skillset of good backend developers to focus on the operation and reliability of our SaaS retail analytics solution. We pull in and process data from thousands of brick‑and‑mortar stores to help our customers better understand and serve their...


  • Spain Searchability® Full-time

    Senior Site Reliability Engineer (SRE) – Barcelona (Hybrid). Key Points: Barcelona-based hybrid role with a respected global organisation; Azure-first SRE work across cloud, edge and on-premise platforms; Terraform, GitHub Actions, Azure Arc, AKS, Datadog; salary up to €75,000. About the Client: I’m supporting an established international organisation...


  • Spain Lognext Full-time

    At Lognext, we have spent more than 18 years identifying and implementing practical technology solutions that let us keep moving forward and optimise our operations, supporting teams with expert, high-performing talent and making technology a transformative force in our day-to-day work. We are looking for a Site Reliability Engineer (SRE)...

  • Site Reliability Engineer

    2 weeks ago


    Spain Goldavenue SA Full-time

    Join us as our new Site Reliability Engineer (SRE) We’re looking for a pragmatic and detail-oriented Site Reliability Engineer to strengthen our technology team. In this role, you’ll help ensure the reliability, scalability, and performance of our platform across all environments. You’ll work closely with development, product, and operations teams to...


  • Spain The Contracts Management Group Full-time

    Senior Site Reliability Engineer - Remote: Are you excited by the prospect of working with innovative security products? Do you enjoy creating innovative and strategic solutions to solve complex problems? Join Guardicore (now Akamai Enterprise Security Group). Are you...

  • Site Reliability Engineer

    2 weeks ago


    Spain CORUS Consulting Full-time

    Site Reliability Engineer (SRE). At CORUS, we are looking for a Site Reliability Engineer (SRE) to join our IT Operations team on an international project. Location: 100% remote from Spain. Objectives: Measure and optimize system performance through the implementation and management of an appropriate monitoring system. Ensure the availability and reliability of...


  • Spain Datadope Full-time

    At DataDope, we are transforming the way organisations understand, monitor and manage their systems through observability and intelligent data analysis. We work with leading clients in critical sectors, helping them define and deploy observability offices that deliver real value to the business. We are looking for...