Site Reliability Engineer

2 days ago


Madrid, Madrid, Spain Roche Full-time

At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.

The Position

This role requires on-call availability: the candidate must respond promptly to urgent issues and emergencies outside regular working hours so that critical situations are addressed quickly and effectively.

Who We Are
At Roche, we are passionate about transforming patients' lives, and we are bold in both decision and action - we believe that good business means a better world. That is why we come to work every single day. We commit ourselves to scientific rigor, unassailable ethics, and access to medical innovations for all. We do this today to build a better tomorrow.

Roche is strongly committed to a diverse and inclusive workplace. We strive to build teams that represent a range of backgrounds, perspectives, and skills. Embracing diversity enables us to create a great place to work and to innovate for patients.

Roche is building a global site reliability engineering (SRE) team that will support commercial and internal solutions. This team will have the mindset of building and creating engineering solutions to solve a broad spectrum of problems.

Step into the Future of IT Infrastructure with Roche
As a seasoned Site Reliability Engineer (SRE) at Roche, you'll leverage your deep software engineering expertise to propel our IT infrastructure to new heights of robustness, scalability, and reliability. This isn't just a role—it's an invitation to shape the backbone of critical infrastructures and drive our technological innovations forward.

Your Mission
Design and maintain cutting-edge tools, scripts, and frameworks that automate repetitive tasks, streamline software deployment, and manage expansive systems with unparalleled efficiency.

Partner closely with forward-thinking development teams to architect and implement high-performance solutions that elevate system efficiency, optimize resource utilization, and enhance deployment processes for superior uptime and user satisfaction.

Your Impact
Lead the charge in incident management and response. Detect system anomalies, troubleshoot swiftly, and conduct thorough root cause analyses to prevent recurring issues.

Champion continuous improvement by refining monitoring and alerting mechanisms, conducting insightful post-incident reviews, and embedding best practices in software lifecycle management. Your strategic foresight and meticulous planning will ensure our systems are not only reliable but also superlatively performant.

By joining our elite team, you will play a pivotal role in delivering seamless experiences to our end-users, exceeding business and customer demands, and solidifying Roche's reputation as a leader in IT innovation.

Your Core Responsibilities

  • Reliability Mastery: Proactively monitor and maintain system reliability using advanced tools like DataDog, VictorOps, ELK, Grafana, and Prometheus. Become a key player in ensuring system stability and performance
  • Uptime Guardian: Ensure optimal uptime and performance by swiftly identifying issues and responding to alerts with precision
  • Technical Troubleshooter: Apply a working understanding of system architecture and design to dive deep into complex technical issues, then investigate, troubleshoot, and resolve them. Collaborate seamlessly with engineering teams to enable timely and effective resolutions
  • Service Excellence: Maintain and consistently achieve defined SLAs, SLIs, and SLOs, ensuring service levels are consistently met or exceeded
  • Automation Innovator: Develop and deploy automation scripts (using Python or other scripting languages) to streamline operations, enhance system efficiencies, and reduce manual tasks
  • Cloud Steward: Manage and maintain robust infrastructure across AWS and Azure environments, implementing best practices to ensure peak performance and reliability of cloud-based applications. Drive cost optimization through best-practice implementation and continuous vigilance
  • Cross-functional Collaborator: Work closely with engineering, DevOps, security and operations teams to drive continuous improvement and foster a culture of reliability and inclusion
  • Incident Responder: Handle requests and incidents through JIRA and ServiceNow, documenting troubleshooting procedures, solutions, and lessons learned to fuel ongoing improvements
  • Flexible Scheduling: Work on-call outside of normal working hours and weekends as scheduled to ensure continuous support
  • Team Builder: Actively contribute to the growth and development of the SRE team's capabilities, nurturing a stronger, more inclusive, and resilient team

Who You Are

  • Educational Background: Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent professional experience. An MBA or PhD is a plus, but not required
  • Certifications: Relevant industry certifications (AWS/Azure) to showcase your expertise
  • Experience: Approximately 5 years of experience in site reliability engineering, IT operations, DevOps, or related fields, or equivalent skills and experience
  • Cloud Expertise: Solid experience with AWS and/or Azure, including setting up, monitoring, and maintaining cloud resources (including knowledge of Kubernetes, EKS, AKS, GKE, etc.). A basic understanding of Infrastructure as Code tools such as Terraform is also expected
  • Tool Proficiency: Proficiency with monitoring and logging tools such as DataDog, Splunk On-Call, ELK stack, Grafana, and Prometheus. Knowledge of Loki, Mimir, and Tempo is a plus
  • Hands-On Skills: Hands-on experience with JIRA and ServiceNow for tracking incidents, requests, and documentation
  • Scripting Knowledge: Proficiency in Python or similar scripting languages for automation purposes
  • Incident Response: Understanding of core SRE principles, along with in-depth understanding of incident prioritization, escalation processes, and service level management (SLA/SLO/SLI)
  • Troubleshooting: Demonstrates proficient troubleshooting capabilities, especially in cloud and distributed system environments
  • Communication and Teamwork: Excellent communication, teamwork, and documentation skills, with a proactive and self-motivated approach to improving system reliability and operational efficiencies
  • Diversity and Inclusion: We value and encourage candidates from diverse backgrounds and experiences, believing that diverse perspectives drive innovation and success
  • Language requirements: Excellent spoken and written English

Why Join Us?
By joining our team, you will be part of a dynamic environment where your contributions will directly impact the resilience and reliability of our services. You will have opportunities for professional growth and the ability to collaborate with industry leaders. Let's drive the future of IT stability together, ensuring an exceptional experience for our customers.

Ready to make a difference? Apply now to be our next SRE Incident Manager and help us build a more reliable future.

Who we are

A healthier future drives us to innovate. Together, more than 100,000 employees across the globe are dedicated to advancing science, ensuring everyone has access to healthcare today and for generations to come. Our efforts result in more than 26 million people treated with our medicines and over 30 billion tests conducted using our Diagnostics products. We empower each other to explore new possibilities, foster creativity, and keep our ambitions high, so we can deliver life-changing healthcare solutions that make a global impact.

Let's build a healthier future, together.

Roche is an Equal Opportunity Employer.

