Site Reliability Engineer

2 weeks ago

Spain Medium Full-time

About Tinybird: At Tinybird, we help developers and data teams take flight by unlocking the power of real-time data to quickly build data pipelines and innovative data products. With Tinybird, you can effortlessly ingest multiple data sources at scale, query and shape them using the 100% pure SQL you already know and love, and publish results as low-latency, high-concurrency APIs for your applications to chirp about. Developers can create fast APIs, faster: what used to take hours and days now only takes minutes. Tinybird is the essential tool that data engineers and software developers have been waiting for, enabling you to drive innovation with ease.

What you will be doing: We are looking for someone to help us keep our software and infrastructure reliable and elastic as we scale. Someone who knows how to make hardware and software play together, and who participates in the on-call team to understand not only our product but also the issues our clients face. We run our stack on Linux. We try to keep things simple.

Technologies we use:
OpenResty: SSL termination and load balancing
Varnish: load balancing and, sometimes, caching
Redis: metadata store
Python: most of our backend uses Python, except some small bits that rely on C++ for hot paths
ClickHouse: our main data store
ZooKeeper: coordination of ClickHouse replicas
Grafana, Loki and Mimir: monitoring and alerting
Terraform: cloud provisioning (virtual machines, networks, Kubernetes clusters)
Ansible: deploys, plus software and config provisioning
Kubernetes: the base of our infrastructure. Most of our infra runs on top of it, and autoscaling is key

We operate a large-scale distributed system where efficiency matters. This is not just about automating infrastructure; it is about building and evolving a self-service platform that makes the most of the underlying hardware, adapts to workload changes, and autoscales accordingly. You'll be working closely with the product and backend teams to design the system architecture, optimize usage of resources, and make our platform more elastic and autonomous. You'll need to understand how ClickHouse works under the hood and how to extract the best possible performance from it.

Some challenges and things we want to improve:
High availability and elasticity: As our customer base grows, the system must scale automatically and efficiently, without manual intervention. The platform should make capacity decisions on its own, in a transparent and safe way.
Observability: From low-level resource usage to high-level service metrics. Good understanding of storage, networking, and compute is a must.
Disaster recovery: Better tooling, better incident discovery, better on-call experience.

What you bring:
Experience designing, building and running distributed cloud architectures and large-scale web-based applications. That is, in so many words, what you will be responsible for at Tinybird.
Programming skills and willingness to dive into our codebase, ClickHouse source code, or any other software we use in order to figure out how things work. At Tinybird, we work mostly with Python and C++.
Accountable and enthusiastic to take on the responsibility of designing and managing the platform, and an urge to take on things that may be broken. Unafraid to break stuff because you own it and can fix it if need be.
Bias for action, iteration and delivery. Conscious that decisions can often be reversed quickly and that speed is of the essence in business and technology.
You think in terms of systems and are attuned to edge cases, failure modes, behaviors, and specific implementations.
Comfortable collaborating and communicating asynchronously, but expect direct communication within the team on a daily basis.
Build software with empathy, ensuring it's intuitive and maintainable.
Document key insights and solutions to make it easy for everyone to understand and use without needing extensive documentation.
Experience with OpenResty, Varnish, Redis, Terraform or Ansible would be great for you to get up and running quickly, but we don't bring you here to tell you what the right technologies are: rather, we expect you to recommend the right one for each challenge.
Experience with ClickHouse and/or rolling out database systems at scale would be a huge plus.
Deep expertise in Kubernetes: you should be comfortable designing and operating production-grade clusters, writing custom controllers or operators when needed, and tuning autoscaling mechanisms (KEDA, Karpenter,…) to adapt to real-time workloads. You should understand how Kubernetes handles networking, storage, scheduling, and resource management under the hood, and be able to reason about performance and failure scenarios at scale.
Proficiency working with AWS and GCP cloud providers.

Salary: €62,000 - €109,000 a year.

We also offer:
22 days of holiday a year (plus your birthday and public holidays).
Comprehensive health benefits.
Freedom to work from wherever suits you best.
We provide up to €2,400 to help you set up your home work space.

How We Work: We're a fully remote company, committed to a remote-first culture. With offices in Madrid and New York City, we love face-to-face interactions, and you can visit whenever it suits you. As we're in the early stages, your contributions will have a significant impact on everything we do. We believe in transparency, so you'll always be in the loop about what's happening. Check out our blog or follow us on LinkedIn to find out more about what's important to us.

Site Reliability Engineer

2 weeks ago

Spain Datadope Full-time

At DataDope we are transforming the way organizations understand, monitor and manage their systems through observability and intelligent data analysis. We work with leading clients in critical sectors, helping them define and deploy observability offices that deliver real business value. We are looking for...
Senior Site Reliability Engineer

2 weeks ago

Spain RetailNext Full-time

Senior Site Reliability Engineer. RetailNext is looking to expand our SRE team. We need people who have the skillset of good backend developers to focus on the operation and reliability of our SaaS retail analytics solution. We pull in and process data from thousands of brick-and-mortar stores to help our customers better understand and serve their...
Senior Site Reliability Engineer

6 days ago

Spain Affirm Full-time

Senior Software Engineer, Backend (Platform Reliability) Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest. We are seeking a Senior Software Engineer to join our Infrastructure Platform Engineering team, ensuring the reliability,...
Senior Site Reliability Engineer

4 days ago

Spain ICEO - Venture Builder Full-time

Join to apply for the Senior Site Reliability Engineer role at ICEO - Venture Builder Full‑time, 100% remote position available to candidates located anywhere in Europe within the CET/CEST time zones. Overview As a Senior SRE you will shape our organization’s reliability strategy and infrastructure, proactively build robust solutions, implement best...
Site Reliability Engineer

2 days ago

Spain Exoticca Travel UK Limited Full-time

About the Company Exoticca is a pioneering online travel agency that has revolutionized the conception, production, and e-commerce of long-distance dream trips. At the core of Exoticca's brand equity is the commitment to "creating life milestones." We believe in delivering best-value trips, exploring unique destinations, curating extraordinary travel...
Site Engineer

6 days ago

Spain Utopia Design Full-time

About The Company Utopia is a forward‑thinking global real estate development group dedicated to creating a distinctive portfolio of luxurious and high‑end hospitality assets. Founded in 2023, Utopia Design – the company’s in‑house architectural firm – drives bold vision and innovative thinking in every project. About The Role Local Site Manager...
Senior SRE — Remote Cloud Reliability Leader

1 week ago

Spain ICEO - Venture Builder Full-time

A leading tech company is seeking a Senior Site Reliability Engineer for a full-time, 100% remote position. You will shape the organization's reliability strategy, lead infrastructure development, and implement best practices. The ideal candidate has 5+ years of experience in a DevOps or SRE role, proficiency in programming languages like Python or Go, and...
Senior Site Reliability Engineer

1 week ago

Spain Affirm Full-time

Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest. Affirm is looking for a Senior Software Engineer for our Cloud Compute team to play a pivotal role in ensuring the robust and scalable foundation of our entire platform. As a fully remote...
Senior Site Reliability Engineer Remote, Contract

1 week ago

Spain Affirm Full-time

Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest. Affirm is looking for a Senior Software Engineer for our Cloud Compute team to play a pivotal role in ensuring the robust and scalable foundation of our entire platform. As a fully remote...
Data Centre Controls Engineer

2 weeks ago

Spain Linnk Group Full-time

A leading global technology firm is seeking a Data Centre Controls Engineer to manage Building Management and Electrical Power Monitoring Systems. The role involves training teams, providing technical support, and developing project scopes. Candidates should have over 3 years of experience in industrial controls, particularly in critical environments. This...
