We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, and maintain scalable, highly available cloud-native systems. The ideal candidate will blend software engineering expertise with systems administration to automate operations, optimize performance, and ensure reliability across our production environments. You will work at the intersection of development and operations, implementing SRE principles to reduce toil, improve system resilience, and enforce SLAs/SLOs. Strong experience with cloud platforms, container orchestration, observability tools, and infrastructure as code (IaC) is essential.
Key Responsibilities
Reliability Engineering
Define and enforce SLIs, SLOs, and SLAs for critical services
Implement error budgets and conduct blameless postmortems
Design self-healing systems with automated failover capabilities