Site Reliability Engineer (SRE)

Cape Town, Western Cape · Information Technology

We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, and maintain scalable, highly available cloud-native systems. The ideal candidate will blend software engineering expertise with systems administration to automate operations, optimize performance, and ensure reliability across our production environments.
You will work at the intersection of development and operations, implementing SRE principles to reduce toil, improve system resilience, and enforce SLAs/SLOs. Strong experience with cloud platforms, container orchestration, observability tools, and infrastructure as code (IaC) is essential.

Key Responsibilities
Reliability Engineering

Define and enforce SLIs, SLOs, and SLAs for critical services
Implement error budgets and run blameless postmortems
Design self-healing systems with automated failover capabilities
Conduct chaos engineering experiments (Gremlin, Chaos Monkey)

Cloud Infrastructure & Automation

Manage Kubernetes clusters (EKS, AKS, GKE, OpenShift)
Automate infrastructure using Terraform, Pulumi, or Crossplane
Build CI/CD pipelines (Argo CD, Flux, GitHub Actions)
Optimize cloud costs with right-sizing and spot instances

Observability & Incident Management

Implement monitoring stacks (Prometheus, Grafana, Datadog)
Configure distributed tracing (Jaeger, OpenTelemetry)
Develop alerting policies that reduce alert noise
Lead incident response and participate in on-call rotations

Performance Optimization

Analyze latency, throughput, and resource utilization
Troubleshoot Linux kernel and network performance issues
Implement caching strategies (Redis, Memcached)
Optimize database queries (PostgreSQL, MongoDB)

Security & Compliance

Enforce security best practices (CIS benchmarks)
Manage secrets (Vault, AWS Secrets Manager)
Implement zero-trust networking

Required Skills & Qualifications
Technical Skills
✅ Core SRE Technologies:

Kubernetes, Docker, and service meshes (Istio, Linkerd)
Cloud platforms (AWS, GCP, Azure)
Infrastructure as Code (Terraform, Ansible)

✅ Programming & Scripting:

Python, Go, or Java for automation
Bash/PowerShell for operational tasks

✅ Observability Stack:

Prometheus, Grafana, ELK, OpenTelemetry
Distributed tracing tools

✅ Databases & Messaging:

SQL/NoSQL databases
Kafka, RabbitMQ, or NATS

✅ Networking & Security:

TCP/IP, DNS, Load Balancing
Firewalls, WAFs, and DDoS protection

Soft Skills & Experience

3+ years in SRE, DevOps, or cloud engineering
Strong problem-solving under pressure
Excellent collaboration and documentation skills
Experience with incident management

Certifications (Preferred)

Google Cloud Professional Cloud DevOps Engineer
AWS Certified DevOps Engineer
CKA (Certified Kubernetes Administrator)
HashiCorp Certified: Terraform Associate
