Site Reliability Engineer (SRE)

Cape Town, Western Cape · Information Technology
We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, and maintain scalable, highly available cloud-native systems. The ideal candidate will blend software engineering expertise with systems administration to automate operations, optimize performance, and ensure reliability across our production environments.
You will work at the intersection of development and operations, implementing SRE principles to reduce toil, improve system resilience, and enforce SLAs/SLOs. Strong experience with cloud platforms, container orchestration, observability tools, and infrastructure as code (IaC) is essential.

Key Responsibilities
Reliability Engineering
  • Define and enforce SLIs, SLOs, and SLAs for critical services
  • Implement error budgets and run blameless postmortems
  • Design self-healing systems with automated failover capabilities
  • Conduct chaos engineering experiments (Gremlin, Chaos Monkey)
Cloud Infrastructure & Automation
  • Manage Kubernetes clusters (EKS, AKS, GKE, OpenShift)
  • Automate infrastructure using Terraform, Pulumi, or Crossplane
  • Build CI/CD pipelines (Argo CD, Flux, GitHub Actions)
  • Optimize cloud costs with right-sizing and spot instances
Observability & Incident Management
  • Implement monitoring stacks (Prometheus, Grafana, Datadog)
  • Configure distributed tracing (Jaeger, OpenTelemetry)
  • Develop alerting policies to reduce noise
  • Lead incident response and participate in on-call rotations
Performance Optimization
  • Analyze latency, throughput, and resource utilization
  • Troubleshoot Linux kernel and network performance
  • Implement caching strategies (Redis, Memcached)
  • Optimize database queries (PostgreSQL, MongoDB)
Security & Compliance
  • Enforce security best practices (CIS benchmarks)
  • Manage secrets (Vault, AWS Secrets Manager)
  • Implement zero-trust networking

Required Skills & Qualifications
Technical Skills
 Core SRE Technologies:
  • Kubernetes, Docker, and service meshes (Istio, Linkerd)
  • Cloud platforms (AWS, GCP, Azure)
  • Infrastructure as Code (Terraform, Ansible)
 Programming & Scripting:
  • Python, Go, or Java for automation
  • Bash/PowerShell for operational tasks
 Observability Stack:
  • Prometheus, Grafana, ELK, OpenTelemetry
  • Distributed tracing tools
 Databases & Messaging:
  • SQL/NoSQL databases
  • Kafka, RabbitMQ, or NATS
 Networking & Security:
  • TCP/IP, DNS, Load Balancing
  • Firewalls, WAFs, and DDoS protection
Soft Skills & Experience
  • 3+ years in SRE, DevOps, or cloud engineering
  • Strong problem-solving under pressure
  • Excellent collaboration and documentation skills
  • Experience with incident management
Certifications (Preferred)
  • Google Professional SRE
  • AWS Certified DevOps Engineer
  • CKA (Certified Kubernetes Administrator)
  • HashiCorp Certified: Terraform Associate

 
