#IkoKaziKE

Back to jobs
D

Senior Site Reliability Engineer

Deimos

full time Nairobi Posted 1 day ago

Requirements

  • Bachelor's degree in Computer Science, Information Technology, or a related field.

  • 5+ years of experience in Software Engineering, SRE, DevOps, or Platform Engineering, with demonstrable ownership of reliability standards at a team or company level.

  • Strong coding fluency: Proficiency in Python (or similar) with the ability to read, understand, reason about, and write production-grade automation code.

  • Cloud & IaC: Hands-on experience with AWS, and a solid understanding of Infrastructure as Code (Terraform or CloudFormation).

  • Deep Observability Knowledge: Demonstrable experience with monitoring tools (DataDog, Prometheus, ELK stack). Strong understanding of SRE concepts including Golden Signals, high-cardinality data handling, and error budget mathematics.

  • Systems Thinking: Strong grasp of designing for scale and resilience, including graceful failure, circuit breaking, connection pooling, and multi-AZ deployments.

  • Proven ability to define and drive reliability standards across multiple teams and drive a blameless post-mortem culture.

Enablement & RelOps Culture

  • Implement the Observability Ladder: Guide teams from basic monitoring to high-signal metric tracking. Work with product teams to define SLAs, SLIs, and SLOs, and build dashboards that track specific error budgets.

  • Empower Product Teams: Build frameworks and deployment tooling (e.g., CI/CD, internal tooling integrations) that allow teams to make data-driven decisions on deployment safety and automate rollbacks when error budgets are depleted.

  • Champion Reliability: Drive a blameless post-mortem culture focused on actionable takeaways, system improvements, and measurable metrics (MTBF, MTTR).

Frameworks & Automation

  • Standardised Alerting & On-Call: Continuously improve company-wide alerting and on-call frameworks to reduce alert fatigue, ensuring alerts are highly actionable and symptom-based.

  • Disaster Recovery: Drive evolution of DR strategies from manual processes into fully automated runbooks-as-code, allowing teams to prove and improve service recoverability through autonomous, evidence-based testing.

  • Eliminate Toil: Develop systems, automations, and tooling for pre- and post-deployment verification, ensuring our hands-off reliability vision becomes a production reality, via Python (or similar).

  • Reliability-as-Code: Lead the drive to manage our entire reliability suite through IaC. Use Terraform to architect, deploy, and configure our observability stack including ELK, Grafana, Loki, Prometheus, and Tracing.