We are seeking an experienced and driven SRE Manager to lead our Site Reliability Engineering team. This role is critical to ensuring the availability, scalability, and performance of our production systems. As the SRE Manager, you will be responsible for managing a team of engineers focused on building automation, enhancing monitoring and observability, improving system reliability, and fostering a culture of operational excellence. You will work closely with development, infrastructure, and security teams to support high-quality product delivery with minimal downtime.

Key Responsibilities:

  • Lead and grow a high-performing SRE team responsible for the reliability, performance, and scalability of production systems.

  • Own the incident management process, postmortems, and root cause analysis to improve system resilience.

  • Drive implementation of SLAs, SLOs, and error budgets across services to align operational goals with business objectives.

  • Champion the use of automation to reduce manual work and improve deployment and recovery times.

  • Collaborate with software engineering and DevOps teams to ensure systems are designed for reliability and operational efficiency.

  • Oversee system monitoring, alerting, and observability efforts using tools such as Prometheus, Grafana, or Datadog.

  • Manage on-call rotations and ensure that documentation, runbooks, and playbooks are kept up to date.

  • Identify and drive continuous improvement in system architecture, capacity planning, and deployment strategies.

  • Ensure compliance with security, privacy, and regulatory requirements within the infrastructure.

  • Provide mentorship, performance reviews, and career development opportunities for SRE team members.

Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

  • 4+ years of experience in software engineering, DevOps, or SRE roles.

  • Strong experience with cloud platforms (AWS, GCP, or Azure) and infrastructure-as-code tools (Terraform, Pulumi, etc.).

  • Proficient in programming/scripting languages such as Python, Go, or JavaScript.

  • Deep understanding of Linux systems, networking, containers, and container orchestration (Docker, Kubernetes).

  • Strong knowledge of CI/CD pipelines and release automation.

  • Excellent leadership, communication, and project management skills.

  • Proven track record of building reliable systems at scale and managing incident response in production environments.