As part of the Site Reliability Engineering (SRE) team, you’ll contribute to designing, automating, and evolving mission-critical systems. You'll combine deep systems expertise with modern software engineering practices to reduce operational toil and build resilient, self-healing services.
This is a high-impact role where your work directly affects the reliability of cloud services used by thousands of customers around the world.
Qualifications
Career Level - IC4
Responsibilities
What You’ll Do:
- Collaborate with SRE and development teams to ensure end-to-end reliability across a wide range of services and technology stacks.
- Design, write, and deploy software and automation tools that enhance availability, observability, and scalability.
- Own and evolve metrics, SLOs, SLAs, KPIs, and dashboards that track system health and customer experience.
- Build tooling to reduce manual operations and eliminate sources of toil.
- Improve CI/CD pipelines, deployment processes, and validation frameworks for reliability and efficiency.
- Review and influence architectural designs for distributed systems with a focus on resilience, performance, and fault tolerance.
- Lead and participate in post-incident reviews, capacity planning, and production-readiness assessments.
- Provide on-call support on a rotational basis (12-hour shifts, 7-day coverage).
What We’re Looking For:
- Advanced Linux systems administration
- Strong coding skills in Python (automation-focused)
- Intermediate experience with Bash/shell scripting
- Familiarity with networking principles and distributed systems behavior
- Basic to intermediate knowledge of databases (e.g., SQL, NoSQL)
- Understanding of unit testing and modern software engineering practices
- Experience with CI/CD pipelines and deployment automation
- Comfortable working in Agile development environments
Nice to Have:
- Exposure to monitoring/observability tools (e.g., Prometheus, Grafana, New Relic)
- Experience building internal tools for operational efficiency
- Participation in SRE culture: blameless postmortems, runbooks, and service design reviews