Job Summary
We are seeking a skilled System Reliability Specialist to join our team. This role will be responsible for ensuring the reliability, availability and performance of systems and services.
You will work closely with development and operations teams to design, implement and maintain scalable and efficient infrastructure.
- Develop automation scripts to streamline system processes
- Monitor system health using various tools and dashboards
- Respond to incidents and outages, performing root cause analysis and implementing corrective actions
Key Responsibilities :
System Monitoring and Incident Response :
- Use data analytics to identify system performance issues
- Analyze system metrics to optimize performance and efficiency
Performance Optimization :
- Implement optimization strategies to enhance system efficiency
- Collaborate with cross-functional teams to improve system architecture and design
Continuous Improvement :
- Maintain comprehensive documentation of systems, processes and procedures
- Participate in post-mortem reviews to drive continuous improvement
Requirements :
- Proven experience in system reliability or a similar field
- Strong knowledge of cloud computing platforms (AWS)
- Proficiency in scripting and programming languages (e.g. Python, Go, Bash)