Overview
Site Reliability Engineer (Middle) ID38916 — AgileEngine
Responsibilities
- Shift: Monday – Thursday, 8 AM – 7 PM PST (11 AM – 10 PM EST), with rotating on-call.
- Manage alerts daily, check systems, and escalate issues as needed.
- Be part of a team that provides 24×7 on-call support for critical SaaS events.
- Be available in case of emergencies when team members are not available or need help.
- Document issues and remediation steps.
- Proactively create appropriate monitors in the EKS / K8s ecosystem.
- Deploy to EKS / K8s cluster using Terraform and Helm.
- Learn and maintain existing infrastructure running under Docker Swarm.
- Improve existing infrastructure health by implementing checks and scripts to correct known issues.
- Maintain and develop deployment code.
- Automate manual tasks.
- Implement / integrate new technologies in our Cloud Infrastructure.
- Collaborate with other teams and departments to provide the highest level of support and assistance.
- Keep the customer at the forefront when planning deployments and updates, and consider the impact on them before making changes.
- Work closely with Support, Customer Success, Migration, and Professional Services teams to provide a best-in-class SaaS service to customers.
- Perform RCA and take necessary corrective actions to prevent recurrence of issues.
- Create and assign alert-related actions to the appropriate team after the investigation.
- Handle support requests for environment-specific actions.
- Identify and provide automation requirements to improve RCA.
Qualifications
MUST HAVES
- 2+ years of professional experience.
- Experience working with Datadog.
- Hands-on experience as an AWS Cloud Engineer.
- Working knowledge of EKS / Terraform / Helm.
- Working experience with Docker and Docker Swarm.
- Good understanding of AWS IAM roles and policies.
- Experience logging and monitoring AWS resources using CloudWatch logs.
- Experience working in a Linux environment.
- Proficient in Bash and / or Python scripting.
- A strong understanding of web technologies such as REST APIs.
- Working experience with monitoring solutions such as Grafana and Prometheus.
- Excellent oral and written communication skills; customer-facing communication skills to explain issues and RCAs.
- Experience in Product / Application Support for SaaS-based products.
- Understanding of APIs, Databases, Systems Architecture, and Design.
- Designing, implementing, and operating in a DevSecOps environment.
- Excellent communication skills, both written and verbal.
- Ability to work independently as well as in a collaborative environment.
- Technical aptitude with the desire to learn new and evolving technologies.
- Upper-Intermediate English level.
NICE TO HAVES
- Experience
Benefits
- Professional growth: mentorship, TechTalks, and personalized growth roadmaps.
- Competitive compensation: USD-based compensation with budgets for education, fitness, and team activities.
- A selection of exciting projects: modern solutions development with Fortune 500 enterprises and leading product brands.
- Flextime: flexible schedule, including options for remote work or office-based work.
Your application doesn't end here! To unlock the next steps, check your email and complete your registration on our Applicant Site. An incomplete registration may terminate your process.