EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Join our team as a Lead Site Reliability Engineer dedicated to providing advanced support for critical Azure-based systems.
You will address complex cloud challenges, enhance system observability, and strengthen reliability using Kubernetes, monitoring platforms, and Infrastructure-as-Code. If cloud reliability excites you and collaboration across teams inspires you, apply now to contribute to our innovative projects.
Responsibilities
- Resolve complex incidents to ensure system availability
- Maintain reliability and performance of Azure-based enterprise infrastructure
- Deploy observability, monitoring, and logging tools
- Automate infrastructure management with Terraform and scripting technologies
- Improve system performance and uptime through centralized monitoring
- Collaborate with multiple teams to enhance service reliability
- Perform root cause analysis and oversee postmortems for incidents
- Configure deployment pipelines in Azure DevOps for secure workflows
- Write and maintain automation scripts for incident recovery and recurring tasks
- Enhance monitoring frameworks with platforms like Prometheus and Grafana
- Respond promptly to incidents to meet SLA expectations
- Facilitate integration of monitoring data from Azure and AWS environments
- Advance service reliability and observability practices continuously
- Document processes and incident resolutions thoroughly
- Take part in Agile team events and balance task priorities
Requirements
- Minimum 5 years’ expertise in site reliability engineering or comparable DevOps roles
- 1+ years of demonstrated leadership experience
- Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
- Expertise in infrastructure automation using Azure DevOps and Terraform
- Proficiency in scripting languages such as Bash, PowerShell, and Python
- Skills in monitoring tools including Prometheus and Grafana
- Background in incident management and ITSM processes with analytical capability for root cause investigations
- Competency in resolving technical challenges promptly in high-pressure situations
- Experience in Agile workflows and fast-paced operational environments
- Flexibility to communicate effectively in written and verbal formats for teamwork and documentation
- Capability to configure alerts that prevent SLA breaches proactively
- Understanding of cloud scaling techniques and security best practices
- Knowledge of Kubernetes administration for orchestration tasks
- Ability to collaborate with diverse functional teams seamlessly
- English proficiency of B2 or higher
Nice to have
- Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray
- Familiarity with distributed logging systems and tools for incident automation
- Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
- Understanding of Kubernetes configurations for scaling and advanced networking setups
- Proficiency in observability tools such as OpenSearch for AWS environments
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Engineering, Information Technology, and Business Development
Industries
Software Development, IT Services and IT Consulting, and Pharmaceutical Manufacturing