Lead Site Reliability Engineer

EPAM Systems, Mexico

Posted more than 30 days ago

Job description

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Join our team as a Lead Site Reliability Engineer dedicated to providing advanced support for critical Azure-based systems.

You will address complex cloud challenges, enhance system observability, and strengthen reliability using Kubernetes, monitoring platforms, and Infrastructure-as-Code. If cloud reliability excites you and collaboration across teams inspires you, apply now to contribute to our innovative projects.

Responsibilities

Resolve complex incidents to ensure system availability
Maintain reliability and performance of Azure-based enterprise infrastructure
Deploy observability, monitoring, and logging tools
Automate infrastructure management with Terraform and scripting technologies
Improve system performance and uptime through centralized monitoring
Collaborate with multiple teams to enhance service reliability
Perform root cause analysis and oversee postmortems for incidents
Configure deployment pipelines in Azure DevOps for secure workflows
Write and maintain automation scripts for incident recovery and recurring tasks
Enhance monitoring frameworks with platforms like Prometheus and Grafana
Respond promptly to incidents to meet SLA expectations
Facilitate integration of monitoring data from Azure and AWS environments
Advance service reliability and observability practices continuously
Document processes and incident resolutions thoroughly
Take part in Agile team events and balance task priorities

Requirements

Minimum 5 years’ expertise in site reliability engineering or comparable DevOps roles

1+ years of demonstrated leadership experience

Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL

Expertise in infrastructure automation using Azure DevOps and Terraform

Proficiency in scripting languages such as Bash, PowerShell, and Python

Skills in monitoring tools including Prometheus and Grafana

Background in incident management and ITSM processes with analytical capability for root cause investigations

Competency in resolving technical challenges promptly in high-pressure situations

Experience in Agile workflows and fast-paced operational environments

Ability to communicate effectively, both in writing and verbally, for teamwork and documentation

Ability to configure proactive alerts that help prevent SLA breaches

Understanding of cloud scaling techniques and security best practices

Knowledge of Kubernetes administration for orchestration tasks

Ability to collaborate with diverse functional teams seamlessly

English proficiency of B2 or higher

Nice to have

Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray

Familiarity with distributed logging systems and tools for incident automation

Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer

Understanding of Kubernetes configurations for scaling and advanced networking setups

Proficiency in observability tools such as OpenSearch for AWS environments

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering, Information Technology, and Business Development

Industries

Software Development, IT Services and IT Consulting, and Pharmaceutical Manufacturing
