Talent.com
Lead Site Reliability Engineer

EPAM Systems, Mexico
Posted 29 days ago
Job description

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Join our team as a Lead Site Reliability Engineer dedicated to providing advanced support for critical Azure-based systems.

You will address complex cloud challenges, enhance system observability, and strengthen reliability using Kubernetes, monitoring platforms, and Infrastructure-as-Code. If cloud reliability excites you and collaboration across teams inspires you, apply now to contribute to our innovative projects.

Responsibilities

  • Resolve complex incidents to ensure system availability
  • Maintain reliability and performance of Azure-based enterprise infrastructure
  • Deploy observability, monitoring, and logging tools
  • Automate infrastructure management with Terraform and scripting technologies
  • Improve system performance and uptime through centralized monitoring
  • Collaborate with multiple teams to enhance service reliability
  • Perform root cause analysis and oversee postmortems for incidents
  • Configure deployment pipelines in Azure DevOps for secure workflows
  • Write and maintain automation scripts for incident recovery and recurring tasks
  • Enhance monitoring frameworks with platforms like Prometheus and Grafana
  • Respond promptly to incidents to meet SLA expectations
  • Facilitate integration of monitoring data from Azure and AWS environments
  • Advance service reliability and observability practices continuously
  • Document processes and incident resolutions thoroughly
  • Take part in Agile team events and balance task priorities

Requirements

  • Minimum 5 years’ expertise in site reliability engineering or comparable DevOps roles
  • 1+ years of demonstrated leadership experience
  • Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
  • Expertise in infrastructure automation using Azure DevOps and Terraform
  • Proficiency in scripting languages such as Bash, PowerShell, and Python
  • Skills in monitoring tools including Prometheus and Grafana
  • Background in incident management and ITSM processes with analytical capability for root cause investigations
  • Competency in resolving technical challenges promptly in high-pressure situations
  • Experience in Agile workflows and fast-paced operational environments
  • Ability to communicate effectively, both in writing and verbally, for teamwork and documentation
  • Capability to configure alerts that prevent SLA breaches proactively
  • Understanding of cloud scaling techniques and security best practices
  • Knowledge of Kubernetes administration for orchestration tasks
  • Ability to collaborate with diverse functional teams seamlessly
  • English proficiency of B2 or higher

Nice to have

  • Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray
  • Familiarity with distributed logging systems and tools for incident automation
  • Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
  • Understanding of Kubernetes configurations for scaling and advanced networking setups
  • Proficiency in observability tools such as OpenSearch for AWS environments

We offer

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering, Information Technology, and Business Development
Industries: Software Development, IT Services and IT Consulting, and Pharmaceutical Manufacturing
