EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Overview
Join our team as a Site Reliability Engineer, where you will focus on cloud infrastructure, containerization, and monitoring using Kubernetes and Microsoft Azure.
You will work closely with clients to ensure robust observability and efficient deployment pipelines. Apply now to contribute to maintaining and improving distributed systems at scale.
Responsibilities
- Create and manage containerized applications using Docker or Podman
- Deploy and maintain Kubernetes resource manifests in clusters such as Kind, GKE, or AKS
- Implement and monitor Prometheus agents to observe infrastructure and application metrics
- Troubleshoot and analyze logs to identify and resolve system events and issues
- Develop and maintain Azure DevOps CI/CD pipelines and GitOps deployment workflows
- Collaborate with teams to improve system reliability and deployment automation
- Manage infrastructure as code using Terraform and other tools
- Configure and maintain observability tools and alerting systems
- Ensure compliance with client constraints and security standards
- Participate in incident response and root cause analysis
- Document system configurations, processes, and procedures
- Support continuous improvement of deployment and monitoring practices
Requirements
- Hands-on programming experience of at least 2 years
- Proficiency in at least one scripting language
- Experience with Kubernetes container orchestration
- Knowledge of at least one cloud provider including Microsoft Azure or Google Cloud Platform
- Familiarity with Prometheus or similar monitoring tools for observability
- Experience with Azure DevOps CI/CD pipelines or GitOps tools like Helm and ArgoCD
- Understanding of distributed systems troubleshooting and log analysis
- Practical skills in containerization using Docker or Podman
- Experience creating and managing Kubernetes resource manifests
- Ability to deploy and monitor Prometheus agents
- Knowledge of infrastructure as code tools such as Terraform
- Strong problem-solving and analytical skills
- Effective communication and teamwork abilities
- English proficiency at B2 level or higher
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Seniority level
Associate
Employment type
Full-time
Job function
Engineering, Information Technology, and Business Development
Industries
Software Development, IT Services and IT Consulting, and Nanotechnology Research