Site Reliability Engineer – Azure DevOps
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Join our team as a Site Reliability Engineer, where you will ensure system reliability, manage incident responses, and enable seamless collaboration between operations and development teams.
This role demands a background in Oil & Gas combined with expertise in automation and cloud technologies. Apply now to support critical infrastructure and drive operational excellence.
Responsibilities
- Oversee and enhance the product monitoring system
- Handle incidents, including troubleshooting, resolution, documentation, and analysis
- Distribute knowledge and insights across teams
- Facilitate collaboration between operations and development
- Create automation for log analysis, testing production systems, and alerting
- Track system health, performance, and SLIs / SLOs / SLAs
- Maintain documentation for incident management procedures
- Conduct incident analyses and implement corrective actions
- Respond to on-call support requests during and after business hours
- Collaborate with teams to enhance system efficiency and reliability
- Leverage tools such as PagerDuty, ELK / Kibana, SEQ logging, Prometheus, and Grafana for system monitoring
- Develop scripts and implement automation solutions using Python, C#, and Bash
- Manage orchestration and infrastructure through SaltStack and Docker
- Support project workflows using Azure DevOps and maintain a comprehensive Wiki
- Maintain code repositories and implement version control systems using Git
Requirements
- 1+ years of experience in creating solutions, particularly in Site Reliability Engineering
- Expertise in cloud services and automation scripting with Python and Bash
- Background in Oil & Gas operations and incident handling
- Skill in managing incident responses and providing on-call support
- Familiarity with monitoring tools such as Prometheus and Grafana
- Proficiency in logging tools like ELK / Kibana and SEQ logging
- Knowledge of orchestration and infrastructure solutions including SaltStack and Docker
- Understanding of fundamental networking concepts like inbound / outbound rules and firewalls
- Proficiency in tools for project management and issue tracking like Azure DevOps
- Capability to manage source code with Git
- Strong skills in creating documentation and disseminating knowledge
- Competency in conducting detailed post-incident reviews
- Excellent troubleshooting abilities and problem-solving skills
- Effective communication skills, with an English level of at least B2
Nice to have
- Experience using PagerDuty for incident handling
- Competency in C# programming
- Understanding of SQL and MongoDB databases
- Background in Zededa infrastructure
- Experience in supporting Oil & Gas field operations
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Seniority level
Associate
Employment type
Full-time
Job function
Engineering, Information Technology, and Business Development
Industries
Software Development, IT Services and IT Consulting, and Nanotechnology Research