Overview
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are seeking a highly skilled Senior Operational Intelligence Developer to support, enhance, and maintain our Elastic & Observability Platform deployed across GCP and Elastic Cloud. The role involves developing innovative solutions, maintaining platform reliability, and enabling self-service capabilities for platform consumers, as well as participating in an on-call rotation to oversee platform health and functionality.
Responsibilities
- Ensure availability, functionality, performance, and security of observability and search platforms to meet business SLAs
- Respond to incidents and resolve escalations promptly during on-call periods
- Maintain platform documentation, standard operating procedures, and operational guidelines
- Collaborate with stakeholders and vendors to manage operational requirements, installations, and upgrades
- Enhance platform features and self-service capabilities, including Elastic Synthetics and chargeback automation
- Design proofs of concept for operational improvements such as AI-driven observability or Kubernetes migration
- Build, deploy, and maintain Elastic clusters using Infrastructure-as-Code (IaC) tools like Terraform and Ansible
- Perform platform lifecycle management activities such as component upgrades, capacity planning, and cost optimisation
- Fine-tune ELK stack performance across ingestion, indexing, and query layers
- Configure and manage comprehensive alerting and incident management workflows, including Kibana Rules, Watchers, and PagerDuty
- Support ingestion, enrichment, backup, and restoration of platform data
- Plan and manage SSL certificate rotations and cluster scalability requirements
Requirements
- 3+ years of experience in Operational Intelligence
- Proven expertise in implementing, operating, and managing Elastic clusters
- Knowledge of Elastic Stack components, including Elasticsearch, Kibana, and Logstash
- Proficiency in Infrastructure-as-Code (IaC) tools such as Terraform and Ansible, with flexibility to use Jenkins CI
- Skills in Python for automation and extending platform functionality
- Understanding of incident management workflows with tools like PagerDuty and Uptrends
- Background in troubleshooting and resolving complex platform issues efficiently
- Competency in managing scalable, fault-tolerant platforms with a focus on performance and security
- Strong communication skills in English (B2 level or higher) for collaborating with technical and non-technical stakeholders
Nice to have
- Familiarity with additional tools such as Groovy, Linux Administration, and Jenkins CI pipelines
- Capability to optimise observability workflows using advanced integrations in Uptrends and PagerDuty
- Showcase of previous work with Elastic Synthetics for advanced monitoring and testing
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn