Overview
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are looking for a highly experienced and dynamic Lead Operational Intelligence Developer to join our team.
In this role, you will take ownership of leading the development, maintenance, and enhancement of our Elastic & Observability Platform deployed across GCP and Elastic Cloud. You will drive strategic initiatives, guide a high-performing technical team, and ensure platform reliability while fostering innovation and enabling self-service capabilities for platform consumers. This position also involves participating in an on-call rotation to oversee platform health and functionality.
Responsibilities
- Oversee the availability, functionality, performance, and security of observability and search platforms to exceed business SLAs
- Provide technical leadership during complex incidents and escalate issues promptly for resolution during on-call periods
- Develop and maintain comprehensive platform documentation, standard operating procedures, and knowledge-sharing resources
- Collaborate with cross-functional teams, stakeholders, and vendors to oversee operational requirements, drive strategic initiatives, and manage installations, troubleshooting, and upgrades
- Lead the enhancement of platform features and self-service capabilities, including advanced Elastic Synthetics and chargeback automation
- Architect and implement proof-of-concepts for platform innovation, such as AI-driven observability, advanced data processing models, or Kubernetes-based platform migration
- Supervise the building, deployment, and maintenance of Elastic clusters using Infrastructure-as-Code (IaC) tools like Terraform and Ansible, while mentoring team members on best practices
- Oversee platform lifecycle management activities, including component upgrades, capacity planning, cost optimization, and evolving compliance requirements
- Continuously assess and fine-tune ELK stack performance, including ingestion, indexing, and query optimization for large-scale environments
- Establish and enhance comprehensive alerting and incident management workflows, integrating sophisticated monitoring tools such as Kibana Rules, Watchers, and PagerDuty
- Supervise the ingestion, enrichment, backup, and restoration of large-scale platform data while optimizing data workflows
- Lead and plan critical operational events such as SSL certificate rotations, cluster migrations, or scalability optimization projects
Requirements
- 5+ years of experience in Operational Intelligence, with a proven track record of leadership and technical expertise in managing large-scale observability platforms
- Demonstrated ability to architect and manage Elastic clusters in complex, multi-cloud environments
- In-depth knowledge of Elastic Stack components, including advanced configurations of Elasticsearch, Kibana, and Logstash
- Advanced proficiency in Infrastructure-as-Code (IaC) tools like Terraform and Ansible, with demonstrated flexibility in adapting to other tools like Jenkins CI or GitOps frameworks
- Advanced Python scripting skills for automation, data processing, and extending platform interoperability
- Deep understanding of incident management frameworks and workflows with tools like PagerDuty, Uptrends, and other enterprise monitoring solutions
- Proven expertise in troubleshooting and resolving complex platform challenges under tight SLAs
- Strong capability in managing and scaling fault-tolerant platforms while ensuring performance, security, and compliance across large distributed systems
- Demonstrated ability to mentor and grow team members, manage priorities, and act as a bridge between technical and non-technical teams
- Excellent command of English (B2+ level), both written and spoken, with a strong emphasis on technical communication skills
Nice to have
- Expertise in scripting with Groovy or experience in advanced Linux administration to optimize platform processes
- Track record of optimizing observability workflows with additional integrations or customizations in tools like Uptrends, PagerDuty, or Elastic features
- Hands-on experience with advanced Elastic Synthetics setups for robust monitoring and custom synthetic testing frameworks
- Experience driving strategic initiatives such as modernization through AI tooling, cloud-native transitions, or cost-saving observability optimizations
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Information Technology, Engineering, and Business Development
Industries
Software Development, IT Services and IT Consulting, and Venture Capital and Private Equity Principals