Overview
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are looking for an exceptionally skilled Lead Data Platform Operations Engineer to own the stability, security, performance, and cost efficiency of our global enterprise data platform while mentoring and guiding a team of engineers.
This position plays a pivotal role in providing operational leadership within an 8/5 model that integrates into a global follow-the-sun 24x5 support structure. The ideal candidate will combine deep expertise in cloud-based data platforms with leadership abilities and a strategic vision for performance optimization, observability, and cost efficiency.
Responsibilities
- Own the stability, security, and performance of the enterprise data platform (Snowflake, AWS data tools, dbt, orchestration tools, BI / analytics, etc.) while mentoring team members
- Lead operational coverage within the 8/5 support model and oversee the team's participation in a 24/7 on-call rotation for critical incidents
- Design and enforce robust monitoring, alerting, and observability strategies for proactive incident prevention and resolution
- Plan and coordinate platform upgrades, patching, and configuration management to meet security, compliance, and operational standards
- Define system performance benchmarks and direct tuning initiatives to align with evolving business needs
- Develop and enforce consistent observability standards across infrastructure, data pipelines, and services
- Provide strategic operational insights through team-driven dashboards and reporting solutions
- Identify, prioritize, and direct process automation efforts to improve team efficiency and reduce manual work
- Advocate for and execute team initiatives for platform resilience, scalability, and cost optimization
- Promote infrastructure-as-code and configuration-as-code best practices to drive team-wide adoption of repeatable operations
Requirements
- 5+ years of demonstrated hands-on experience managing cloud-native data platforms (e.g., Snowflake, Databricks, BigQuery, or others)
- At least 1 year of relevant leadership experience
- Advanced experience with cloud infrastructure (AWS), with a strong focus on operational leadership, automation, and cost control strategies
- Proficiency with monitoring and observability tools (Datadog, Prometheus, Grafana, ELK, CloudWatch, etc.) and their integration into team workflows
- In-depth expertise in Infrastructure as Code tools (Terraform, Pulumi, Ansible) and in leading team adoption of these practices
- Strong knowledge of cloud networking, security, and compliance, with the ability to guide others in these domains
- Proven problem-solving and leadership skills to mentor team members and drive a proactive service approach
- Experience managing operations within a global environment, including overseeing on-call responsibilities
- Effective communication and collaboration with diverse engineering, data, and business stakeholders
- Track record of driving team-wide continuous improvement and operational excellence initiatives
- Excellent command of written and spoken English (B2+ level)
Nice to have
- Proven experience driving FinOps frameworks and leading cost optimization programs
- Prior experience in regulated industries (pharma, healthcare, finance) with compliance-driven operational requirements
- Expertise with modern data stack tools (dbt, Dagster/Airflow, ThoughtSpot, Tableau, Power BI) and their integration into team projects
- Experience applying and promoting SRE (Site Reliability Engineering) practices at the team or organizational level
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling, and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek, and LinkedIn
Seniority
Mid-Senior level
Employment type
Full-time
Job function
Business Development, Information Technology, and Engineering
Industries
Software Development, IT Services and IT Consulting, and Pharmaceutical Manufacturing