Principal SaaS Capacity Engineer
Oracle
Zapopan, Jalisco, Mexico
Posted 2 days ago
Job description
Required Qualifications
Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, Cloud / Systems Engineering, or a related field.
5+ years of experience in cloud infrastructure, SaaS operations, or capacity engineering roles.
Hands-on experience with large-scale distributed systems, OCI (or AWS, Azure, GCP), and SaaS production environments.
Strong programming and scripting experience (Python, Go, Shell, SQL) for automation and AI / ML model deployment.
Proven experience deploying AI / ML solutions for capacity forecasting, anomaly detection, and intelligent workload tuning.
Deep understanding of cloud capacity topology and distributed service dependencies.
Proficiency with infrastructure-as-code (Terraform, Ansible, Helm, Kubernetes).
Familiarity with AIOps tools and AI-driven observability platforms (Datadog, Dynatrace, Splunk, or similar).
Knowledge of resiliency and disaster recovery strategies, including AI-simulated failure modeling.
Preferred Qualifications
Advanced degree (Master’s / PhD) with specialization in AI, ML, Data Science, or distributed systems engineering.
Experience building and deploying self-healing, AI-driven automation at scale in a SaaS environment.
Domain expertise in reinforcement learning applications for automated resource optimization.
Direct exposure to Oracle Cloud Infrastructure (OCI) systems and tools.
Experience with cloud-native AI / ML services, MLOps, and continuous model monitoring.
Competencies and Skills
Expertise in designing, developing, and deploying AI / ML models for cloud infrastructure use cases (demand forecasting, anomaly detection, workload optimization).
Advanced proficiency in automation, orchestration, and self-healing system architectures.
Skilled in communicating technical concepts, AI-powered analytics, and strategic insights to engineering and executive audiences.
Strong analytical and critical thinking skills, with a deep data-driven mindset.
Curiosity and initiative to explore APIs, system profiles, and operational anomalies, translating technical findings into impactful business outcomes.
Highly collaborative, adaptive, and passionate about operational excellence and continuous learning.
Ability to influence cross-team priorities and drive best practices in AI-enhanced capacity engineering.
Qualifications
Career Level - IC4
Responsibilities
Service Accountability: Ensure SaaS production capacity availability, optimization, scaling automation, reserve management, and quota governance.
AI / ML Integration: Apply AI / ML models for predictive capacity forecasting, anomaly detection, and workload auto-tuning to anticipate demand spikes and prevent outages.
Observability & AIOps: Leverage AI-powered observability and AIOps platforms for end-to-end system monitoring, intelligent alerting, and automated incident mitigation.
Strategic Partnership: Collaborate with Product and Development teams to design, validate, and align AI-driven scaling and capacity planning strategies with new launches and initiatives.
Automation & Orchestration: Design, implement, and optimize automation and orchestration pipelines, including self-healing systems, policy-driven provisioning, and disaster recovery simulations, using AI to enhance reliability and operational resilience.
Data-Driven Decision Support: Deliver advanced instrumentation, AI-powered analytics, and actionable dashboards to inform executives, engineering teams, and stakeholders.
Technical Leadership: Translate complex OCI stack and cloud platform resources (compute, storage, DB, networking) into business-ready, AI-enhanced capacity solutions and performance profiles.
Simulation & Resiliency: Use AI / ML models to simulate, validate, and improve resiliency and disaster recovery scenarios for faster, more robust recovery response.
Collaboration & Communication: Present AI-driven insights, risks, and recommendations to engineering teams, ICs, and executives to illuminate capacity trends and data-driven priorities.
Continuous Innovation: Assess new AI / ML techniques, AIOps platforms, and automation tools for ongoing improvements in infrastructure reliability, scalability, and cost optimization.