Principal SaaS Capacity Engineer

Oracle • Zapopan, Jalisco, Mexico
Posted 2 days ago
Job Description

Required Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, Cloud/Systems Engineering, or a related field.
  • 5+ years of experience in cloud infrastructure, SaaS operations, or capacity engineering roles.
  • Hands-on experience with large-scale distributed systems, OCI (or AWS, Azure, GCP), and SaaS production environments.
  • Strong programming and scripting experience (Python, Go, Shell, SQL) for automation and AI/ML model deployment.
  • Proven experience deploying AI/ML solutions for capacity forecasting, anomaly detection, and intelligent workload tuning.
  • Deep understanding of cloud capacity topology and distributed service dependencies.
  • Proficiency with infrastructure-as-code (Terraform, Ansible, Helm, Kubernetes).
  • Familiarity with AIOps tools and AI-driven observability platforms (Datadog, Dynatrace, Splunk, or similar).
  • Knowledge of resiliency and disaster recovery strategies, including AI-simulated failure modeling.

Preferred Qualifications

  • Advanced degree (Master’s/PhD) with specialization in AI, ML, Data Science, or distributed systems engineering.
  • Experience building and deploying self-healing, AI-driven automation at scale in a SaaS environment.
  • Domain expertise in reinforcement learning applications for automated resource optimization.
  • Direct exposure to Oracle Cloud Infrastructure (OCI) systems and tools.
  • Experience with cloud-native AI/ML services, MLOps, and continuous model monitoring.

Competencies and Skills

  • Expertise in designing, developing, and deploying AI/ML models for cloud infrastructure use cases (demand forecasting, anomaly detection, workload optimization).
  • Advanced proficiency in automation, orchestration, and self-healing system architectures.
  • Skilled in communicating technical concepts, AI-powered analytics, and strategic insights to engineering and executive audiences.
  • Strong analytical and critical thinking skills, with a deep data-driven mindset.
  • Curiosity and initiative to explore APIs, system profiles, and operational anomalies, translating technical findings into impactful business outcomes.
  • Highly collaborative, adaptive, and passionate about operational excellence and continuous learning.
  • Ability to influence cross-team priorities and drive best practices in AI-enhanced capacity engineering.

Qualifications

Career Level - IC4

Responsibilities

  • Service Accountability: Ensure SaaS production capacity availability, optimization, scaling automation, reserve management, and quota governance.
  • AI/ML Integration: Apply AI/ML models for predictive capacity forecasting, anomaly detection, and workload auto-tuning to anticipate demand spikes and prevent outages.
  • Observability & AIOps: Leverage AI-powered observability and AIOps platforms for end-to-end system monitoring, intelligent alerting, and automated incident mitigation.
  • Strategic Partnership: Collaborate with Product and Development teams to design, validate, and align AI-driven scaling and capacity planning strategies with new launches and initiatives.
  • Automation & Orchestration: Design, implement, and optimize automation and orchestration pipelines, including self-healing systems, policy-driven provisioning, and disaster recovery simulations, using AI to enhance reliability and operational resilience.
  • Data-Driven Decision Support: Deliver advanced instrumentation, AI-powered analytics, and actionable dashboards to inform executives, engineering teams, and stakeholders.
  • Technical Leadership: Translate complex OCI stack and cloud platform resources (compute, storage, DB, networking) into business-ready, AI-enhanced capacity solutions and performance profiles.
  • Simulation & Resiliency: Use AI/ML models to simulate, validate, and improve resiliency and disaster recovery scenarios for faster, more robust recovery response.
  • Collaboration & Communication: Present AI-driven insights, risks, and recommendations to engineering teams, ICs, and executives to illuminate capacity trends and data-driven priorities.
  • Continuous Innovation: Assess new AI/ML techniques, AIOps platforms, and automation tools for ongoing improvements in infrastructure reliability, scalability, and cost optimization.