Senior Site Reliability Engineer (SRE)
with advanced English skills (B2/C1) for a full-time position.
Location: Mexico
Job Description:
We are seeking a highly skilled Senior SRE with solid experience to help lead transformational initiatives across IT operations and development. In this role, you will help design, develop, and implement cutting-edge SRE solutions, driving IT operations organizations toward an engineering-centric approach.
Key Responsibilities
- Strong command of SRE principles, key metrics, and transformation steps.
- Drive automation for repetitive operational tasks (toil reduction) through scripts, playbooks, and self-healing workflows.
- Design and implement automated runbooks, pipelines, and reliability blueprints to accelerate incident mitigation and enhance system resiliency.
- Experience with the transformation from traditional support to SRE is a great advantage.
- Experience in large-scale production environments with ITIL and SRE processes, and a good understanding of ticket management.
- Strong understanding of Agile, Waterfall, Scrum, and Kanban, and experience leading SRE deliverables.
- Collaborate with development teams on resiliency to ensure that services and applications are designed with operational reliability in mind.
- Implement monitoring systems to assess the performance of applications and infrastructure and proactively identify areas for optimization.
- Understand incident and problem management processes and post-mortems, and drive improvements to prevent future incidents.
- Ability to translate technical language from Spanish to English, mainly within monitoring dashboards and alerting.
Required Skills & Experience
- Around 8-10 years of hands-on SRE experience with cloud technologies, development, SRE toolsets, and automation.
- Automation experience (data refresh, releases, DB snapshots) using Ansible or any scripting language.
- Solid experience building AI workflows and operations orchestration for toil reduction and issue resolution with self-healing.
- Hands-on experience with AIOps tools and technologies for building AI agents and agentic flows.
- Participation in the architecture of reliable, scalable, high-performance systems and services, with a focus on operational excellence, availability, and performance.
- Hands-on experience building observability as a service: telemetry data collection using OpenTelemetry, APM, SolarWinds, open-source tools (Prometheus and Grafana), and log aggregation (Kibana or Splunk).
- Experience with single-pane observability dashboarding.
- Strong hands-on experience with cloud technology (AWS): Control Tower, project setup, account creation, RDS, SSO.
- Solid understanding of and hands-on experience with Docker and Kubernetes.
- Good experience with Linux commands, GitLab CI/CD setup, and Terraform (state management, etc.).
- Monitoring and alerting setup experience with Splunk, Prometheus, Grafana, Kibana, ELK, etc.
- Hands-on experience with APM tools, preferably Datadog, AppDynamics, or Dynatrace.
- Good understanding of observability frameworks that use programmatic SLI/SLO blueprints to standardize the collection of golden signals.
- Experience with the following languages (Groovy DSL, Java, Python, YAML) and with microservices architecture.
- Good understanding of and hands-on experience with MQ and Kafka.
- Experience with databases (Oracle, MySQL).
Nice to Have
- Any of the relevant professional certifications: Certified Site Reliability Engineer (CSRE), Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer Professional, Google Cloud Professional DevOps Engineer.
- Developer background highly desired.