Senior Site Reliability Engineer (SRE)
with advanced English skills (B2 / C1) for a full-time position.
Location: Mexico
Job Description:
We are seeking a highly skilled Senior SRE with solid experience to help lead transformational initiatives across IT operations and development. In this role, you will help design, develop, and implement cutting-edge SRE solutions, driving IT operations organizations to adopt an engineering-centric approach.
Key Responsibilities
- Be well versed in SRE parameters, key metrics, and transformation steps.
- Drive automation for repetitive operational tasks (toil reduction) through scripts, playbooks, and self-healing workflows.
- Design and implement automated runbooks, pipelines, and reliability blueprints to accelerate incident mitigation and enhance system resiliency.
- Knowledge of transforming traditional support into SRE is a great advantage.
- Experience working in large-scale production environments with ITIL and SRE processes, and a good understanding of ticket management.
- Strong understanding of Agile / Waterfall / Scrum / Kanban, and experience leading SRE deliverables.
- Collaborate with development teams on resiliency to ensure that services and applications are designed with operational reliability in mind.
- Implement monitoring systems to assess the performance of applications and infrastructure, and proactively identify areas for optimization.
- Understand incident and problem management processes, conduct post-mortems, and drive improvements to prevent future incidents.
- Translate technical language from Spanish to English, mainly within monitoring dashboards and alerting.
Required Skills & Experience
- Around 8-10 years of hands-on SRE experience with cloud technologies, development, SRE toolsets, and automation.
- Automation experience (data refreshes, releases, DB snapshots) using Ansible or a scripting language.
- Solid experience building AI workflows / operations orchestration for toil reduction and issue resolution with self-healing.
- Hands-on experience with AIOps tools and technologies for building AI agents and agentic flows.
- Participation in the architecture of reliable, scalable, high-performance systems and services, with a focus on operational excellence, availability, and performance.
- Hands-on experience building observability as a service and collecting telemetry data using OpenTelemetry, APM, SolarWinds, open-source tools (Prometheus and Grafana), and log aggregation (Kibana or Splunk).
- Experience building single-pane observability dashboards.
- Strong hands-on experience with a cloud platform (AWS): Control Tower, project setup, account creation, RDS, SSO.
- Solid understanding of and hands-on experience with Docker / Kubernetes.
- Good experience with Linux commands, GitLab CI/CD setup, and Terraform (state management, etc.).
- Monitoring and alerting setup experience with Splunk, Prometheus, Grafana, Kibana, ELK, etc.
- Hands-on experience with APM tools, preferably Datadog, AppDynamics, or Dynatrace.
- Good understanding of observability frameworks that use programmatic SLI / SLO blueprints to standardize the collection of golden signals.
- Experience with Groovy DSL, Java, Python, and YAML, and with microservices architecture.
- Good understanding of and hands-on experience with MQ and Kafka.
- Experience with databases (Oracle, MySQL).
Nice to Have
- Any of the relevant professional certifications: Certified Site Reliability Engineer (CSRE), Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer Professional, Google Cloud Professional DevOps Engineer.
- Developer background highly desired.