Senior Site Reliability Engineer

Valce Talent Solutions, Mexico

Posted more than 30 days ago

Job Description

We help our clients enhance their talent attraction capacities, especially in technological profiles.

We constantly innovate and actively seek to find the best solutions for clients and professionals. We understand the needs of our customers and aim to be the industry specialists.

We offer consulting services to technology companies in various areas, including IT, software development, cybersecurity, and project management. Our employees are the reason for the company's existence, and their satisfaction translates into that of our customers.

Job Title : Senior Site Reliability Engineer (SRE)

Experience : 5+ years

Location : Mexico / LATAM

Engagement Type : Full-Time / contractual, Fully Remote

Job Description :

We are seeking a skilled Senior Site Reliability Engineer (SRE) to join our offshore team. In this role, you will be responsible for ensuring the reliability, performance, and scalability of our critical systems. You'll develop automation, build monitoring solutions, lead incident response, and work closely with engineering teams to implement infrastructure as code, CI/CD, and cloud-native tools.

Job Responsibilities :

Maintain the reliability, availability, and performance of critical systems
Develop and maintain automation scripts and tools to streamline operations
Develop and maintain monitoring dashboards and alerts
Lead incident response, conduct post-mortem analysis, and implement preventative measures
Optimize system performance and scalability
Implement and maintain security best practices
Create and maintain comprehensive system and process documentation
Participate in on-call rotations for 24/7 critical system support

Must Haves :

Kubernetes (hands-on experience) – managing and deploying workloads

AWS Cloud Platform – deep understanding and production experience

Infrastructure as Code (IaC) – using tools like Terraform (or CloudFormation / Ansible)

Scripting / Programming – Proficiency in Python or Go

Monitoring & Alerting – Experience with Prometheus, Grafana

CI/CD Pipelines – Jenkins, GitLab CI, or similar

Incident Management – Proven experience in responding to and analyzing outages

Linux Systems & Networking – Strong fundamentals

Good to Haves :

ArgoCD, Linkerd, Karpenter, or other Kubernetes-related tools

Logging tools – Loki, ELK Stack

Security best practices – Cloud and container security knowledge

Leadership / Mentorship – Experience guiding junior engineers

Post-mortem writing & RCA – Comfortable documenting incidents and learnings

Experience in distributed systems or high-availability architectures

Recruitment Process :

AI-based online screening test

Assignment

2 client interviews

CEO Discussion

Offer : Successful candidates will receive an offer to join the team.

Soft Skills

Excellent verbal and written communication skills in English (required)

Strong problem-solving ability with a customer-first mindset

Accountability – Takes ownership of reliability and incident outcomes

Demonstrated ability to operate in high-pressure, multitasking environments independently

Passion for supporting and helping others
