Talent.com
Site Reliability Engineer (Middle / Senior) ID38916

AgileEngine • San Luis Potosí, San Luis Potosí, Mexico
1 day ago
Contract type
  • Remote
Job description

Overview

AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI / ML, and our people-first culture has earned us multiple Best Place to Work awards.

What you will do

  • Shift: Monday – Thursday, 8 AM – 7 PM PST (11 AM – 10 PM EST), with rotating on-call
  • On-call shifts: every 6 weeks, one week as primary responder followed by one week as secondary
  • Manage alerts daily, check systems, and escalate issues as needed
  • Be part of a team that provides 24×7 on-call support for critical SaaS events
  • Be available in case of emergencies when team members are not available or need help
  • Document issues and remediation steps
  • Proactively create appropriate monitors in the EKS / K8s ecosystem
  • Deploy to EKS / K8s clusters using Terraform and Helm
  • Learn and maintain existing infrastructure running under Docker Swarm
  • Improve existing infrastructure health by implementing checks and scripts to correct known issues
  • Maintain and develop deployment code
  • Automate manual tasks
  • Implement / integrate new technologies in our Cloud Infrastructure
  • Collaborate with other teams and departments to provide the highest level of support and assistance
  • Keep the customer in focus when planning deployments and updates, and consider the impact on them before making changes
  • Work closely with the Support, Customer Success, Migration, and Professional Services teams on solutions that give our customers a best-in-class SaaS service
  • Perform RCA and take necessary corrective actions to prevent the recurrence of issues
  • Create and assign alert-related actions to the appropriate team after the investigation
  • Handle support requests for environment-specific actions
  • Identify and provide automation requirements to improve RCA

Must haves

  • 2+ years of professional experience
  • Experience working with Datadog
  • Hands-on experience as an AWS Cloud Engineer
  • Working knowledge of EKS / Terraform / Helm
  • Working experience with Docker and Docker Swarm
  • Good understanding of AWS IAM roles and policies
  • Experience logging and monitoring AWS resources using CloudWatch Logs
  • Experience working in a Linux environment
  • Proficient in Bash and / or Python scripting
  • A strong understanding of web technologies such as REST APIs
  • Working experience with monitoring solutions, such as Grafana and Prometheus
  • Excellent oral and written communication skills; customer-facing communication skills to explain issues and RCAs
  • Experience in Product / Application Support for SaaS-based products
  • Understanding of APIs, Databases, Systems Architecture, and Design
  • Experience designing, implementing, and operating in a DevSecOps environment
  • Excellent communication skills, both written and verbal
  • Ability to work independently as well as within a collaborative environment
  • A technical aptitude with the desire to learn new and evolving technologies
  • Upper-Intermediate English level

Nice to have

  • Experience with GCP or Azure
  • Certifications: AWS Certified DevOps Engineer – Professional or AWS Certified Advanced Networking – Specialty

Perks and benefits

  • Professional growth: Accelerate your professional journey with mentorship, TechTalks, and personalized growth roadmaps.
  • Competitive compensation: We match your ever-growing skills, talent, and contributions with competitive USD-based compensation and budgets for education, fitness, and team activities.
  • A selection of exciting projects: Join projects with modern solutions development and top-tier clients that include Fortune 500 enterprises and leading product brands.
  • Flextime: Tailor your schedule for an optimal work-life balance, with options for remote or in-office work to suit your needs.

Seniority level

  • Mid-Senior level

Employment type

  • Full-time

Industry

  • IT Services and IT Consulting