The Principal Site Reliability Engineer will be responsible for understanding our core technology, developing a repeatable, automated resolution process, and bringing the skills needed to build and improve that process. The role is also responsible for managing the point where the main elements of incidents converge and for automating recovery from future incidents in our test environments. Collaboration and communication across multiple departments are a must for this role. If you are highly motivated and goal-oriented, can handle interruptions while fluidly switching between several projects, and take an automation-first approach to solving problems, this job will be ideal for you.
Drive long-term service reliability in test environments, increasing the odds that when a problem gets fixed, it stays fixed.
Use automation to enable quicker response and resolution and to build repeatable workflows that accelerate the remediation process.
Improve service observability by setting SLOs, SLAs, and SLIs in partnership with product and technology teams.
Own end-to-end availability of key services and build automation to prevent problem recurrence.
Assist with incident action items for control breaks to ensure issues do not result in repeat incidents.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Automate response to all non-exceptional service conditions.
Lead by example, mentor the team and establish credibility through quality technical execution.
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
Encourage the team to minimize manual systems work and focus on efforts that bring long-term value to the system.
Evaluate potential failures and their effects on the system.
Develop and deploy operational test cases to catch issues in lower environments.