Production Engineering

Software to Move You

Jan 18
12:25 PM

Site Reliability Engineering




    For the past 18 months, I've been feeling great.
When I started, the company owned tens of thousands of servers. Since then, we've increased that number by an order of magnitude. Yet for most of that time, we never actually monitored our hardware for health. We left millions (possibly hundreds of millions) of dollars of hardware powered off while its warranty slowly expired, because we had no automation to diagnose or fix it.

    Then we fixed it! We built a whole reliability workflow, with its own microservices, control plane, and more, to ensure the health of our physical servers. This included coordinating with stakeholders to ensure production stability. Things were great; we had an impactful charter and were delivering real value to the company, in the tens of millions annually.

    But a good developer shouldn't be looking to run a stable initiative; we should strive to stabilize teams, projects, and initiatives and then hand them off. So I worked with management to staff team members in Denmark and transition the initiative overseas within the following couple of months.

    One thing I didn't anticipate when transitioning to software development was the pace. We are constantly building things, and rarely finishing them before moving on to the next thing. The deadlines are tight and the tasks are obscure. The difficulty is maintaining both software fundamentals and expertise in the subject you're developing for. It's easy, and common, for developers to dive headfirst into a software problem without adequately understanding what they've been tasked with.

    Our team is transitioning to a security charter, which should provide ample learning opportunities. Stay tuned for updates!