Site Reliability Engineering
For the past 18 months, I've been feeling great.
When I started, the company owned tens of thousands of servers. Recently we've
increased that number by an order of magnitude. Yet for most of that time, we never
actually monitored our hardware for health. We left (hundreds of?) millions of dollars
of hardware powered off while the warranties slowly expired, because we had no
automation to diagnose or fix it.
Then we fixed it! We orchestrated a whole reliability workflow, with its own
microservices, control plane, and more, to ensure our physical servers' health. This
included coordinating with stakeholders to ensure production stability. Things were
great; we had an impactful charter and were delivering real value to the company in
the tens of millions annually.
But a good developer shouldn't be looking to run a stable initiative; we should
strive to stabilize teams, projects, and initiatives before handing them off. So I
worked with management to staff team members in Denmark and transition the initiative
overseas over the following couple of months.
One thing I didn't anticipate when transitioning to software development was the
pace. We are constantly building things, and rarely finishing them before moving on to
the next thing. The deadlines are tight and the tasks are opaque. The difficulty is
maintaining both software fundamentals and expertise in the subject you're developing
for. It's easy, and common, for developers to dive headfirst into a software problem
without adequately understanding what they've been tasked with.
We're transitioning to a security charter for our team, which should provide ample
learning opportunities. Stay tuned for updates!