As a follow-up to my last post about the things I look for when hiring a junior DevOps engineer, these are the things I look for in a senior engineer. It all boils down to what I consider the core of DevOps - helping engineering teams ship their apps quickly and safely to production.
At a high level, this means being able to provide:
- a CI workflow
- knowledge of release workflows
- an automated delivery workflow of multiple services to multiple test environments
- production monitoring & alerting
A senior engineer should know how to provide these basic resources to the engineering team in a way that allows for quick release cycles. They might have a specialty in one of these areas, but they should know enough to provide a decent foundation across all of them. Let's see what this means in practice.
Set up and support a CI system/workflow that serves many services spanning multiple engineering teams
This is basic need number one for any developer or dev team. These questions should have decent answers:
- How/Where should I run my browser tests?
- How/Where do I test my configuration/infrastructure-as-code?
- How do I get more nodes to run more tests?
- Can your teams build out their pipelines as they want?
- Are they able to deploy to production on their own?
- How do I deploy?
For example, if a team of 20 engineers is building an Android app, you might set up AWS Device Farm, CodeBuild, and CodePipeline, and provide Terraform/CloudFormation modules and best practices to get your teams onboarded quickly. They should be able to view logs, retry builds, and see screenshots of their tests, to name a few. If you're a small team with just a few services, Travis CI/CircleCI might be good enough. If you're a bigger team, you might opt to use GitLab. If you have very specific testing requirements, you might need to lean on Jenkins.
Some practices scale better than others, and a lot of it comes down to your organization's culture. I've seen dozens of teams each managing their own Jenkins instance, as well as a dozen teams depending on a centrally managed Jenkins instance. In one context a setup works great, while in another it can be a nightmare in practice.
Have seen multiple release workflows
As the number of sites/services that your team creates increases, you need a method to the madness to get things shipped quickly to customers while keeping all your internal stakeholders updated as changes go out. I'd hope a senior engineer has seen a good many release workflows and, most importantly, I'd love for them to be able to give great feedback about a poor release workflow they've seen.
Some teams create interesting workflows to support this: things like a release train for organizing releases across teams, or a continuous delivery environment where everything gets shipped after it goes through the pipeline. A senior engineer should be able to recommend a release process and implement a proof-of-concept of it with other engineering teams. Will you be using an ITIL workflow?
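To make the release train idea concrete, here is a minimal sketch of the scheduling rule behind it: a change rides the first train that departs on or after the day it was merged. The function name, the two-week cadence, and the dates are all assumptions for illustration, not a description of any particular team's process.

```python
from datetime import date, timedelta


def next_train(merged_on: date, first_train: date, cadence_days: int = 14) -> date:
    """Return the departure date of the first train on or after merged_on.

    first_train and cadence_days are hypothetical parameters: trains are
    assumed to leave on first_train and every cadence_days after that.
    """
    if merged_on <= first_train:
        return first_train
    elapsed = (merged_on - first_train).days
    # Ceiling division: how many whole cadence intervals until the next departure.
    intervals = -(-elapsed // cadence_days)
    return first_train + timedelta(days=intervals * cadence_days)
```

A change merged the day after a train leaves waits for the next one, which is exactly the trade-off worth discussing with a candidate: predictability for stakeholders versus latency for the change.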
Set up automated deployments to different runtimes and environments
At the beginning, your teams might be focused on JS, but there will come a time when a solution can only be built using a different language or a different framework. As your teams grow, your configuration management solution should scale to their needs. Are they able to deploy Java apps as quickly and easily as Node.js apps? What if you need to introduce Python/Go into the mix: how much or how little of your pipeline needs to change?
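One way to think about "how little needs to change" is a pipeline where only the build/test step is runtime-specific and everything after it is shared. The sketch below is an assumption-laden illustration: the registry, the function name, and the image name are hypothetical, and the shell commands are just common defaults for each ecosystem.

```python
# Hypothetical registry: runtime name -> runtime-specific build/test commands.
PIPELINE_STEPS = {
    "nodejs": ["npm ci", "npm test"],
    "java": ["mvn -B verify"],
}

# Steps shared by every runtime once the app is packaged as a container image.
COMMON_STEPS = ["docker build -t {image} .", "docker push {image}"]


def pipeline_for(runtime: str, image: str) -> list[str]:
    """Return the ordered shell commands for a service of the given runtime."""
    if runtime not in PIPELINE_STEPS:
        raise ValueError(f"no pipeline registered for runtime {runtime!r}")
    shared = [step.format(image=image) for step in COMMON_STEPS]
    return PIPELINE_STEPS[runtime] + shared


# Introducing a new language is a one-entry change; the shared steps stay put.
PIPELINE_STEPS["go"] = ["go test ./..."]
PIPELINE_STEPS["python"] = ["python -m pytest"]
```

If adding Go to your real pipeline looks more like rewriting the whole thing than adding one entry, that is a sign the pipeline is coupled to its first runtime.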
You might set up a Docker build/deploy workflow where teams can build whatever they want into their container and deploy it to a managed Kubernetes cluster, or you might have a Chef/Ansible workflow where deployments are centrally executed from Chef Server or Ansible Tower. Maybe using scp to deploy to a remote server is really all you need.
Configure and scale a distributed monitoring platform for applications and infrastructure
When you deploy your application, you need to see whether or not it works as intended. Your teams should have all the tools they need at their disposal - they should be able to view things like:
- CPU/memory metrics plotted on a graph alongside other hosts
- centralized logs across all hosts
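As a toy illustration of what "centralized logs" means, the sketch below merges per-host log streams into a single time-ordered stream. It assumes each host's stream is already sorted by timestamp; the tuple layout and the function name are made up for the example, and a real setup would lean on a log shipper and a hosted or self-run backend rather than this.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape: (unix timestamp, host, message).
LogLine = Tuple[float, str, str]


def merge_host_logs(*streams: Iterable[LogLine]) -> Iterator[LogLine]:
    """Lazily merge per-host streams (each already time-ordered) into one."""
    # heapq.merge compares tuples element by element, so timestamp decides first.
    return heapq.merge(*streams)


web1 = [(1.0, "web1", "GET /health 200"), (4.0, "web1", "GET /login 500")]
web2 = [(2.0, "web2", "GET /health 200"), (3.0, "web2", "GET /cart 200")]

for ts, host, message in merge_host_logs(web1, web2):
    print(f"{ts:.1f} {host} {message}")
```

The point of the exercise is that an engineer debugging the 500 on web1 sees it in context with what every other host was doing at the same time, without logging in to each machine.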
As you get more mature and your needs grow, you'll want to add custom, application-level metrics using tools like New Relic APM and/or Datadog.
To start with, a SaaS tool like Papertrail or logz.io might be good enough. As your services grow and needs change, it might make sense to use a bigger provider like Splunk or Elastic. Or if your senior engineers have a specialty in this area, they can deploy an Elasticsearch + Kibana + Logstash/Fluentd architecture and link it automatically to all of your services.
There are other things that I'm paying attention to with senior engineers (especially with regard to things like security practices), but their experiences with and solutions to these common basic problems are what I'm interested in the most. What I'm looking for is experience seeing different types of applications get deployed across a wide variety of teams with differing requirements and priorities. After you've seen a few, you're able to tell the processes that work from the ones that don't.
How you release to production is a particular interest of mine, and there's always something to learn in what other companies do. What's the best/worst release process you've seen? Let me know in the comments.