At SLOConf 2021, I talked about how we can use error budgets to add pass/fail criteria to the reliability tests we run as part of our CI pipelines. As Site Reliability Engineers, one of our primary goals is to reduce manual labor, or toil, to a minimum while keeping the systems we manage as reliable and available as possible. To do this safely, we need to be able to easily inspect the state of the system.
Imagine that you’re at your company’s all-hands meeting and one of the salespeople is proudly ringing the office gong to celebrate closing a deal with a client on the other side of the world. It’s a big deal because it’s a major project: their logo is going to look sleek on your website, and you are finally breaking into a new region. But two months after the project kicks off, the situation isn’t looking as rosy.