In programming, we use assertions all the time to make sure our assumptions about what the code should do are (still) valid. Whenever the resulting state is not what we expect it to be, the program stops and we receive some sort of error notification.
We’ve set up our systems to email us about it, so in the below example we would get an email if
some_function no longer works as we expect it to:
offset = some_function()
assert offset % 4096 == 0, "The resulting value should be a multiple of 4096"
We use this concept of assertions everywhere in our business. Some of those are in code, but many others run as periodic scripts (cronscripts). They’re small bots if you will, constantly checking if all kinds of “business state” is still what we expect it to be. Those checks can be about all kinds of things: from accounting, to hardware, to invoices and security.
Rather than having to manually check if everything works as expected, we have our systems tell us when something is wrong. Push vs pull. It’s like the concept of a dark cockpit in modern planes. Rather than staring at dozens of lights and graphs and make sure they have the right color and right value, we keep working without distractions and get notified when a potential problem arises.
The checks don’t do anything other than informing us through an (email) notification. We don’t add “smart” logic to automatically try and correct the problem. The entire point of these checks is that the state doesn’t match our expectations, so those events will be rare and require investigation. At best it’s premature optimization to worry about a solution before, and worst case automatically correcting it based on assumptions which no longer hold cause even more problems.
Over the years we’ve added plenty of these checks in our existing systems, and we will definitely use them again this time. They’ve saved us a lot of headaches by catching issues early on, and quite often the fix was trivial. Some examples:
A loose network cable
We run several scripts on the server where we expect the output to be empty. For example, we run a command like the one below to check the status of the network.
$ ethtool eth0 | egrep 'Speed|Duplex' | egrep -v '(10000Mb/s|Full)'
When the output is empty everything is as it should be. When there’s output, we get an email about it. At one point we got the following email message:
Because we found out right away it was easy to trace the problem down to a bad cable which was causing network errors in our data center.
Just like the example above, we have many more of these little bots checking all kinds of configuration. Sometimes we hash the entire output to make it easy to check if the output still matches. For example, this command checks if all services are running with the exact status we expect them to have:
$ systemctl -a | md5sum | grep -v 36adc2b4e6a5798c9a8348cfbfcd00e0
When there’s output, something’s off. We do the same for all kinds of other configuration (maximum number of open files, kernel modules, IPv6 support, firewall rules etc). This has especially saved us time when updating packages or the OS itself. Even when you read the changelog, some subtle changes in defaults or configuration file locations might break your assumptions.
We also have scripts checking configuration files against typos or security issues (such as constantly checking our web server config for common misconfigurations). If we would ever edit the config and make a mistake, we’ll get an email a minute later.
Many APIs have a tendency to change over time. Of course you try to keep everything updated, but there’s no guarantee you’ll catch all unexpected behavior or that you have time to look into every new change.
We use Stripe to process our credit card charges. Obviously it’s important this works correctly. To make sure we’re not missing any important billing state or changes we didn’t even know existed, we automatically send ourselves an error message whenever we receive a webhook we didn’t know about at the time we wrote the code.
Although we process most invoices automatically using Stripe, many larger B2B accounts often pay using wire transfers. Automating every last bit isn’t always the right trade off, but we’re also not going to have staff to waste time on making sure invoices get paid. So we have a simple bot which notifies us every week if an unpaid invoice has a due date in the past. It keeps emailing us every week so we can’t forget.
In 2020, some EU countries temporarily lowered their VAT rates for a few months because of the pandemic. Who knew? We certainly didn’t. Luckily, we didn’t have to, because our VAT-rates-bot told us on the 1st of the month that something was off.
All our other accounting is code, too. We can calculate down to the cent that the balance of our accounts should be equal to the sum of the opening balance and all transactions and invoices we’ve recorded since. The strangest error we’ve discovered this way is that someone on the other side of the world accidentally paid their speeding ticket to our account instead of the traffic authorities. You can’t make this stuff up, but luckily our bots told us about it.
Sometimes you introduce a feature where you have to make assumtions on how it’s going to be used. For example, we allow users to create workspaces when they work in multiple teams. For the first version, the UI won’t be very advanced. Will it work well with 5 teams? Sure. Will it work with 100 teams? Maybe not. But why worry about something that might never happen. We put in a “soft assert” (which warns us but lets the user continue), and investigate what to do about it when we start seeing people go over limits.
And of course the list goes on. Sometimes new problems come up, or we read about issues which might be relevant for us too, and we add a little check. Set and forget. Assert all the things.