Perils of caching

We try to design our software to eliminate entire classes of bugs. State updates happen in transactions so they’re all or nothing. We favor immutable state and INSERTs over UPDATEs. We like simple, functional (or idempotent) code that is easy to debug and refactor.

Caching, by contrast, exposes you to new classes of bugs. That’s why we’re cache-wary: we’ll cache if we have to, but frankly, we’d rather not.

Downsides of caching:

  1. Introduces bugs
  2. Stale data
  3. Performance degradation goes unnoticed
  4. Increased cognitive load
  5. Other tail risks

Caching introduces an extra layer of complexity and that means potential for bugs. If you design your cache layers carefully (more about that later) this risk is reduced, but it’s still substantial. If you use memcached you can have bugs where the service is misconfigured and listens on a public IP address. Or maybe it binds on both IPv4 and IPv6 IPs, but only your IPv4 traffic is protected by a firewall. Maybe the max key size of 250 bytes causes errors, or worse, fails silently and catastrophically: key truncation results in a collision and one user gets another user’s data from the cache. Race conditions are another concern. Your database, provided you use transactions everywhere, normally protects you against them. If you try to write stale data back into the database your COMMIT will fail and roll back. If you mess up and write data from your cache to your database you have no such protection. You’ll write stale data and suffer the consequences. The bug won’t show up in low-load situations and you’re unlikely to notice anything is wrong until it’s too late.

There are also some more benign problems caused by cache layers. The most common one is serving stale data to users. Think of a comment section on a blog where new comments don’t show up for 30 seconds, or maybe all 8 comments show up but the text above still says “Read 7 comments”. In another app you delete a file but when you go back you still see the file in your “Recents” and “Favorites”. I guess it’s easier to make software faster when you don’t care about correctness, but the user experience suffers.

Many other types of data duplication are also caching by another name. If you serve content from a CDN you have to deal with the same issues I touched on above. Is the data in sync? What if the CDN goes down? If content changes can you invalidate the CDN hosted replica instantly?

Maybe you decide to reduce load from your primary database server by directing simple select queries to read-only mirrors. It’s also a form of caching, and unsurprisingly you have to deal with the same set of problems. A database mirror will always lag behind the primary server. That can result in stale data. What if you try to authenticate against the mirror right after changing your password?

Caching layers can also obscure performance issues. An innocent patch can make a cheap function really slow to run, but when you cache the output you won’t notice. Until you restart your service and all traffic goes to a cold cache. Unable to deal with the uncached load, the system melts down.

Things that used to be simple are complicated now. That’s really what it all boils down to. Caching layers complicate the architecture of your system. They introduce new vectors for serious security vulnerabilities, data loss, and more. If you don’t design your caching abstractions carefully you have to think about the caching implications every time you make simple changes or add simple features. That’s a big price to pay when the only benefit of caching is performance.

Alternatives to caching

  1. Buy faster hardware
  2. Write smarter algorithms
  3. Simplification
  4. Approximation

Buying faster hardware is an underrated alternative. Really fast servers are not expensive anymore, and dedicated servers are much faster than anything you can get virtualized on the cloud. Dedicated servers have their own downsides, I won’t deny that. Still, performance is just much less of an issue when you have a monolithic architecture that runs on a server with 32 cores and 500GB of RAM and more NVMe storage than you will ever need.

Algorithmic simplifications are often low-hanging fruit. Things get slow when data is processed and copied multiple times. What does a typical request look like? Data is queried from a database, JSON is decoded, ORM objects are constructed, some processing happens, more data is queried, more ORM objects are constructed, new data is constructed, JSON is encoded and sent back over a socket. It’s easy to lose sight of how many CPU cycles get burned for incidental reasons that have nothing to do with the actual processing you care about. Simple and straightforward code will get you far. Stack enough abstractions on top of each other and even the fastest server will crawl.

Approximation is another undervalued tool. If it’s expensive, for one reason or another, to tell the user exactly how many search results there are you can just say “hundreds of results”. If a database query is slow look for ways to split the query up into a few simple and fast queries. Or overselect and clean up the data in your server-side language afterwards. If a query is slow and indices are not to blame you’re either churning through a ton of data or you’re making the database do something it’s bad at.

Our approach

The kind of caching we think is mostly harmless is caching that works so flawlessly you never even have to think about it. When you have an app that’s read-heavy, maybe you can just bust the entire cache any time a database row is inserted or updated. You can hook that into your database layer so you can’t forget it. Don’t allow cache reads and database writes to mix. If busting the entire cache on a write turns out to be too aggressive you can fine-tune in those few places where it really matters. Think of busting the entire cache on write as a whitelisting approach. You will evict good data unnecessarily, but in exchange you eliminate a class of bugs. In addition, we think short-duration caches are best. We still get most of the benefit, and this way we won’t have to worry about becoming overly reliant on our cache infrastructure.
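
A rough sketch of what that hook could look like, assuming Django’s model signals and cache framework (the stack we describe in later posts); the helper names and the 30-second timeout are just for illustration:

# Bust the entire cache on any write, hooked into the ORM so it can't be forgotten.
from django.core.cache import cache
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

CACHE_TIMEOUT = 30  # seconds: keep cache entries short-lived on purpose


@receiver([post_save, post_delete])
def bust_cache(sender, **kwargs):
    # Any insert, update, or delete clears everything. Crude, but it
    # eliminates the "stale data written back" class of bugs.
    cache.clear()


def cached_render(key, render_fn):
    # Read-through helper: cache reads only ever mix with cache writes,
    # never with database writes.
    html = cache.get(key)
    if html is None:
        html = render_fn()
        cache.set(key, html, CACHE_TIMEOUT)
    return html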

We also make some exceptions to our cache-wary attitude. If you offer an API you pretty much have to cache aggressively, simply because your API users are going to request the same data again and again. Even when you provide bulk API calls and options to select related data in a single request you’ll still have to deal with API clients that take a “Getta byte, getta byte, getta byte” approach and fetch one row at a time.

As your traffic grows you’ll eventually have to relent and add some cache layers to your stack. It’s inevitable but the point where caching becomes necessary is further down the road than you’d think. Until the time comes, postpone caching. Kick that can down the road.

DIY javascript error logging

There are many SaaS products out there that help you with javascript error and event logging, but in this blog post I want to make the case for rolling your own solution.

We log 3 types of events: (1) javascript exceptions with stack traces, (2) failed assertions, and (3) general usage/diagnostics information.

Exception handling

We can use a global event handler to log exceptions. This used to be somewhat difficult, but nowadays window.onerror works great. The browser gives you everything you need: the Mozilla docs describe how the window.onerror callback receives the error message, the source URL, the line and column numbers, and the Error object itself.

You can even get a pretty good stacktrace with Error.stack. It’s not part of the official web standard, but it works on all major browsers and that’s good enough. Once you’ve collected all the data you want to log you can just send it to your server with an ajax request. Alternatively, you can use an <img> tag. Something like this works just fine:

// 'obj' holds whatever error data you collected above
let errimg = document.createElement('img');
errimg.src = '/jserror.png?e=' + encodeURIComponent(JSON.stringify(obj));
document.querySelector('body').appendChild(errimg);

One thing to watch out for is that GET requests can get truncated. You also want to make sure that you don’t log errors from inside your error handler (otherwise you’ll DDoS yourself :)) and you probably want to drop errors you’ve already reported. Reporting an exception once per session is enough for debugging purposes.
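
The receiving end doesn’t need much either. A minimal sketch of the server-side view, assuming Django (the URL and the e field match the <img> example above; everything else is illustrative):

# Sketch of a view that receives the error beacon sent by the <img> tag above.
import json

from django.core.mail import mail_admins
from django.http import HttpResponse


def jserror(request):
    try:
        # Be defensive: truncated GET requests or junk should never raise here.
        data = json.loads(request.GET.get('e', '{}'))
        mail_admins('JS error', json.dumps(data, indent=2))
    except Exception:
        pass  # never let the error logger itself become a source of errors
    # An empty response is fine; serving a real 1x1 gif would also work.
    return HttpResponse(status=204)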

What metadata you want to log is up to you, but we find it useful to log these things:

  • Username, account name. If you find a bug and fix it you want to tell people you fixed the bug, but you can’t do that if you don’t know which people got the error message.
  • Browser version. Helps when you want to replicate the bug. This was super important back in the IE6-9 days, when you had to make tons of browser-specific workarounds. Nowadays you mainly want to know if people are using a really old or unusual browser.
  • Javascript app bundle version and page load timestamp. Some people keep their browser open for weeks at a time and you don’t want to waste hours trying to replicate a bug that has been fixed ages ago.
  • Adblocker usage. Add a <div> with a bunch of spammy keywords to your page. Use setTimeout to check the boundingRect of that node a couple seconds after your page has finished loading. If the node is gone, you know they have an adblocker installed.

Be careful not to log anything that could contain customer data. Easier debugging is great, but not when you have to compromise your customer’s privacy to do it. It’s fine to log counts, IDs, and checksums. If you can’t figure out how to replicate the bug with only a stack trace to guide you then you can always add more asserts to your code and wait for one of them to trigger.

Assertions

To debug exceptions you only have a stack trace to work with. Debugging is a lot simpler when you make liberal use of assertions in your clientside code. You can use the same error logging code you use for exceptions, but asserts can log some extra diagnostics variables.

Usage tracking

Every time you add a new feature to your product you want to track if it gets used. If not, figure out why not. Is the feature too hard to discover? Do people just not care about it? Adding a tracking hook takes 1 minute, but the insights you get are invaluable.

Our rule of thumb: we get an email notification every single time a new feature is used. If the notifications don’t drive us nuts that means we built the wrong thing. This really helps us calibrate our intuition. And it’s motivating to see the notifications flow in right after you push to production!
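
A sketch of what such a hook could look like on the backend, assuming Django’s mail_admins; track() and the log path are made-up names:

# Minimal usage-tracking hook: append one line per event and send ourselves an email.
import datetime

from django.core.mail import mail_admins

USAGE_LOG = '/var/log/myapp/usage.log'   # hypothetical path


def track(user, event, details=''):
    line = f'{datetime.datetime.utcnow().isoformat()} {user} {event} {details}\n'
    with open(USAGE_LOG, 'a') as f:
        f.write(line)
    mail_admins(f'[usage] {event}', line)


# e.g. in a view: track(request.user, 'export-pdf', f'report {report_id}')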

You also want to track how often users get blocked by your software. Every time a user wants to do something but they get a “computer says no!” message they get a little bit unhappy with your software. They upload a file and it doesn’t work because the extension is wrong or the file is too large? Log it and fix the problem. Sometimes the fix can be as simple as telling users the file is too large before they have uploaded it. Instead of a simple “access denied” error see if you can make the error more helpful. You can add a button “ask administrator (name) for permission”. Measure which problems users run into and fix them one by one.

Serverside

We take a whitelisting approach. We get email notifications about everything to start with. Then we add filters for all the errors we can’t do much about. Errors caused by connection timeouts, errors caused by virus scanner browser plugins, things like that. Every javascript exception potentially breaks your entire site for some users. That means every exception is worth investigating. You’ll inevitably discover your site breaks when an ajax POST times out, or when a dependency fails to load. Or when an adblocker removes some DOM nodes. No matter how well you test your software, your users will find creative ways to break it.

You can also use feature usage tracking for spam/fraud detection. If your SaaS service is inexpensive it will be used by credit card fraudsters to test if their stolen credit cards work. You can easily distinguish between real users and bots or fraud signups by comparing some basic statistics on feature usage and which buttons have been clicked.

If you use a 3rd party service for error logging you can’t cross-reference data. You can’t easily discover which features get used by people who end up buying vs by trial users that fizzle out. If you have subscription data in one database and usage/error tracking in another database querying gets complicated, so you won’t do it.

Another reason why we want to do our own event logging is that we might accidentally log something that contains sensitive data. Our own logs rotate automatically, but 3rd party logging/event services will hang on to that data indefinitely.

Writing your own javascript error/event logging code isn’t much work and it will give you valuable insight into how people try to use your software and the bugs they run into.

Don’t let mistakes cascade

You’re in the kitchen, preparing a meal. You’re in a hurry and you’re hungry. So you move fast. You grab a plate from a cupboard but you drop it and it shatters. You bend over to pick up some shards and as you get up you hit your head on the cupboard door you left open. You curse and rub your head. As you walk to the trash bin to throw away some of the broken ceramic you notice a pot boiling over. You rush back to the stove to turn off the heat and step on a shard you hadn’t picked up earlier. Now your foot is bleeding. You want to move the pot from the stove but there is no available counter space. You try to shove it on there anyway. A plate pushes into a cutting board that in turn pushes into a couple of glasses that were precariously placed right next to the sink. They fall in, and break.

It’s 2am and your phone buzzes. You see a notification your app is down. You’re confused and wonder if it’s a false alarm, but you look at your email and you see a bunch of angry messages. Oh crap. You’re exhausted and groggy, but you open your laptop and look at some logs. 500 errors everywhere. You realize yesterday’s feature update is the problem. You revert the code but the database is now newer than the app expects, and the ORM doesn’t know how to deal with the new columns. Now your app service doesn’t start at all anymore. You decide to drop the columns you added yesterday, but in your haste you drop the wrong column from the database and now you’re in a lot of trouble. Do you restore the database from backups? How old are the backups, do you accept the data loss? How long does restoring from backups take anyway? Do you want to restore only the missing column from backups? How long will that take? How will you fix all data inconsistencies? It’s now 2:30am, you can barely think straight, everything is down, your database is broken, and all your options look terrible.

These are stories of cascading mistakes. With one unforced error after another even something small can turn into a major headache. But errors don’t have to compound like this. If you just take a moment to stop and think these problems almost disappear. Imagine this, instead:

You’re in the kitchen, preparing a meal. You drop a plate and it shatters. You stop and pause for a full 10 seconds. Ask yourself what your next action should be. Answer: turn off the stove. Close the cupboard. Move things out of the way you might bump into. Then slowly clean up all the shards. Then stop for 10 seconds and ask yourself if you forgot something else. You decide to free up counter space by loading up the dishwasher. Then you resume cooking. Total delay? Maybe 5 minutes. Really no big deal.

It’s 11 at night and you finish a feature you’ve been working on. You’ve tested it, and it looks OK. You decide it’s too late to push to production. Certainly too late to do a database migration. The next morning you make some coffee and launch your feature. Everything seems fine, but after a couple of minutes you notice something weird in the error logs. A customer emails asking if the service is down. Uh-oh. You think for a minute and decide not to do a full rollback — you already migrated the database after all — but decide instead to stub out the feature. You only have to change one line of code. You reply to the customer with an apology. You fire up your dev VM and fix the bug. Looks good. Push to production. Email the customer again to inform them the problem is resolved. You’re not happy about the bumpy release, but it really wasn’t so bad.

Everybody messes up sometime. We do, too. But we’ve never had significant downtime. Never lost customer data. Never had a database migration go badly wrong. In part it’s luck, but in part it’s because we try hard not to make bad things worse.

  1. when something breaks the first thing we do is stop and reflect
  2. then we diagnose
  3. then we stop and think how the fix might backfire on us
  4. then we ask ourselves if the fix is something we can roll back if need be
  5. then we stop again to think of an easier fix
  6. and only then do we apply the fix and test if it worked

Afterwards, we look for ways to eliminate the root cause of the problem. In the case above, it’s better to release the database migration and the feature separately. That way you can roll back a buggy feature without even thinking about it. Additionally, you want to feature flag complicated new features. That way you can gradually release features in production, and when trouble arises you can just turn the feature off. It takes basically no extra effort to take these precautions, and they’ll save you a lot of time and aggravation when you do something dumb like pushing to production right before you go to bed.

Some more lessons we learned the hard way about cascading problems:

  1. Don’t do routine server maintenance when you’re in a hurry, tired, or distracted. Regular maintenance should only take a few minutes, but you have to be prepared for things to go very wrong. If you do maintenance at 11pm you risk having to work throughout the night and that’s just asking for mistakes to compound. Maintenance work is easy, but you want to do it in the morning when you’ve had your coffee and you’re fresh.
  2. Don’t hit send on emails when you’re tired or annoyed. It’s OK to let a draft be a draft and you can hit send in the morning.
  3. Have local on-machine nightly backups of all config in /etc/, deployment files, and everything you might break by accident. If you do something dumb and need to roll back in a hurry nothing beats being able to restore something with cp.

    Config backups like these saved me twice: one time I deleted a bunch of files in /etc on a production server that were necessary for the system to boot. Figuring out which Debian packages corresponded to the missing files is tricky and besides, the package manager won’t run if /etc is missing. Print directory structure with find for /etc and /backup/etc. Use diff to see which files are missing. cp -arf to restore. Use the -a (archive) flag so you restore user, group, access permissions, and atime/mtime along with the files themselves.

    Another time our JS compressor crashed and output partial data (on perfectly valid Javascript input no less) and there was no quick way to diagnose the problem. Our entire app was effectively down, so I needed a quick fix. It’s at those times that you really appreciate being able to restore last night’s JS bundle with a single copy command.

    You need a simple backup system that works every time. Some people put /etc under version control, but this isn’t great because every server is at least somewhat unique (e.g. /etc/hosts, ip bindings). Nightly backups that allow you to simply diff and see what changed will never fail you. Many backup systems try to be too clever and you need to google command line options for basic operations. rdiff-backup gets almost everything right, although it breaks if you try to back up too much data.
  4. Learn how to boot from a rescue environment and chroot into your system. chroot is basically magic. You can use this to fix your boot partition, broken packages, broken firewall/network config, mangled kernel installs and more.

    We’ve only had to use this trick twice in the last 10 years. If a server doesn’t come back after a reboot and you start sweating, that’s when you need to keep your head cool and do some quick diagnostics. You can fail over, but failover is not without risks, and if you have no clue what happened you don’t know if the failover server(s) will collapse in the same way. Downtime sucks, but you always have 5 minutes to do some preliminary diagnostics and to think carefully about the next steps to take.

The lesson here is so simple and also one of the hardest ones for me personally to learn:

Slow down. Breathe. Don’t make things worse. Consider your options before acting.

Mistakes are unavoidable, but if you don’t let small mistakes cascade into something bigger you’ll notice you’ll spend very little of your time putting out fires.

Assert all the things

In programming, we use assertions all the time to make sure our assumptions about what the code should do are (still) valid. Whenever the resulting state is not what we expect it to be, the program stops and we receive some sort of error notification.

We’ve set up our systems to email us about it, so in the below example we would get an email if some_function no longer works as we expect it to:

offset = some_function()
assert offset % 4096 == 0, "The resulting value should be a multiple of 4096"

We use this concept of assertions everywhere in our business. Some of those are in code, but many others run as periodic scripts (cronscripts). They’re small bots, if you will, constantly checking whether all kinds of “business state” are still what we expect them to be. Those checks can be about all kinds of things: from accounting, to hardware, to invoices and security.

Rather than having to manually check if everything works as expected, we have our systems tell us when something is wrong. Push vs pull. It’s like the concept of a dark cockpit in modern planes. Rather than staring at dozens of lights and graphs and making sure they have the right color and right value, we keep working without distractions and get notified when a potential problem arises.

The checks don’t do anything other than inform us through an (email) notification. We don’t add “smart” logic to automatically try and correct the problem. The entire point of these checks is that the state doesn’t match our expectations, so those events will be rare and require investigation. At best, worrying about an automatic fix ahead of time is premature optimization; at worst, automatically “correcting” things based on assumptions that no longer hold causes even more problems.
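
The shape of these checks is always the same: read some state, compare it against what we expect, and send an email when it doesn’t match. A sketch (the addresses and the example expectation are placeholders):

#!/usr/bin/env python3
# Sketch of a "business assert" cronscript: it only notifies, it never tries to fix anything.
import shutil
import smtplib
from email.message import EmailMessage


def notify(subject, body):
    msg = EmailMessage()
    msg['Subject'] = f'[check] {subject}'
    msg['From'] = 'bots@example.com'      # placeholder addresses
    msg['To'] = 'team@example.com'
    msg.set_content(body)
    with smtplib.SMTP('localhost') as smtp:
        smtp.send_message(msg)


def check(description, ok, details=''):
    # Like assert, but for business state: silence when fine, an email when not.
    if not ok:
        notify(description, details)


if __name__ == '__main__':
    # Example expectation; swap in whatever state you actually care about.
    free_gb = shutil.disk_usage('/').free / 1e9
    check('disk space', free_gb > 50, f'only {free_gb:.0f} GB free on /')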

Over the years we’ve added plenty of these checks in our existing systems, and we will definitely use them again this time. They’ve saved us a lot of headaches by catching issues early on, and quite often the fix was trivial. Some examples:

A loose network cable

We run several scripts on the server where we expect the output to be empty. For example, we run a command like the one below to check the status of the network.

$ ethtool eth0 | egrep 'Speed|Duplex' | egrep -v '(10000Mb/s|Full)'

When the output is empty everything is as it should be. When there’s output, we get an email about it. At one point we got the following email message:

Speed: 1000Mb/s

Because we found out right away it was easy to trace the problem down to a bad cable which was causing network errors in our data center.

Server configuration

Just like the example above, we have many more of these little bots checking all kinds of configuration. Sometimes we hash the entire output to make it easy to check if the output still matches. For example, this command checks if all services are running with the exact status we expect them to have:

$ systemctl -a | md5sum | grep -v 36adc2b4e6a5798c9a8348cfbfcd00e0

When there’s output, something’s off. We do the same for all kinds of other configuration (maximum number of open files, kernel modules, IPv6 support, firewall rules etc). This has especially saved us time when updating packages or the OS itself. Even when you read the changelog, some subtle changes in defaults or configuration file locations might break your assumptions.

We also have scripts checking configuration files for typos or security issues (such as constantly checking our web server config for common misconfigurations). If we ever edit the config and make a mistake, we’ll get an email a minute later.

Stripe webhooks

Many APIs have a tendency to change over time. Of course you try to keep everything updated, but there’s no guarantee you’ll catch all unexpected behavior or that you have time to look into every new change.

We use Stripe to process our credit card charges. Obviously it’s important this works correctly. To make sure we’re not missing any important billing state or changes we didn’t even know existed, we automatically send ourselves an error message whenever we receive a webhook we didn’t know about at the time we wrote the code.
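
A sketch of that idea, assuming Django and Stripe’s event format (the handled event types listed here are just examples, and in production you’d verify the webhook signature first):

# Sketch: alert on any webhook event type we didn't explicitly handle when writing the code.
import json

from django.core.mail import mail_admins
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

HANDLED = {'invoice.paid', 'invoice.payment_failed', 'customer.subscription.deleted'}


@csrf_exempt
def stripe_webhook(request):
    # In production, verify the Stripe-Signature header before trusting the payload.
    event = json.loads(request.body)
    kind = event.get('type', 'unknown')
    if kind in HANDLED:
        ...  # dispatch to the normal billing code here
    else:
        mail_admins('Unhandled Stripe webhook', kind)
    return HttpResponse(status=200)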

Manual invoices

Although we process most invoices automatically using Stripe, many larger B2B accounts often pay using wire transfers. Automating every last bit isn’t always the right trade-off, but we’re also not going to have staff waste time on making sure invoices get paid. So we have a simple bot which notifies us every week if an unpaid invoice has a due date in the past. It keeps emailing us every week so we can’t forget.

Accounting

In 2020, some EU countries temporarily lowered their VAT rates for a few months because of the pandemic. Who knew? We certainly didn’t. Luckily, we didn’t have to, because our VAT-rates-bot told us on the 1st of the month that something was off.

All our other accounting is code, too. We can calculate down to the cent that the balance of our accounts should be equal to the sum of the opening balance and all transactions and invoices we’ve recorded since. The strangest error we’ve discovered this way is that someone on the other side of the world accidentally paid their speeding ticket to our account instead of the traffic authorities. You can’t make this stuff up, but luckily our bots told us about it.
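
The reconciliation itself is only a few lines. A sketch (the figures and names are made up; the point is the assert):

# Sketch: the books must balance to the cent, or we get an email about it.
from decimal import Decimal


def reconcile(opening_balance, transactions, current_balance):
    # Opening balance plus everything we recorded should equal what the bank says.
    expected = opening_balance + sum(transactions, Decimal('0'))
    assert expected == current_balance, (
        f'books off by {current_balance - expected} '
        f'(expected {expected}, bank says {current_balance})')


# e.g. reconcile(Decimal('1000.00'),
#                [Decimal('250.00'), Decimal('-39.99')],
#                Decimal('1210.01'))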

Feature limits

  Sometimes you introduce a feature where you have to make assumptions about how it’s going to be used. For example, we allow users to create workspaces when they work in multiple teams. For the first version, the UI won’t be very advanced. Will it work well with 5 teams? Sure. Will it work with 100 teams? Maybe not. But why worry about something that might never happen? We put in a “soft assert” (which warns us but lets the user continue), and investigate what to do about it when we start seeing people go over limits.

~~~

And of course the list goes on. Sometimes new problems come up, or we read about issues which might be relevant for us too, and we add a little check. Set and forget. Assert all the things.

Insecure defaults considered harmful

Anything that involves input or output should not just be considered unsafe but actively hostile, much like the critters in Australia. The programming languages and libraries you have to use are not designed with security in mind. This means you have to be totally paranoid about everything.

Let’s go over a few places where you have to deal with insecure defaults.

Zip archives

Suppose you add some backup feature to your app where users can download their files as a .zip, or maybe you let users upload a theme as a zip file. What could possibly go wrong?

Let’s start with the zip-slip vulnerability, allowing attackers to write anywhere on the system or remotely execute programs by creating specially crafted zip files with filenames like "../../evil.sh". This kind of attack made a big splash on the internet a couple of years ago. Many archive libraries were affected, and with those many libraries probably thousands of websites.

Most programmers will just use a zip library and not think hard about all the ways it can blow up in their face. That’s why libraries should have safe defaults. Relative paths should not be allowed by default. Funky filenames should not be allowed (e.g. files with characters in them, like backslashes, that are forbidden on other platforms). Because the libraries don’t do these checks for you, it’s up to you to reject everything that looks sus. Use of unicode should be highly restricted by default as well; more about that in a bit.

Zip exploits have happened before, of course. Take zip bombs, for instance. Zip bombs are small files when zipped but get huge when decompressed. Zip bombs are at least 20 years old, and yet I don’t know of a single zip library for any programming language that forces the programmer to even think about the possibility that unzipping a small file can fill up all disk space on their server and thereby crash the whole thing.

It’s pretty strange, when you think about it. In most cases the programmer knows, within an order of magnitude, what a reasonable unzip size is. It might be 100MB, it might be a gigabyte or more. Why not force the programmer to specify what the maximum unzip size should be?
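
Until libraries force that decision on you, you have to enforce the budget yourself. A sketch using Python’s zipfile (the limits are examples, and we count bytes as we extract rather than trusting the sizes recorded in the archive):

# Sketch: unzip with an explicit budget instead of trusting the archive.
import os
import zipfile

MAX_TOTAL_BYTES = 100 * 1024 * 1024   # ~100MB; pick what's reasonable for your app
MAX_MEMBERS = 1000


def safe_extract(zip_path, dest):
    remaining = MAX_TOTAL_BYTES
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.infolist()
        if len(members) > MAX_MEMBERS:
            raise ValueError('too many files in archive')
        for info in members:
            # Reject zip-slip style names (absolute paths, "..") outright.
            target = os.path.realpath(os.path.join(dest, info.filename))
            if not target.startswith(os.path.realpath(dest) + os.sep):
                raise ValueError(f'suspicious path in archive: {info.filename!r}')
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Count actual decompressed bytes; don't trust the header sizes.
            with zf.open(info) as src, open(target, 'wb') as dst:
                while True:
                    chunk = src.read(64 * 1024)
                    if not chunk:
                        break
                    remaining -= len(chunk)
                    if remaining < 0:
                        raise ValueError('archive exceeds size budget')
                    dst.write(chunk)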

When your unzip library doesn’t enforce limits you have to get creative. You can unzip to a separate unzip partition that is small. That way any unzip bombs will detonate harmlessly. But really, is it reasonable to go through this trouble?

It’s not just about disk space. You want limits for all system resources. How can you limit how much memory can be allocated during the unzip process? How can you limit how much wall time you’re willing to allocate? You can also use ulimit or a whole virtual machine, but that introduces a great deal of extra complexity and complexity is another source of bugs and security vulnerabilities.
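
If you go the ulimit route you can at least keep the limits close to the code, by pushing the risky work into a child process. A sketch (Linux-only; the numbers are arbitrary examples):

# Sketch: run untrusted processing in a child process with hard resource limits.
import resource
import subprocess


def run_limited(cmd, mem_bytes=512 * 1024 * 1024, cpu_seconds=10):
    def set_limits():
        # Applied in the child just before exec: cap address space and CPU time.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    return subprocess.run(cmd, preexec_fn=set_limits,
                          capture_output=True, timeout=cpu_seconds * 3)


# e.g. run_limited(['unzip', '-o', 'upload.zip', '-d', '/tmp/unzip-scratch'])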

Unicode

Unicode is the default for everything these days, and in the coming years we are going to see many creative unicode exploits. In the zip example above, all filenames and file paths are unicode and can contain, among many other things, funky zero-width characters.

Unicode characters can trip you up in many ways. Suppose you have a webapp where people log in with a username. What could go wrong when you allow zero-width spaces inside usernames? It can go very wrong when you strip whitespace inconsistently.

For example, during registration you only strip ascii whitespace (space, tab, newline, etc) when checking if a user with that username already exists, but you strip all unicode whitespace when saving the user to the database. An attacker can exploit this by registering a new user with a zero-width space added to the victim’s username. Two user rows will then be returned by a login query like this:

SELECT * FROM users WHERE username = 'bobbytables' AND pw_hash = 123

And databases typically return the oldest row first if no sort order is given, meaning the attacker has just logged on as the victim using his own password.

Layered paranoia helps here. First select the user row based on the username. If two rows are returned, bail. Only then validate whether that row matches the given password. You also want to use database uniqueness constraints so you can never end up with two rows in your user table with the same username.
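
A sketch of that layered check, in plain DB-API style with sqlite-style placeholders (the sha256 stand-in should be a real password hash like bcrypt or argon2 in practice):

# Sketch: select by username first, demand exactly one row, only then check the password.
import hashlib
import hmac


def authenticate(cursor, username, password):
    cursor.execute('SELECT id, pw_hash FROM users WHERE username = ?', (username,))
    rows = cursor.fetchall()
    if len(rows) != 1:
        # Zero rows: unknown user. Two or more: something is very wrong, bail.
        return None
    user_id, pw_hash = rows[0]
    candidate = hashlib.sha256(password.encode()).hexdigest()   # stand-in for a real password hash
    if not hmac.compare_digest(candidate, pw_hash):
        return None
    return user_id

# And as a second layer, a uniqueness constraint in the schema itself:
#   CREATE UNIQUE INDEX users_username_unique ON users (username);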

XML – SVG

XML libraries often support external entities. Basically, you can upload an innocent-looking XML file that, when parsed, pulls in a different file such as /etc/passwd, and in some cases this can even lead to full remote code execution.

A famous example here is ImageMagick, a popular graphics library used to create thumbnails for images. Upload a malicious image, BOOM, remote code execution (ImageTragick). This vulnerability existed for many years. ImageMagick was never intended to be used to process untrusted images passed through via web services. It’s just a product of a different era.

Any time you deal with XML files (or XML adjacent formats) you have to specifically check if the file format supports remote includes, and how the library deals with it. Even if remote includes just involve HTTP requests, and not access to your file system, you might still be in trouble. If you download the contents of a URL on behalf of a user the HTTP request is coming from inside your network. That means it’s behind your firewall, and if it’s a localhost request, it might be used to connect to internal diagnostics tools.

Maybe your http server runs a status package, like Apache Server Status. This page lists the most recent access log entries, and is accessible by default only from localhost. If a localhost access rule was your only layer of defense you’re now in trouble. Your access logs can contain sensitive info like single-use password-reset tokens.

User uploads malicious SVG file -> ImageMagick resolves External Entity and fetches Access Log via HTTP -> Renders to PNG and displays to user as thumbnail.

It’s hard to predict in advance how innocent features can be combined into critical security failures. Layers of defense help here. Limit what kind of image files can be uploaded. Google for best security practices for the libraries you use. Most foot-guns are well known years before the big exploits hit the mainstream.

Regular expressions

Regular expressions are great, but it’s still too easy to introduce serious denial of service vulnerabilities in your code by being slightly careless. We’ve been burned by this a few times. A simple regular expression that caused no trouble for years suddenly eats gigabytes of memory. Because of the memory pressure the linux OOM killer decides to kill memcached or the SQL server and then everything grinds to a halt.

What kind of regular expression can cause that kind of mayhem? Easy, one that looks like this: (a|aa)*c

A regular expression by default tries to find the longest match. This can result in exponential backtracking. For more reading see ReDoS on wikipedia. If you make it a habit to look for shortest matches, using *? instead of *, you’re much less likely to write an exploitable regular expression. If you also validate input length and keep your regular expressions short and simple you should be fine.
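
The cheapest defenses are to bound the input before the regex ever sees it and to avoid nested quantifiers. A rough sketch using the example pattern above:

# Sketch: bound the input and keep the pattern simple before matching.
import re

MAX_INPUT_LEN = 10_000

# The troublemaker: nested/overlapping repetition invites catastrophic backtracking.
#   DANGEROUS = re.compile(r'(a|aa)*c')
# A flat pattern that matches the same strings, with a single quantifier,
# doesn't blow up on inputs like 'aaaa...a' with no 'c'.
SAFE = re.compile(r'a*c')


def find_match(text):
    if len(text) > MAX_INPUT_LEN:
        raise ValueError('input too long to run through this regex')
    return SAFE.search(text)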

Regular expressions are incredibly useful. Still, regular expression engines would be way more useful if you could give them a time and memory budget. If a regular expression is intended to take a few milliseconds I want an exception thrown when it takes 2 seconds instead. If a regular expression starts allocating memory I want to know about it, before it turns into a big problem.

Working towards good defaults

Take the golang zip library as an example.

Instead of validating input and providing a safe API by default, the library pushes the responsibility onto the user (the programmer in this case), who is likely to be either too inexperienced or too pressured by deadlines to take all the necessary precautions.

The point here isn’t to suggest Go is bad, but that times have changed and most software written today has to survive in a hostile environment. The only way forward is to move to secure defaults everywhere, with explicit unsafe flags for those cases where you trust the input. It will take a long time before we get there, and until then you need many layers of security.

The advantages of developing in a dev VM with VSCode Remote

Now that we’re both working on a lot of code, need to keep track of versions, and also need to start working on a backend, it’s time to set up our development environment.

Advantages of doing all development in a local VM

All our source code will be stored on our Git server. Locally we both use a development Virtual Machine (VM) running on our laptop/PC, on which we will check out the (mono) source repository. We will use ES6 for the frontend and python3/Django for the backend, but this works pretty much for any stack. Using a local development VM has several advantages:

  • We’ll both have identical set-ups. Diederik uses Windows most of the time, I use a Mac machine. It would become a mess if we tried to work with all kinds of different libraries, packages and framework versions.
  • Easy to create backups and snapshots of the entire VM, or transfer the entire development setup to a different machine (like in case of any coffee accidents).
  • It avoids the mess of having to install packages on our local PCs and resolving conflicts with the OS. For example, the Python version on my macOS is ancient (and even if it wasn’t, it’s probably not the same as on the production server). Trying to override OS defaults and juggle package managers like brew is painful in practice: it’s slow, breaks things all the time and adds a lot of extra friction to the dev stack.
  • Not only do we avoid local package managers, but also the need for other tools in the stack like Python’s virtualenv. We don’t juggle different virtualenvs, just the one environment, which is the same on our VM as on the production server.
  • So not only will the packages be the same between our development environments, they will even be identical to the eventual production server (which we don’t have yet, we will have to set one up later). This way we don’t need anything like a staging server. Except for having virtual hardware, debug credentials and test data, the development VM will mimic the complete CALM production stack.
  • Because of built-in support for remote development in VSCode (which is still local in this case, but on a local VM), all VSCode plugins are going to run with exactly the language and package versions we want. No mess trying to configure Django and Python on macOS with a different OS base install. All plugins will run on the VM, so we’ll also have IntelliSense code completion for all our backend packages and frontend parts in our stack.
  • That also means that we can not only debug issues in the app, but issues in the stack as well, from nginx web server config to websocket performance.

Setting up the VM

I like to use vagrant to easily create and manage virtual machines (which you can use with different providers such as VMware or VirtualBox). To set up a new Debian Linux based VM:

# see https://app.vagrantup.com/debian
pc$ vagrant init debian/bullseye64
pc$ vagrant up
pc$ vagrant ssh 
# now you're in the VM!

In the resulting Vagrantfile you can set up port forwarding, so a Django development server running in your VM will be accessible from the host PC.

Because vagrant ssh is slow, you can output the actual config to ssh into your machine using

pc$ vagrant ssh-config

and then store this in ~/.ssh/config (on the local PC), so it looks something like this:

Host devvm
  HostName 127.0.0.1
  User vagrant
  Port 2020
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  PasswordAuthentication no
  IdentityFile /Users/wim/devvm/.vagrant/machines/default/virtualbox/private_key
  IdentitiesOnly yes
  LogLevel FATAL

To make sure we both have the same packages installed on our VMs (and server later on), we usually create Ansible playbooks (or just a simple bash script when it’s a few packages and settings). We also store our infrastructure config in git, but we’ll go into all of that some other time.

For now, we can just use a very short script to install the packages for our stack:

vm$ sudo apt-get install aptitude && sudo aptitude update && sudo aptitude full-upgrade
vm$ sudo aptitude install python3-pip python3-ipython sqlite3 git
vm$ pip install Django==3.2.12

Now we just need to add our git credentials in the VM’s ~/.ssh/config:

Host thegitserver
  HostName thegitserver.example.com
  User gituser
  ForwardAgent yes
  IdentityFile ~/.ssh/mygit_id

and check out the repository on the VM’s drive:

vm$ mkdir project && cd project
vm$ git clone ssh://thegitserver/home/gituser/project.git

Remote VSCode

Now that all the packages and sources are on the VM, we can set up VSCode on our local machine to work on our workspace on the VM and run extensions on the VM as well (see VSCode’s Remote Development using SSH).

1. Install the Remote - SSH extension.

2. Open the Command Palette and type >ssh; it should show the option Remote-SSH: Connect to Host.

3. Select that, and it should show the Vagrant SSH config we saved in our PC’s ~/.ssh/config earlier under the name devvm.

4. You’ll see we’re now connected to the devvm. Click Open Folder to open the remote workspace, ~/project in our example.

5. The last step is ensuring all the extensions we need are installed on the VM. Click on the Extensions tab, look for the extensions you want, and click “Install in SSH”.

6. That’s it! The plugins now run on the VM, the repository workspace from the VM is opened as our current project, and git integration works out of the box.

We can even use extensions like the Live Server on our ES6 frontend code like before, and run our Django API development server on the VM knowing we have all the correct packages.

Setting up a very basic git server

Just yesterday gitlab was down. Github has network issues on a pretty regular basis. Imagine not being able to push an update to your product because some 3rd party service is down. No thanks! We’ll set up our own git server. Shouldn’t be difficult.

Our general philosophy is to do as much as possible ourselves, for three main reasons. One, we learn a bunch and this will help us troubleshoot when something goes wrong down the road. Two, we enjoy not being dependent on 3rd parties. Three, having your own stuff that never breaks with APIs that never change makes life way better. Yes, sometimes we reinvent the wheel, but that’s alright.

We’re two people and we build small scale apps (couple million users max). This means we don’t need much in terms of infrastructure. We don’t need even 5% of the functionality github has to offer. When we have specific needs we can often duct-tape a handful of Linux command line utilities together. In that spirit we’re setting up our own git server.

Goals: (a) create new repositories easily. (b) push/pull from VSCode. (c) push/pull to release server. (d) email hook on push.

I’m just going to follow the guide on git-scm.com. I’ll create a user called ‘git’ on our server, and set up public key authentication, with blockers for port forwarding. We have a whitelist for ssh logins, so I’m also updating AllowUsers in /etc/ssh/sshd_config. With chsh I’m removing shell access for the git user as well.

If you lock everything down as aggressively as possible then you’re never one configuration file typo away from disaster. We have a firewall that whitelists IPs, a secondary firewall on a switch in the data center, we block users in sshd, disallow password authentication, we disable shells, we use fail2ban to ban/alert on suspicious activity, all sorts of monitoring and we probably have additional security measures I can’t think of right now. We’re big believers in this kind of layered security and I’m sure it will be the subject of future posts.

Now I’m going to deviate a little bit from the git-scm instructions. One, I want to rename the main branch to ‘main’. We can do that with git symbolic-ref HEAD refs/heads/main. Future versions of git will make renaming the main branch easier, but this works.

I’ll also add a simple ‘post-receive’ hook so when any commits are pushed to the git server it’s posted to our wiki and we’ll get a nice email about it.

A basic bash script gets the job done. This is not a robust script that is intended to stand the test of time. We think of it as a type of interactive documentation. When we want to create a new git repository a year or two from now and we forget the steps we can just read the script, ask ourselves if it still looks reasonable and then run it.

It’s a good habit to sanity check your inputs, even on throwaway scripts. It’s easy to shoot yourself in the foot with bash shell expansion, after all.


#!/usr/bin/bash

# usage: ./make-repo myrepo
#
# for git commands see
# https://git-scm.com/book/en/v2/Git-on-the-Server-Setting-Up-the-Server

# http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -euo pipefail
IFS=$'\n\t'

if [[ "$1" =~ [^a-zA-Z0-9_-] ]]; then
        echo "use alphanum git repo name '$1' (exit)"
        exit
fi

sudo -u git mkdir /home/git/$1.git
cd /home/git/$1.git
sudo -u git git config --global init.defaultBranch main
# only repo, not also a checkout
sudo -u git git init --bare
# debian git doesn't have rename head yet
sudo -u git git symbolic-ref HEAD refs/heads/main
# papyrs wiki hook
sudo -u git ln -s /home/utils/git_receive_hook.py hooks/post-receive

Now it’s just a matter of adding the remote to my local git repository and we’re off to the races:

git remote add origin 80daysgit:/home/git/testproj.git

And I’ll add an entry to my ~/.ssh/config:

Host 80daysgit
  HostName [redacted]
  User git
  ForwardAgent yes
  IdentityFile ~/.ssh/80daysgit_id_rsa

That’s it. The remote shows up in VSCode automatically and I can push/pull with a click of a button.

Metal with a sprinkle of cloud: our CALM server stack

Which stack to pick is a recurring topic, and we’re big believers in how dedicated hardware [1] is (still!) a great option, especially for bootstrapped (SaaS) startups.

A few people pointed out we use a few cloud services as well, so which is it, cloud or metal? And what exactly does our favorite stack look like then?

I like to call our stack CALM. The purpose of CALM is in the name: we want the most bang for buck while keeping things simple and boring. We’re a tiny team and want our SaaS to scale to many customers (potentially millions of users) without having to worry about complexity and cost. Let’s look at the different parts:

1U ought to be enough for anybody!

If you’re puzzled by the “one machine” part: yes, that’s how ridiculously cheap and powerful metal is these days.

For ~$150/month you can get an absolute beast of a machine, with 128GB of RAM, 2TB SSDs and dozens of 5GHz cores. You don’t pay for each byte of traffic, each cycle of CPU or a few GBs extra of RAM. It makes tiny cloud compute nodes look like a toy. It’s powerful enough that you can run everything on one machine for a long while, so your architecture becomes super simple at this point. When things start to grow further, we’ll have plenty of RAM to cache queries and dynamic parts of the app.

Sprinkle on the cloud!

This already is enough to handle many thousands of users, no problem. But to make this even more robust, this is where we like to sprinkle a bit of cloud on top:

By using CloudFlare as a reverse proxy, we significantly reduce the number of requests which hit our server. It’s literally set and forget. We set up the CloudFlare proxy in their dashboard and firewall everything else off. It doesn’t add any real complexity to our stack. And it’s cheap (even free for what we need). We can cache and serve static assets from CloudFlare from a location close to the user. There’s really a lot of traffic you can serve this way.

Reliability through simplicity

Servers these days are super reliable. I like to compare it to ETOPS for planes: with modern hardware you no longer need four clunky engines.

By design, the entire stack has very few moving parts. It’s easier to set up, simpler to maintain, and cheap. The more parts, the more can break. No 3am magic automatic migrations, noisy neighbors, billing alerts, or having to refactor your app because some inefficient query is eating up all your profits in cloud costs, none of it.

Another advantage is that a small dev VM on our laptop can mirror the entire architecture, so even if we want to make changes to the server set up (which we hardly ever do), we can test the exact thing on our laptop.

We’ve never had significant outages but of course, catastrophe can always strike, and that’s a fair point (and the cloud is not immune to these either!). You can find examples of an OVH datacenter literally catching fire. It’s very rare, but we want to be prepared!

When starting out we simply stored a backup of the entire server at a different location (again, cloud sprinkled on top). That’s OK, but in a worst case disaster we would have to spend half a day to restore things, which isn’t very calm. Because servers are so cheap, we’ve now simply added a completely identical mirror as hot standby in a datacenter in a different part of the world.

If something goes terribly wrong, we simply point the DNS to the other server. And in an even worse case we could even completely switch hosting providers and move our entire architecture somewhere else (not that we’ve ever had to deal with problems like these).

CALM

So there we have it, our CALM stack: Cloudflare (or other CDN/proxy) -> App -> Linux -> Metal.

Of course, some percentage of startups is going to have special requirements (petabytes of storage, 99.99999% uptime) and this won’t work for everyone. But when your database is only a hundred gigabytes or so and the whole thing fits into RAM, you don’t need to worry about “big data” problems. Similarly, lots of websites and apps can get away with a minute of downtime per month (if that happens at all).

It will be a long time before we outgrow this architecture, which only costs about $500/month, and that means we can worry about the product and customers instead.

[1] Earlier this week, we wrote about how You Don’t Need The Cloud.

You don’t need the cloud

Putting your web service on the cloud is the default choice nowadays. Who in their right mind still uses dedicated machines these days?

Well, us. We do! We’ve used dedicated machines for our startups since 2006. AWS back then was kind of expensive and slow, and we were broke. We used low-budget dedicated machines out of necessity. Fast forward a decade and a half and we still use dedicated machines for almost everything[1], even though the cloud has gotten a lot cheaper and cost is not an issue anymore.

The reason is simple. Dedicated machines offer unparalleled performance and are rock solid.

This is the pricing table of hetzner.com, a popular German hosting provider.

You can upgrade the NIC if you want a 10 gigabit uplink. You can set up your own private network if you want. You get a free firewall, a remote reboot console, IPv6 IPs, ECC RAM, plenty of bandwidth, separate storage for backups. Pretty much everything you can think of.

You can serve an unbelievable amount of traffic with a single dedicated machine with a few hundred gigabytes of RAM and a dozen cores running at 5GHz. Most web apps don’t even do that much. It’s JSON in and JSON out. Most traffic will hit a handful of endpoints, which makes optimization and caching easy.

You can use Cloudflare to serve all your static content quickly. If you want, you can proxy all your backend traffic through Cloudflare as well.

You can use simple linux tools for backups. We think rdiff-backup is incredible. But there are plenty of alternatives. Doesn’t even take long to set up.

What does AWS have to offer?

  • S3. Unless you store a massive amount of data you don’t need it. And if you want to get it out, be prepared to shell out for egress fees.
  • Route53. It’s fine, but Cloudflare offers DNS hosting as well.
  • Identity management. You don’t need it.
  • RDS. It’s slow and super expensive. I don’t want to worry about one-off queries or migrations. If you use only 5% of your server capacity you can afford to do inefficient things at times.
  • Billing surprises. Today I read another one of those. I don’t want to worry about this stuff.
  • Lambda? Beanstalk? Glacier? EBS? What even is all this stuff.

If you’re a startup you want to focus on your product, and all this cloud tech is just a distraction. AWS is for big companies with a lot of turnover that can’t run their own machines. Unless your startup needs so much compute or storage that running your own dedicated machines becomes unmanageable I think you should just say “no” to all of it.

[1] This blog runs on the cloud but that’s because WordPress is a security nightmare and a cloud VM is the easiest way to sandbox it. Ironic, I know.

A no-nonsense server architecture for group based SaaS

If you’ve never built a SaaS product before all the server-side stuff might seem overwhelming. How many servers do you need? Magnificent monolith or microservices? VPS, app engines, edge computing, dedicated machines? Do you need a CDN? SQL, noSQL, distributed or not, failover, multi-master or RDS? So many decisions! And if you get it wrong it’s going to cost you.

Thankfully, we’ve built SaaS products before which means we’ve got a rough idea of the kind of problems we’re likely to encounter at different levels of usage.

Our guiding principles:

Don’t overspend

The cloud is great in many ways but the virtual machines are underpowered and expensive. Cloud servers make perfect sense when you don’t want to do sysadmin work or when your load is so unpredictable you need to dynamically spin up additional servers. Neither applies to us. A single $100/month dedicated machine gets you 16 cores, 128GB of ECC RAM, and 8TB of enterprise NVMe SSD storage. That’s a lot of horsepower!

A single server like that can support more simultaneous users than we’re ever going to realistically have. We could probably get a server that’s one third as powerful and not notice the difference. We won’t need to heavily optimize our server code and we won’t have to worry about communication between microservices. A monolith is simple and that eliminates entire classes of bugs, so that’s what we’re going for.

Boring tech is better

We’ll use Debian stable. We don’t mind that all packages are 9 months behind other distributions. We don’t need the latest point release of bash or Python. We want stable packages and regular security updates. That’s what Debian offers.

Django + MySQL will be the core of our backend stack. Because our SaaS app will be group-based (meaning: no shared data between groups), scaling and caching will be very easy. Services like Twitter are complex because every action potentially affects any of the other billion users in the system. But when you make a service where all user groups are independent you don’t get this explosion in complexity. If we suddenly end up with millions of users (yea, right!) and there is no faster server available (vertical scaling) we can still scale horizontally if necessary. It’s just a matter of buying a bunch of servers and evenly dividing our users between them.
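
Because groups share nothing, the routing layer can stay trivial. A sketch of the kind of mapping that would take us from one server to many (the hostnames are made up):

# Sketch: groups share nothing, so horizontal scaling is just a stable group -> server mapping.
import hashlib

SERVERS = ['app1.example.com', 'app2.example.com', 'app3.example.com']


def server_for_group(group_id):
    # Stable hash so a group always lands on the same server.
    # (A lookup table in the database works just as well and makes moving a group explicit.)
    digest = hashlib.sha256(str(group_id).encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]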

We have used Python/Django for years so that’s what we’re going with. We could have picked golang or any other web stack. It doesn’t matter that much.

We don’t need a cache server, a message queue, or any other services at this point.

We should get 99.99% uptime with this server architecture, simply because there is so little that can break. Provided we don’t do anything dumb :). All downtime is bad, of course, but there are diminishing returns at some point. I think 1 minute of downtime per week is acceptable.

Security Paranoia

We’ll cover this in depth in future posts. Firewall, sshd policies, full disk encryption, and intrusion monitoring are core parts. The mantra is: layers of defense. If system files get changed, we want to know about it. If a program listens on an unusual port, we want to know.

We’ll do backups with rdiff-backup (local + remote). We can use it to diff what changed on a server, which means it also functions as an audit tool. What makes rdiff-backup so cool is that it computes reverse diffs, as opposed to forward diffs. This means the latest backup set consists of plain files that can easily be verified to be correct. In addition, compressed reverse increments let you go back in time for individual files/directories. You never have to worry about the archival system getting corrupted.

For full disk encryption we’re going to need to employ some tricks. We want to encrypt everything, including the OS. For a remote headless server that requires some ingenuity. More about this later.

Mailgun

Mailgun is what we know so that’s what we’ll use for our planning IDE Thymer. Mailgun has always had reliability issues though and deliverability is mediocre, even when using a dedicated IP address. We can always switch transactional email providers at some later point, so it’s not something for us to worry about now. It would just be a distraction.

Cloudflare

Our app server(s) will be in Europe (because of GDPR), but most of our users will be in the US. That’s not great from a latency point of view. It takes a couple of round trips to establish an SSL connection and if you have to cross an ocean that will make your site feel sluggish.

Cloudflare has edge locations all over the globe, and we want our app to load fast. We’ll move all static content to Cloudflare and we’ll make sure to design our app so it can load as much in parallel as possible. Cloudflare also provides us with free DDoS protection and their DNS service is great. One day Cloudflare will start charging startups like ours and we’ll gladly pay. For now, the free plan suits us just fine.


We’re bootstrapping and we’re running our startup on a shoestring budget. Everything on this list is free, except for the server(s) and those are dirt cheap. It’s almost comical how easy and cheap running online services has become. Most of the hip new tech we just don’t need. Our app will be client heavy and the server will be little more than a REST API and billing logic.