Perils of caching

We try to design our software to eliminate entire classes of bugs. State updates happen in transactions so they’re all or nothing. We favor immutable state and INSERTs over UPDATEs. We like simple, functional (or idempotent) code that is easy to debug and refactor.

Caching, by contrast, exposes you to new classes of bugs. That’s why we’re cache-weary: we’ll cache if we have to, but frankly, we’d rather not.

Downsides of caching:

  1. Introduces bugs
  2. Stale data
  3. Performance degradation goes unnoticed
  4. Increased cognitive load
  5. Other tail risks

Caching introduces an extra layer of complexity and that means potential for bugs. If you design your cache layers carefully (more about that later) this risk is reduced, but it’s still substantial. If you use memcached you can have bugs where the service is misconfigured and listens to a pubic IP address. Or maybe it binds on both IPv4 and IPv6 IPs, but only your IPv4 traffic is protected by a firewall. Maybe the max key size of 250 bytes causes errors, or worse, fails silently and catastrophically when one user gets another user’s data from the cache when key truncation results in a collision. Race conditions are another concern. Your database, provided you use transactions everywhere, normally protects you against them. If you try to write stale data back into the database your COMMIT will fail and roll back. If you mess up and write data from your cache to your database you have no such protection. You’ll write stale data and suffer the consequences. The bug won’t show up in low load situations and you’re unlikely to notice anything is wrong until it’s too late.

There are also some more benign problems caused by cache layers. The most common one is serving stale data to users. Think of a comment section on a blog where new comments don’t show up for 30 seconds, or maybe all 8 comments show up but the text above still says “Read 7 comments”. In another app you delete a file but when you go back you still see the file in your “Recents” and “Favorites”. I guess it’s easier to make software faster when you don’t care about correctness, but the user experience suffers.

Many other types of data duplication are also caching by another name. If you serve content from a CDN you have to deal with the same issues I touched on above. Is the data in sync? What if the CDN goes down? If content changes can you invalidate the CDN hosted replica instantly?

Maybe you decide to reduce load from your primary database server by directing simple select queries to read-only mirrors. It’s also a form of caching, and unsurprisingly you have to deal with the same set of problems. A database mirror will always lag behind the primary server. That can result in stale data. What if you try to authenticate against the mirror right after changing your password?

Caching layers can also obscure performance issues. An innocent patch can make a cheap function really slow to run, but when you cache the output you won’t notice. Until you restart your service and all traffic goes to a cold cache. Unable to deal with the uncached load the system melts down.

Things that used to be simple are complicated now. That’s really what it all boils down to. Caching layers complicate the architecture of your system. They introduce new vectors for serious security vulnerabilities, data loss, and more. If you don’t design your caching abstractions carefully you have to think about the caching implications every time you make simple changes or add simple features. That’s a big price to pay when the only benefit of caching is performance.

Alternatives to caching

  1. Buy faster hardware
  2. Write smarter algorithms
  3. Simplification
  4. Approximation

Buying faster hardware is an underrated alternative. Really fast servers are not expensive anymore, and dedicated servers are much faster than anything you can get virtualized on the cloud. Dedicated servers have their own downsides, I won’t deny that. Still, performance is just much less of an issue when you have a monolithic architecture that runs on a server with 32 cores and 500gb of ram and more NVME storage than you will ever need.

Algorithmic simplifications are often low-hanging fruit. Things get slow when data is processed and copied multiple times. What does a typical request look like? Data is queried from a database, json is decoded, ORM objects are constructed, some processing happens, more data is queried, more ORM objects are constructed, new data is constructed, json is encoded and sent back over a socket. It’s easy to lose sight of how many CPU cycles get burned for incidental reasons, that have nothing to do with the actual processing you care about. Simple and straightforward code will get you far. Stack enough abstractions on top of each other and even the fastest server will crawl.

Approximation is another undervalued tool. If it’s expensive, for one reason or another, to tell the user exactly how many search results there are you can just say “hundreds of results”. If a database query is slow look for ways to split the query up into a few simple and fast queries. Or overselect and clean up the data in your server side language afterwards. If a query is slow and indices are not to blame you’re either churning through a ton of data or you make the database do something it’s bad at.

Our approach

The kind of caching we think is mostly harmless is caching that works so flawlessly you never even have to think about it. When you have an app that’s read heavy maybe you can just bust the entire cache any time a database row is inserted or updated. You can hook that into your database layer so you can’t forget it. Don’t allow cache reads and database writes to mix. If busting the entire cache on a write turns out to be too aggressive you can fine-tune in those few places where it really matters. Think of busting the entire cache on write as a whitelisting approach. You will evict good data unnecessarily, but in exchange you eliminate a class of bugs. In addition we think short duration caches are best. We still get most of the benefit and this way we won’t have to worry that we become overly reliant on our cache infrastructure.

We also make some exceptions to our cache-weary attitude. If you offer an API you pretty much have to cache aggressively. Simply because your API users are going to repeatedly request the same data again and again. Even when you provide bulk API calls and options to select related data in a single request you’ll still have to deal with API clients that take a “Getta byte, getta byte, getta byte” approach and who will fetch 1 row at a time.

As your traffic grows you’ll eventually have to relent and add some cache layers to your stack. It’s inevitable but the point where caching becomes necessary is further down the road than you’d think. Until the time comes, postpone caching. Kick that can down the road.

Look ma, no dependencies!

When building a new app or feature, it can be tempting to immediately start with “import framework” and then build from there. First search for the latest and greatest toolchain, and then get building. It’s best practice, right?

Before we do that, we always first have a look at how much time it would take to write the feature we need. Could we get away without adding another framework with all its dependencies? If the feature seems simple enough, we choose to write a version of it ourselves first.

Using a framework (and libraries to a smaller extent) is always a trade off. There are plenty of examples of both extremes of course. We’ve all seen those websites which load jQuery, Vue, React, Bootstrap and Tailwind, because the entire thing is just dependencies duct taped together. Or introducing a dependency for 11 lines of code, like left pad. On the other hand there are examples of “Not invented here” going so far that products become insecure because they ignore all the hard work done before them (e.g. don’t roll your own crypto, as the advice goes).

The benefits of frameworks are obvious. Everything it already does right you don’t need to work on. And when building an MVP (or feature in any stage of your startup for that matter) you need every minute you can save. So what are the costs we look at? The most obvious is that all those lines of code will eat RAM and CPU (and GPU these days), so if there are large parts you don’t need, that would be wasteful. But there are many other costs too:

  • Learning new concepts. Many frameworks usually lead to a rabbit hole of new concepts you need to explore, very specific to just that framework. Read the docs, explore some tutorials, find out what the “framework way of doing things” is. That includes dependencies on which the framework in turn relies on.
  • Maintenance. You need to keep up with at least security updates, so you need a system in place to learn about those and apply them in time. Ideally you could keep your code running for years and just apply some security updates. But typically the older a dependency gets, the bigger the chance is there won’t be updates for the “old” version, and you need to upgrade to a new version with lots of API changes. So even though your code itself still works perfectly, there’s some bitrot going on, and you need to keep updating your code. You might even have to start over learning new concepts for the new version. Another problem is you need to make sure those (security) updates and in turn all its dependencies can be trusted, which isn’t always easy to do.
  • Unexpected behavior & debugging. With large frameworks there’s always a risk some internal implementation “detail” isn’t really documented well, but is actually important to understand in order for your app to work right. Often you only find this out the hard way after running into a bug. Debugging large frameworks, just like any other codebases you’ve never worked on, can be really time-consuming with deep call stacks and dozens of files.
  • Unknown constraints. Another big risk is finding out half-way that the framework doesn’t really support your use case. Or it simply doesn’t scale.

For example, one of the features we’re working on for Thymer is real-time collaboration features using Websockets. Because our backend is Django (a framework which definitely saves us time!), I googled around and found out there’s a project called Django Channels, which extends Django beyond HTTP, including Websockets.

Again, it would sound tempting to simply start there, just import it and done. In this case we would already have to start out with quite a number of new concepts to explore, or new layers we have to add to the stack: ASGI and asgiref, Daphne server, Redis, interfacing with synchronous Django code. Some of those we had never heard of or worked with before. And that’s next to the API of the framework itself, and finding out whether it works for the purpose we need. I’m not at all implying Django Channels isn’t a great framework! The point is that using a framework has a cost, just like writing it yourself does, so we need to compare the two.

So let’s look at what we really need in terms of Websocket functionality. We need to accept connections, keep track of open connections, be able to query the database using Django’s ORM and finally inform open connections when a new event becomes available for them. That’s it. Looking at all this, it’s not completely obvious the integration Django Channels provides with Django is worth the costs (new concepts, add layers to the stack, maintenance, debugging, knowing we don’t run into unknown behavior).

In this case we made the decision to write a simple Websocket script ourselves. It turned out to be around 200 lines of Python3 code and it took 1 day to build.

That would certainly not have been faster when having to look into all the ins and outs of the framework, and even without a deadline of 80 days we’d want to get an MVP out as soon as possible. If we have many customers at some point and it turns out we need something more advanced, then we can still look into rebuilding it (because it’s just 200 lines after all!). And best case 10 years from now this simple script will still be running, happily serving customers, without any modifications. Which is exactly what happened to our websocket script for our first product.

Some database design strategies

Friday Wim wrote about Data and Event Handling on the client side of Thymer.com. Today is about our approach to data storage on the server side. All storage, with very few exceptions, will go directly into a SQL database. MySQL is what we know best, so that’s what we’ll use.

Almost all of our database tables will have at least these columns:

  • “id” (unique auto increment primary key),
  • “guid” (unique, user facing),
  • “group-id” (integer foreign key),
  • “created-at” (datetime),
  • “updated-at” (datetime),
  • “deleted-at” (datetime, nullable),
  • “destroyed-at” (datetime, nullable),
  • “extra_json” (plain text)

I’ll explain briefly what our reasoning is for having these columns.

GUID with unique index + auto-increment int primary key

Having auto-increment primary keys for internal use is convenient, but exposing these keys to the outside world isn’t great. It’s better to use GUIDs in the (JSON) API. Integer primary keys are considerably faster, though, so you can use use both. That’s what we’re doing. A side benefit is that you can use the GUIDs for backup and export/import purposes. You can’t easily import data with auto-increment keys, because you’ll get key collisions, but you don’t have this difficulty when you use GUIDs. They don’t collide by design. If you ever have to shard your service, or want to move data between SQL servers for another reason having GUIDs to work with will save you a great deal of time.

It also makes for much friendlier APIs. You can prefix the GUIDs with the class table it corresponds to. If the wrong kind of GUID is passed to the API, for instance “user_56b2c555-1ccf-4b9a” instead of “project_854a3923-0be9-4106” you can generate a friendly error message that the API function expected a Project instance instead of a User. If you just pass a generic integer you have to return a confusing error message about insufficient permissions because you can’t determine what the API users is trying to do.

Redundant group-id columns

Suppose your product allows groups of users to collaborate and you don’t want to accidently mix up information between groups. You don’t want to put yourself in a situation where users see other people’s data just because you forget a “WHERE” clause in an SQL statement somewhere. Mistakes are unavoidable, but some mistakes are unacceptable and we plan to design our system so the worst kind of mistakes can’t happen. If all tables have a group column (or equivalent) then you can enforce through your ORM that all queries are automatically filtered for access. You can make all database queries go through a single point that ensures a user of Group A can only see data from Group A.

A side benefit of storing redundant group information is that it makes for very easy clean up when people want to close their account. You just go through all database tables and you drop all records for that user/group. If you have to infer which data belongs to which user through foreign keys you can create a difficult situation for yourself. Especially when you need to delete a lot of data you want to be able to do so incrementally. You don’t want your service to become slow (or crash altogether) because a background task locks your database tables.

Multiple datetime columns

Created-at and updated-at speak for themselves. These are timestamps you want to display to the user in some fashion, and they’re too useful not to have. Deleted-at is for soft deletions (tombstones). Sometimes deletion is expensive and you want to just flag many items as deleted first and clean up asynchronously. When the data has actually been destroyed you can use ‘destroyed-at‘. You could delete the actual row from the database, but foreign keys could prevent that. In addition, if you want to show good error messages to users (“this item has been deleted”) it’s helps to keep the row around. Of course, you only need the metadata for this. The columns that contain actual user data you can overwrite when the user wants the data deleted.

If you provide some kind of full text search it helps to have an indexed-at column. You can use this to trigger re-indexing. You can use datetime columns like this for any kind of asynchronous update you want to run. Just select all rows where ‘indexed_at < updated_at‘ and do the necessary processing. Want to re-index a specific group? Just backdate the updated_at column and a background daemon will jump into action. Want to re-index 10 rows that didn’t get processed correctly the first time? It’s trivial. This is much less fragile than having some kind of message queue where you enqueue rows for additional work. A message queue can restart or get out of sync with the main SQL database really easily, but a simple datetime column works every time.

Append-only database tables

Dealing with immutable (meaning: unchanging) state is easy, and gives you version information for free. Being able to show users how their data has changed over time is a useful feature to add and it doesn’t cost anything. It’s just a matter of selecting the most recent row to get the most current data.

Storage is cheap and database engines can deal with large tables just fine. You can always fix performance problems when (and if!) they happen.

A surprise benefit is that caches automatically get invalidated if you use append-only strategies. You can’t have stale cashes for data that doesn’t exist yet. That’s a nice thing to get for free. (Assuming your app is read-heavy. For write-heavy apps this might actually be a disadvantage, because your cashes get invalidated prematurely.)

Multiple queries instead instead of complex queries

When you use an ORM to build your database queries you’ll get pretty naive queries by default. The ORM will happily select much more data than needed, or do thousands of small queries when one will do (see: N+1 problem). When you save objects most ORMs save the entire object back into the database, and not just the single property you’ve changed. ORMs aren’t sophisticated tools by any stretch of the imagination.

ORMs are really great at generating the most basic SQL queries, and we’ve learned that if you can rework your problem as a sequence of very basic queries you don’t end up fighting your ORM. I’ll give an example.

Suppose you want to generate a list of the 20 most popular books written by any of the top 3 most most popular authors. In Django ORM pseudocode:

popular_authors_ids = Books.sort(most_popular).limit(3).author_ids_only()
books_by_popular_authors = Books.filter(author__id__in=popular_authors_ids).sort(most_popular).limit(20)

The example is a bit contrived, but I hope it illustrates the point. By first selecting a bunch of ids that you need for the second query you end up with two very simple (and fast) database queries. Modern SQL databases are extremely powerful, and you can accomplish almost anything in a single query. But by splitting the work up in multiple queries you get three main benefits:

1) you know your ORM is going to generate reasonable SQL
2) you often get extra opportunities for caching the intermediate results
3) you can do additional filtering in Python, Ruby, NodeJS before you make another round-trip to the database to get your final results. This is often faster and easier than doing the same in SQL. Unicode handling in your database and in your target language are always going to subtly different, same with regular expressions. You’re better off using the database as dumb storage.

Extra JSON column

There are plenty of situations where you want to tag a couple of rows for whatever reason. Maybe you’re experimenting with a new feature. Maybe you need temporary storage for a migration. Maybe you need some logging for a hard to replicate bug. Being able to quickly serialize some metadata without having to add an extra column to a database is a big time-saver. You don’t even need to use the JSON serialization functionality in your database for this to be useful. Just being able to store a BLOB of data is where the real value is at.

Data & event handling in the app

As we continue working on the app, one of the decisions in building Thymer is how the data will flow through the app. Let’s look at some more technical details about all the event and data handling considerations.

Server communication

Because Thymer will work like an editor of sorts, it needs to feel like one you might have on your laptop, even though it can run in the browser. That means interactions with the app should be pretty much instant from the perspective of the user. When a user adds or removes text, we don’t have time to ask the server to do the rendering for us. Even if the server would be fast, when every millisecond to render something on the screen counts, that pesky speed of light limit alone would make a round-trip to the server too slow.

That means we need to do all the rendering in the app itself, and send data between the client and server in the background. All the rendering we do is optimistic, meaning it assumes that at some point we’ll be able to communicate it to the server, but it doesn’t matter when.

Data types

The main data type in Thymer is called a “line” for now. It’s like a line in a text document, except lines in Thymer can have all kinds of additional metadata. For example, we will support different “types” of lines, like a heading, a todo, a comment (and perhaps later on things like files or drawings or whatever else we want). Lines can also contain other information, such as references to other lines, tags, and they can be “scheduled”, have due dates and so on.

Lines can also have a parent line, so we can allow features to work on a tree structure, such as zooming in at a section of the document, or collapsing/expanding sections. A document is a list of lines in a certain order.

Next to the data itself, we store the changes on the “document”, called events. An event can be something like creating a line, editing a line, and so on. Although we store the latest version of the “document”, if you would replay all events from the beginning to the end, you would get to the same document.

A copy of all the data

In order to make the app feel instant, it’s important that we store a local copy of the “document” on the client. If something like scrolling down in your document or opening a schedule for next week would result in a delay while waiting from the server, that would not be a good editing experience. In addition, unlike text, it’s possible to link to other parts of the document, so those parts need to be loaded too.

In order to store a local copy of the document, we use the browser’s built-in database, called IndexedDB. Unfortunately, IndexedDB is rather slow, so we also keep an in-memory cache and try to do as few updates to the IndexedDB as possible.

An extra advantage of storing the document like this is that we will be able to easily make the app work offline as well, so you can keep working with Thymer while on the road (in the air).

Because almost all functionality will be client-side anyway, we could even look into something like end-to-end encryption, but we might not have time for that for the first version.

Collaboration

Another factor to consider is that we need the app to be collaborative. That means that not only should we send our own events to the server, but also listen to incoming changes the server sends us. For this part we use websockets. Whenever the user makes any changes, we’ll tell the server about it, which will then tell other clients which are online.

To sync with the server, we send our changes and request the server to send us changes from other clients. We’ll go into the exact algorithm and data types to do this in some other post, but in the end we end up with a list of “events” which we should apply to the document we have stored locally.

UI components

Another reason we have to think about “collaboration” is that even when someone uses Thymer by themselves, things still need to work if you have multiple browser tabs open. And even if that wouldn’t be necessary, then the point of the app is to have features popular in IDEs, such as split-panel views.

That means that when you make a change somewhere in the document, it’s possible that that same section of the document is also visible in another panel, and would need to be updated there as well. From the other panel’s point of view, the change might have come from anywhere, but we need to re-render the line so it’s updated. That means the events need to be handled by all components in the UI.

Combining it all

Because changes can come from multiple sources, multiple parts of the app might have to be updated because of a change, and simply rendering everything to the browser’s DOM would never be fast enough, we use a simple event system in which data always flows in the same direction.

That way, when we make a change somewhere, we don’t have to think about which of a dozen things we need to update one by one. Instead, we create an event, and send it to the root of the app, which sends it to all visible components which can then update themselves. For performance reasons, we take one shortcut in the editor itself: it will immediately respond by drawing the change, after which it will create an event which will inform other components.

As an example, when we paste “hello world” on an empty line:

  • The editor will immediately draw ‘hello world’ to the line’s DOM node
  • The editor panel will create an “update line” event, which is dispatched when the browser has time. We’ll experiment a bit with the best heuristic for performance. This could either be a timeout, or browser callbacks like requestIdleCallback. Using a timeout, we can replace multiple events happening in a short time with one single event (so we can combine individual keystrokes in quick succession to one update event)
  • When the “update line” is dispatched, the app will update the local database (in-memory, and make an asynchronous request to add the change to IndexedDB), and queue the event to be sent to the server
  • Each component in the app for which the “update line” event is relevant receives the event, updates its local state and redraws a part of the screen.
  • After a while, the event is sent to the server. Any new event which is received as a reply follows the same flow and all components will self-update.

A no-nonsense server architecture for group based SaaS

If you’re never built a SaaS product before all the server-side stuff might seem overwhelming. How many servers do you need? Magnificent monolith or microservices? VPS, app engines, end point computing, dedicated machines? Do you need a CDN? SQL, noSQL, distributed or not, failover, multi-master or RDS? So many decisions! And if you get it wrong it’s going to cost you.

Thankfully, we’ve built SaaS products before which means we’ve got a rough idea of the kind of problems we’re likely to encounter at different levels of usage.

Our guiding principles:

Don’t overspend

The cloud is great in many ways but the virtual machines are underpowered and expensive. Cloud servers make perfect sense when you don’t want to do sysadmin work or when your load is so unpredictable you need to dynamically spin up additional servers. Neither applies for us. A single $100/month dedicated machine gets you 16 cores, 128GB ECC ram, and 8TB of enterprise NVMe SSD storage. That’s a lot of horsepower!

A single server like that can support more simultaneous users than we’re ever going to realistically have. We could probably get a server that’s one third as powerful and not notice the difference. We won’t need to heavily optimize our server code and we won’t have to worry about communication between microservices. A monolith is simple and that eliminates entire classes of bugs, so that’s what we’re going for.

Boring tech is better

We’ll use Debian stable. We don’t mind that all packages are 9 months behind other distributions. We don’t need the latest point release of bash or Python. We want stable packages and regular security updates. That’s what Debian offers.

Django + MySQL is will be the core of our backend stack. Because our SaaS app will be group based (meaning: no shared data between groups) scaling and caching will be very easy. Servers like twitter are complex because every action potentially affects any of the other billion users in the system. But when you make a service where all user groups are independent you don’t get this explosion in complexity. If we suddenly end up with millions of users (yea, right!) and there is no faster server available (vertical scaling) we can still scale horizontally if necessary. It’s just a matter of buying a bunch of servers and evenly dividing our users between them.

We have used Python/Django for years so that’s what we’re going with. We could have picked golang or any other web stack. It doesn’t matter that much.

We don’t need a cache server, a message queue, or any other services at this point.

We should get 99.99% uptime with this server architecture, simply because there is so little that can break. Provided we don’t do anything dumb :). All downtime is bad, of course, but there are diminishing returns at some point. I think 1 minute of downtime per week is acceptable.

Security Paranoia

We’ll cover this in depth in future posts. Firewall, sshd policies, full disk encryption, and intrusion monitoring are core parts. The mantra is: layers of defense. If system files get changed, we want to know about it. If a program listens to an usual port, we want to know.

We’ll do backups with rdiff-backup (local + remote). We can use it to diff what changed on a server, which means it also functions as an audit tool. What makes rdiff-backup so cool is that it computes reverse diffs, as opposed to forward diffs. This means the latest backup set consists of plain files that can easily be verified to be correct. In addition there is some compressed archival that allows you to go back in time for individual files/directories. You never have to worry about the archival system getting corrupted.

For full disk encryption we’re going to need to employ some tricks. We want to encrypt everything, including the OS. For a remote headless server that requires some ingenuity. More about this later.

Mailgun

Mailgun is what we know so that’s what we’ll use for our planning IDE Thymer. Mailgun has always had reliability issues though and deliverability is mediocre, even when using a dedicated IP address. We can always switch transactional email providers at some later point, so it’s not something for us to worry about now. It would just be a distraction.

Cloudflare

Our app server(s) will be in Europe (because of GDPR), but most of our users will be in the VS. That’s not great from a latency point of view. It takes a couple of round trips to establish an SSL connection and if you have to cross an ocean that will make your site feel sluggish.

Cloudflare has many end points globally, and we want our app to load fast. We’ll move all static content to Cloudflare and we’ll make sure to design our app so it can load as much in parallel as possible. Cloudflare also provides us with free DDoS protection and their DNS service is great. One day Cloudflare will start charging startups like ours and we’ll gladly pay. For now, the free plan suits us just fine.


We’re bootstrapping and we’re running our startup on a shoestring budget. Everything on this list is free, except for the server(s) and those are dirt cheap. It’s almost comical how easy and cheap running online services has become. Most of the hip new tech we just don’t need. Our app will be client heavy and the server will be little more than a REST API and billing logic.