Substack is having replica lag issues
Why do comments you make on Substack not appear immediately? In this post I describe why Substack is having these problems, and how they can solve them.
Since I started writing on Substack, I have noticed that quite a few times the changes I made to posts did not show up immediately. The same goes for comments. You can even try it now: comment on this post and you will probably see that it takes a bit of time for your comment to show up. A friend complained about this, which prompted me to write this post.
I don’t actually know the inner workings of Substack’s system, but I can infer that this is a classic replication lag problem: a failure of read-your-writes consistency.
I actually wrote a post about this a couple of years ago, where I described the problem and how to fix it in a Rails application. If you are interested, feel free to check it out. But back to it, what is read-your-writes consistency? And how does this cause the weird UX we are seeing with Substack today?
Read-your-writes consistency is a guarantee that after you write to a database, your own subsequent reads of that data reflect that write and every write you made before it. In Substack’s case, if I comment on a post, I should see my comment immediately.
In a very simple single-instance database architecture, you probably won’t face this problem. Almost every modern database supports some form of transactions with pretty good isolation levels, and every read goes to the same instance that accepted the write. The problem tends to appear when we rearchitect our system to scale for read load. Substack is obviously a read-heavy system: many more people read posts on Substack than write them.
The most common pattern to scale a read-heavy system is to provision read replicas of your main database instance. Many companies start with a basic relational database like MySQL or Postgres (Substack uses Postgres to my understanding), which only supports writes to a single leader instance by default. As read load increases, replicas of that leader are usually the first thing people reach for. You then update your application to send writes to the leader instance and all reads to the replicas.
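To make the read/write split concrete, here is a minimal sketch in Python. The class name, the instance names, and the "starts with SELECT" check are all assumptions made for illustration; a real application would rely on its framework's or driver's routing support, and this is not a description of what Substack does.

```python
import itertools


class ReadWriteSplitter:
    """Hypothetical router: writes go to the leader, reads rotate over replicas."""

    def __init__(self, leader, replicas):
        self.leader = leader
        # Round-robin over the replicas so read load is spread evenly.
        self._replicas = itertools.cycle(replicas)

    def target_for(self, sql):
        # Simplistic classification, assumed for this sketch only:
        # anything that starts with SELECT is treated as a read.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._replicas) if is_read else self.leader


splitter = ReadWriteSplitter("leader", ["replica-1", "replica-2"])
write_target = splitter.target_for("INSERT INTO comments (body) VALUES ('hi')")
read_targets = [splitter.target_for("SELECT * FROM comments") for _ in range(3)]
```

Notice that nothing in this router knows whether a replica has caught up with the leader. That gap is where the trouble starts.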
Another common pattern is to use some sort of caching, where reads go to a cache before they reach the database. The same principle applies here: you have more machines that your app can read from instead of reading from the source database machine directly.
So where does the problem occur? It occurs when there is replication lag: the time it takes for changes to propagate from the leader instance to all the read replicas. When you comment on a Substack post, the app submits a write to the leader instance. When your page refreshes, it reads from a read replica that may not have your comment yet. This is why we see this confusing UX, and why it usually resolves itself after we refresh the page a few times.
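The sequence above can be reproduced with a toy model. This is only a sketch under stated assumptions: `Leader`, `Replica`, and `catch_up` are hypothetical names, and replication is modeled as a replica replaying the leader's log whenever `catch_up` is called, which stands in for the asynchronous replication a real database performs.

```python
class Leader:
    """Toy leader: accepts writes and keeps them in an ordered log."""

    def __init__(self):
        self.log = []  # list of (post_id, comment) in write order

    def write(self, post_id, comment):
        self.log.append((post_id, comment))


class Replica:
    """Toy replica: only sees writes after it has replayed the leader's log."""

    def __init__(self, leader):
        self.leader = leader
        self.applied = 0  # number of log entries applied so far
        self.comments = {}

    def catch_up(self):
        # Replay every log entry this replica has not applied yet.
        for post_id, comment in self.leader.log[self.applied:]:
            self.comments.setdefault(post_id, []).append(comment)
        self.applied = len(self.leader.log)

    def read(self, post_id):
        return list(self.comments.get(post_id, []))


leader = Leader()
replica = Replica(leader)

leader.write("post-1", "Great post!")
stale = replica.read("post-1")  # the replica has not replayed the write yet
replica.catch_up()              # replication finally delivers the write
fresh = replica.read("post-1")
```

Here `stale` is an empty list even though the write already succeeded on the leader, and `fresh` contains the comment. The window between those two reads is the replication lag.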
How do we fix it?
It will really depend on Substack’s architecture, but a few solutions come to mind. The easiest is to force all of a user's reads that occur within a certain time window after that user's write to go to the leader instance. If your app reads from a read replica or a cache by default, make it read from the leader database right after a write. The window needs to be longer than your typical replication lag. If your system serves a much larger scale of reads, though, this strategy might not be feasible, because the leader ends up absorbing read traffic again.
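A minimal sketch of this "pin to the leader after a write" idea is below. The class name, the per-user bookkeeping, and the 5 second window are assumptions for illustration, not anything Substack is known to use. The clock is injectable so the example can be run without actually waiting.

```python
import time


class PinningRouter:
    """Hypothetical router that sends a user's reads to the leader shortly after they write."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # assumed to exceed typical replication lag
        self.clock = clock
        self.last_write_at = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id):
        self.last_write_at[user_id] = self.clock()

    def choose_for_read(self, user_id):
        last = self.last_write_at.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "leader"  # recent writer: a replica may not have the write yet
        return "replica"


# Fake clock so the example is deterministic.
now = [100.0]
router = PinningRouter(pin_seconds=5.0, clock=lambda: now[0])

before = router.choose_for_read("alice")  # no writes yet
router.record_write("alice")
during = router.choose_for_read("alice")  # inside the pin window
other = router.choose_for_read("bob")     # other users are unaffected
now[0] += 6.0
after = router.choose_for_read("alice")   # window has expired
```

Only the user who just wrote is pinned, so everyone else keeps reading from replicas. In a web app the "last write" timestamp would typically live in the session or a cookie rather than in process memory.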
Another way is to rearchitect your application to use a horizontally scalable database. Instead of routing all post-write reads to a single leader instance, you spread the data across multiple instances that each accept writes, and route each read to the instance that stores that specific piece of data. The reads are then distributed across many machines instead of being forced onto one. Obviously, this poses its own challenges, and you probably want a replication strategy here as well. But being able to scale horizontally does give you room to serve many more concurrent users. This is mostly a solved problem: many horizontally scalable databases offer a quorum-based approach, where a write must be acknowledged by W of the N copies and a read consults R of them. As long as R + W > N, every read touches at least one copy that has the latest write, which prevents this problem.
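The quorum idea can be illustrated with a small sketch. Everything here is a simplification assumed for the example: `Node`, `quorum_write`, and `quorum_read` are hypothetical names, versions are plain integers, and the read deliberately contacts the nodes least likely to have the write so the worst case is visible. Real databases add details such as read repair and conflict resolution that are left out.

```python
class Node:
    """Toy storage node holding one versioned value."""

    def __init__(self):
        self.version = 0
        self.value = None


def quorum_write(nodes, w, value, version):
    # Only the first w nodes acknowledge the write; the others lag behind.
    for node in nodes[:w]:
        node.version = version
        node.value = value


def quorum_read(nodes, r):
    # Contact the last r nodes (worst case for overlapping the write set)
    # and return the value with the highest version among them.
    contacted = nodes[-r:]
    newest = max(contacted, key=lambda node: node.version)
    return newest.value


nodes = [Node() for _ in range(3)]  # N = 3
quorum_write(nodes, w=2, value="Great post!", version=1)

with_quorum = quorum_read(nodes, r=2)     # R + W = 4 > N, sets must overlap
without_quorum = quorum_read(nodes, r=1)  # R + W = 3, overlap not guaranteed
```

With R = 2 the read set always includes a node that took the write, so the comment comes back. With R = 1 the read can land on the one node that missed it and return nothing.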
If the problem comes from caching, we need to inspect the semantics of the cache. Is it a write-through cache or a read-through cache? What are we saving in the cache? If the comments on a post are already cached when a new comment comes in, does the new comment get added to the existing entry, or does it create a new entry? All of these questions matter for architecting the correct solution.
Obviously, I have no insight into Substack’s actual architecture, so who knows if the above solutions will work. But the same principle applies no matter the architecture: you want to read from a machine that has the changes you just wrote.
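As one possible answer to those questions, here is a sketch of a write-through cache in which a new comment is appended to the existing cached entry for its post. The class names and the in-memory `CommentStore` standing in for the database are assumptions for illustration only.

```python
class CommentStore:
    """Stand-in for the source database."""

    def __init__(self):
        self.rows = {}

    def insert(self, post_id, comment):
        self.rows.setdefault(post_id, []).append(comment)

    def select(self, post_id):
        return list(self.rows.get(post_id, []))


class WriteThroughCommentCache:
    """Hypothetical cache keyed by post; writes update the store and the cache together."""

    def __init__(self, store):
        self.store = store
        self.entries = {}  # post_id -> cached list of comments

    def add_comment(self, post_id, comment):
        self.store.insert(post_id, comment)  # the source of truth is written first
        if post_id in self.entries:
            # Write-through: extend the existing entry so readers see the new comment.
            self.entries[post_id].append(comment)

    def get_comments(self, post_id):
        if post_id not in self.entries:
            # Cache miss: load the full list from the store.
            self.entries[post_id] = self.store.select(post_id)
        return list(self.entries[post_id])


store = CommentStore()
cache = WriteThroughCommentCache(store)

cache.add_comment("post-1", "First!")
first_read = cache.get_comments("post-1")   # miss, loaded from the store
cache.add_comment("post-1", "Second!")
second_read = cache.get_comments("post-1")  # hit, already includes the new comment
```

If `add_comment` skipped the cache update, `second_read` would keep returning only the first comment until the entry expired, which is the same symptom as replica lag with a different cause.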