Sunday, May 4, 2008

Caching is stupid

Before you freak out, let me run with this one for a bit; it's based on some very practical experience.

Caching is one of those things we instinctively turn to when we hit some kind of performance bottleneck within a web application - after all, the web is pretty much designed to be cached and has all these lovely headers and things, so it seems to make intuitive sense that if your stuff might get cached out there, you may as well cache it in here.

The problem is that caching actually implies a serious problem, one that we often simply accept without any real thought: the authoritative source of our data is too slow to deliver the information accurately as fast as it is required.

Now, sometimes this is reasonable - if you're delivering an aggregated RSS feed then fine, you need to keep a cached copy of the source data for a while because it would be rude to hammer other people's servers for every request.

Other times, however, it's actually just bad, and it's mostly the fault of the RDBMS.

We in the web community have become so used to the authoritative-database model of life that we refuse point-blank to delegate authority to any other system. After all, the RDBMS has all kinds of good things like ACID which will ensure your data is always consistent etc etc.

The problem is that data consistency has become this strangely religious holy grail. Never mind that foreign key constraints slow the living crap out of your inserts and updates, never mind the fact that in many cases using an ORM means you'd never have an invalid reference anyway, never mind that many people understand transactions in a database so poorly that they simply ascribe them magical powers of correctness they have no hope of guaranteeing, it's still "database or nothing!"

I remember the first time I got thrown out of that comfort zone. In 2000 I was in Sydney working for a great company called Iguana as their systems/Linux guy. We were struggling pretty seriously with the huge load of information we were pulling in from the NZX and ASX, and our database server was a monster. Scott Cooper - to this day my example of a brilliant programmer - was the programmer there, and simply did the obvious thing: he wrote a C daemon that stored the data in-memory, with basic checkpointing and methods for doing data catchup etc. As with everything Scott wrote, it was a work of art and brutally fast, leaving the database in the dust, relegated to dealing with historical data at its slow, databasey pace.
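To make "in-memory with basic checkpointing and catchup" a bit more concrete, here is a minimal sketch in Python. It is not Scott's design (his daemon was in C and I'm not reproducing it here); the class name, file names and keys are all made up for illustration, and "catchup" is interpreted as replaying a journal of writes made since the last snapshot.

```python
import json
import os
import tempfile


class CheckpointedStore:
    """Hypothetical in-memory store: periodic snapshot plus a journal of writes."""

    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.journal_path = os.path.join(directory, "journal.log")
        self.data = {}
        self._recover()
        self._journal = open(self.journal_path, "a", encoding="utf-8")

    def _recover(self):
        # Load the last checkpoint, if there is one.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path, encoding="utf-8") as f:
                self.data = json.load(f)
        # Catch up: replay every write journaled since that checkpoint.
        # Entries hold absolute values, so replaying one twice is harmless.
        if os.path.exists(self.journal_path):
            with open(self.journal_path, encoding="utf-8") as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def set(self, key, value):
        # Journal first, then update memory. Reads never touch the disk.
        self._journal.write(json.dumps([key, value]) + "\n")
        self._journal.flush()
        self.data[key] = value

    def checkpoint(self):
        # Write the snapshot to a temp file and swap it in atomically,
        # then start a fresh journal.
        tmp = self.snapshot_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
        os.replace(tmp, self.snapshot_path)
        self._journal.close()
        self._journal = open(self.journal_path, "w", encoding="utf-8")

    def close(self):
        self._journal.close()


# Usage: write, checkpoint, write again, then "restart" and recover.
with tempfile.TemporaryDirectory() as directory:
    store = CheckpointedStore(directory)
    store.set("ASX:AAA", 42.5)   # sample key and value, not real market data
    store.checkpoint()
    store.set("NZX:BBB", 1.25)   # only in the journal at "crash" time
    store.close()

    recovered = CheckpointedStore(directory)
    restored = dict(recovered.data)
    recovered.close()
```

The point of the shape is that every read is a dictionary lookup, and the disk is only there so a restart can rebuild memory. A real daemon would also want fsync and some policy for when to checkpoint, which this sketch skips.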

Since then I have rarely come across a situation where a database didn't do the job in an acceptable timeframe, but there have been a few. In every case, my initial instinctive solution was to install memcached and "do something with it". In many cases this was fine, but in a few situations it was a serious, serious error. Case in point, the most recent Entrecard release involved getting rid of half the caching we were doing - which was failing to provide accurate results, and generally screwing up - and replacing the data source entirely with an in-memory daemon that acted as the authority.

This daemon was written in Python using the Twisted framework - a solution that allowed me to access data without contention issues. Careful coding meant that there were basically no loops at all within the code, preventing any possibility of lockup and providing very consistent timing for individual operations. The database had grown so large (millions of rows) that providing an accurate balance for an individual user took upwards of *3 seconds* per request, and the parallel nature of the balance modifications meant that it wasn't possible to "cache" the result effectively or accurately.

The daemon, however, consistently delivered a perfectly accurate balance in-flight in under 0.001s, dramatically improving our ability to scale. The daemon also took over a number of other high-contention operations, including managing the card drops. This allowed us to return to the "glory days" of Entrecard where drops were counted immediately and the balance was always up to date, rather than the sad state of affairs we had recently where drops were processed "some time later" by a batch processing system.
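For a rough idea of why this works, here is a small sketch of the general technique. It is not the Entrecard code: it uses the standard library's asyncio as a stand-in for Twisted so that it runs anywhere, and the class, the function names and the rule that a drop credits both users by one are assumptions made purely for illustration. The balance is kept as a running total, so reading it is a single lookup rather than a sum over millions of rows, and because everything runs on one event-loop thread an update can't be interleaved with another.

```python
import asyncio
from collections import defaultdict


class BalanceAuthority:
    """Hypothetical in-memory authority: balances are running totals."""

    def __init__(self):
        self._balances = defaultdict(int)

    def apply(self, user_id, delta):
        # There is no await in here, so on a single-threaded event loop
        # this read-modify-write cannot interleave with another request.
        new_balance = self._balances[user_id] + delta
        if new_balance < 0:
            raise ValueError("insufficient credits")
        self._balances[user_id] = new_balance
        return new_balance

    def balance(self, user_id):
        # One dictionary lookup, no loop, no query.
        return self._balances.get(user_id, 0)


async def record_drop(authority, dropper, recipient, credits=1):
    # Assumed drop rule for this sketch: both users are credited at once.
    await asyncio.sleep(0)  # stand-in for network I/O before the update
    authority.apply(dropper, credits)
    authority.apply(recipient, credits)


async def main():
    authority = BalanceAuthority()
    # 1000 concurrent drops, all handled by the same thread.
    drops = (record_drop(authority, "alice", "bob") for _ in range(1000))
    await asyncio.gather(*drops)
    return authority


authority = asyncio.run(main())
print(authority.balance("alice"), authority.balance("bob"))
```

Nothing here is cached, so there is nothing to invalidate: the number in memory is the balance. Durability (checkpointing, writing history back to the database) is left out of this sketch and would have to be handled separately.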

Next time you're having performance problems and you start thinking about caching, queues, batches etc., take a second look at your architecture - is your current authoritative data source really the best solution to the problem? Perhaps rather than using a cache to wallpaper over your problem, you'd be better off creating a data source that can deliver the data at the rate you really need.
