Wednesday, December 12, 2007

What is a futex and why does Phi hate it?

A futex (short for "fast userspace mutex", a Linux kernel feature) is a mechanism for dealing with resource contention and synchronisation between (semi)independent threads or processes. It is used to build locks and various other tools for managing threads.

What? Didn't understand a word of that? No big deal. It's just a coding concept that means hundreds of different user requests can all access the database at the same time without getting in each other's way.

I hate it, because right now, I'm having real trouble with one on Entrecard. The problem is simply a situation called a "deadlock". Imagine the following scenario (simplification):

Piece of code A uses Lock 1, then Lock 2
Piece of code B uses Lock 2, then Lock 1

These will always work, except when A and B run at almost exactly the same time, so that each one grabs its first lock before the other has asked for its second. When this occurs, code A gets Lock 1 but can't get Lock 2, and code B gets Lock 2 but can't get Lock 1. As a result both of them sit there waiting for the other, and neither of them will release the lock it already holds. Immediate result: horrible nasty application hang.
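Here is a minimal sketch of that scenario in Python, assuming plain in-process threading locks rather than anything Entrecard actually uses. The names demo_deadlock, worker, lock1 and lock2 are made up for illustration. A barrier forces the unlucky timing on every run, and a timeout stands in for the hang so the demo finishes instead of freezing.

```python
import threading


def demo_deadlock(timeout=0.5):
    """Run code A and code B so that each holds one lock and wants the other."""
    lock1 = threading.Lock()
    lock2 = threading.Lock()
    # The barrier is only here to force the unlucky timing on every run.
    barrier = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            barrier.wait()  # both threads now hold their first lock
            # Without the timeout this call would block forever (the hang).
            ok = second.acquire(timeout=timeout)
            if ok:
                second.release()
            got_second[name] = ok
            barrier.wait()  # keep holding the first lock until both have tried

    a = threading.Thread(target=worker, args=("A", lock1, lock2))
    b = threading.Thread(target=worker, args=("B", lock2, lock1))
    a.start()
    b.start()
    a.join()
    b.join()
    return got_second


if __name__ == "__main__":
    print(demo_deadlock())
```

Both entries in the returned dictionary come back False: neither piece of code ever gets its second lock. Take the timeout away and that is the application hang.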

The evil bit is that you'll never notice this scenario when you're developing your application, because a single user almost never triggers this kind of event. Nor are you likely to notice it when your application has a low volume of users, for the same reason. It appears out of nowhere just when you have a bunch of users, and just when you really really don't want things like horrible nasty application hangs.

I have one of these somewhere. The problem is not fixing it, the problem is finding it. Entrecard is a lot of code, with a very high level of abstraction, and the nature of a deadlock makes it very difficult to debug after it has happened in a complex environment. Worse, it happens totally randomly, so I'm unable to narrow down where in the application it is happening. I'm confident I'll get it but HOLY CRAP is it frustrating, especially when I have to drag myself out of bed at 3am because it hung the app server and I have to restart it.
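One thing that can help with the finding part, at least when the stuck code is Python threads inside a single process, is to dump what every thread is doing at the moment of the hang. The sketch below is an illustration under that assumption, not a description of how Entrecard is debugged. dump_thread_stacks and stuck_worker are hypothetical names, and sys._current_frames() is a CPython-specific call.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a text report of where every live thread is currently executing."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (id %s):" % (names.get(ident, "unknown"), ident))
        lines.append("".join(traceback.format_stack(frame)).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    gate = threading.Lock()
    gate.acquire()  # hold the lock so the worker below gets stuck on it
    started = threading.Event()

    def stuck_worker():
        started.set()
        with gate:  # blocks here until the main thread releases the lock
            pass

    t = threading.Thread(target=stuck_worker, name="stuck")
    t.start()
    started.wait()
    print(dump_thread_stacks())  # the report shows the thread inside stuck_worker
    gate.release()
    t.join()
```

In a real server the report would need a trigger that still works while the request threads are hung, such as a signal handler or a watchdog thread. The standard library's faulthandler module offers a similar dump out of the box.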


5 Comments:

Blogger Patrice said...

I think it's good that you are open about the technical problems you currently have with Entrecard.com!

And, listen... your futex issue is FANTASTIC GOOD NEWS ;-) Why do you have such issues? Because you have f... huge traffic to your site! Believe me, I do not have ANY process synchronisation issues, but I would like to have some ;-)

/Patrice - http://wavumi.blogspot.com/

December 12, 2007 5:22:00 AM NZDT  
Blogger Tyler Mulligan said...

Oh man, something I never really thought about. Getting out of bed at 3am makes it all the more important to notice these errors and have a methodology to troubleshoot the situation as efficiently as possible. What do you plan on doing to find the deadlock?

December 12, 2007 11:13:00 AM NZDT  
Anonymous Shonzilla said...

Haha... despite my computer science and engineering background I initially thought futex was a pun on mutex when things are f*ed up regardless. Wikipedia got me up to speed.

Anyway, I am now curious for the first time - what programming language are you using on top of Amazon S3?

Perhaps there's a code profiling/analyzing tool for that given language that might help you out.

Cheers!
Shonzilla

December 14, 2007 2:16:00 PM NZDT  
Blogger ArahMan7 said...

I'm sure you'll find that a-ha moment.

Keep up the good work, PhiRatE. I wish I could be a great coder like you. My latest post deals with a very simple script. I'm sure you're gonna laugh when you see how simple it is.

Greetings and lotta loves from Malaysia.

~ ArahMan7

December 17, 2007 9:07:00 PM NZDT  
Blogger PhiRatE said...

@tyler I plan to find the deadlock... with difficulty :) Basically it involves two stages. One stage, already complete, is to mitigate the problem to reduce operational impact. We now have a monitoring script running, and a few other things turned off, to reduce the impact. The second stage involves fixing the problem. This essentially means going through every sequenced database interaction and ensuring it can't possibly create a deadlock condition. A fair bit of work but doable over time.
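The usual way to make a sequence "unable to deadlock" is to have every piece of code take its locks in the same order. Below is a minimal sketch of that idea using in-process Python locks as a stand-in. The real work described above is on database interactions, where the equivalent is always touching tables or rows in one agreed order. acquire_in_order and run_opposite_orders are hypothetical names, and sorting by id() is just one convenient way to get a consistent order.

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire every lock in one global order (here: by id), release in reverse."""
    acquired = []
    try:
        for lock in sorted(locks, key=id):
            lock.acquire()
            acquired.append(lock)
        yield
    finally:
        for lock in reversed(acquired):
            lock.release()


def run_opposite_orders(rounds=1000):
    """Two workers name the same locks in opposite order, yet never deadlock."""
    lock1 = threading.Lock()
    lock2 = threading.Lock()
    counter = {"value": 0}

    def worker(first, second):
        for _ in range(rounds):
            # Callers may name the locks in any order; the helper sorts them.
            with acquire_in_order(first, second):
                counter["value"] += 1

    a = threading.Thread(target=worker, args=(lock1, lock2))
    b = threading.Thread(target=worker, args=(lock2, lock1))
    a.start()
    b.start()
    a.join()
    b.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_opposite_orders())
```

Because both workers end up taking the locks in the same order, neither can hold one lock while waiting on the other, so every round completes and the counter reaches twice the number of rounds.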

@shonzilla We don't actually use a language on top of S3 - you can't, it's static hosting only. Instead S3 serves HTML and JavaScript, which in combination with some cookie stuff lets us offload the hard part of widget display. The main Entrecard site uses an application platform called TurboGears, written in Python.

@arahman Coding has its ups and downs. The big up is creating something people love, Entrecard has been fantastic in this regard. The downside is the frustration when you just can't figure out a problem. Both the ups and downs are present no matter what level of coder you are :)

December 17, 2007 11:26:00 PM NZDT  
