It was brought to my attention that towerwars.info didn’t update correctly this weekend. When I looked at the site, I noticed there was nothing new since 5th May. That’s a day and a half of quiet, and knowing the Anarchy Online community, that only happens when they’re physically prevented from waging war. Something was wrong.
I logged onto the bot shell VM to check whether the bots had crashed or been unable to reconnect for some reason. They were okay. Then I logged into AO to make sure the bots weren’t just unaware of being disconnected; all were connected and responding.
Okay, so far so good. What could possibly be wrong?
I think I need to explain how data is transferred from the tower battle log messages to the site for this to make any sense. Basically, there are some bots connected to both dimensions, logging all the tower war messages. The bots parse these messages and update a local database with what they find. They also log the raw text to a file, in case the parsing fails or the bot is unable to write to the database for one reason or another. This allows for manual recovery if necessary. Keeping the bots’ databases separate from the site’s also makes it easier to change the site and its database schema without having to worry about breaking the tracker bots. The site’s new database is synced with the bots’ legacy databases every 15 seconds. This means a message should be available to the website within 15 seconds of it being logged, usually less.
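To make the sync step a bit more concrete, here is a minimal sketch of what such a job could look like. It is not the actual script: it assumes SQLite on both ends, and the table names (`tower_log`, `battles`), the column names and the `source` label are all made up for illustration.

```python
import sqlite3


def sync_new_rows(bot_db, site_db, source):
    """Copy rows the site has not seen yet from one bot database.

    Assumed (hypothetical) schema:
      bot_db:  tower_log(id INTEGER PRIMARY KEY, logged_at TEXT, message TEXT)
      site_db: battles(source TEXT, source_id INTEGER, logged_at TEXT, message TEXT)
    """
    # Highest bot-side row id already copied for this source (0 if none yet).
    last_id = site_db.execute(
        "SELECT COALESCE(MAX(source_id), 0) FROM battles WHERE source = ?",
        (source,),
    ).fetchone()[0]

    # Everything the bot has logged since then, oldest first.
    new_rows = bot_db.execute(
        "SELECT id, logged_at, message FROM tower_log WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()

    site_db.executemany(
        "INSERT INTO battles (source, source_id, logged_at, message) "
        "VALUES (?, ?, ?, ?)",
        [(source, row_id, logged_at, message)
         for row_id, logged_at, message in new_rows],
    )
    site_db.commit()
    return len(new_rows)
```

Called once per bot database every 15 seconds (from cron or a simple sleep loop), something like this only ever copies what is missing. If the job stops for a day and a half, the first run afterwards simply picks up the whole backlog.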
So the bots were okay; time to proceed further down the chain to find what’s wrong!
I checked the latest entries in the local databases, and these were from a few hours earlier, over a day more recent than the latest record displayed on the website. Okay, this was good news: the data was there, with no glaring two-day hole in the logged history. Good!
So I figured the sync job had stopped working for one reason or another. I went to log in as the user that runs the script, only to find the server stopped responding after I entered the username. What’s going on? The website was responding, but the server wouldn’t let me log on remotely. One of the perks of running that website in a virtual machine is that I can get console access just as easily as remote access, simply by using the VMware vSphere Client. Opening console… and… not responding. This was a true W.T.F. moment: how can the website respond nicely, the SSH server respond (but not be able to spawn the login process), and the console not respond at all?
The problem seems to be a bug in the FreeBSD 9.0 ULE kernel scheduler when running under VMware: under certain conditions, it will stop assigning CPU time to some processes. The fix was to recompile the kernel using the legacy 4BSD kernel scheduler. I’ve encountered this problem once before, so luckily it was something I had fresh in mind. I had thought it was a pretty rare bug though, so I hadn’t applied the fix to other VMs. Now, however, my “Setup FreeBSD virtual machines” procedure has one additional step: compile a custom kernel with the 4BSD scheduler.
I ran the sync script after updating the kernel, and everything was A-OK.
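For reference, switching schedulers means replacing `options SCHED_ULE` with `options SCHED_4BSD` in the kernel configuration before rebuilding. To find out which VMs still need that, a small check along these lines could help. It is only a sketch: it assumes the `kern.sched.name` sysctl reports the running scheduler as `ULE` or `4BSD`, and the function names are my own.

```python
import subprocess


def current_scheduler():
    """Ask the running kernel which scheduler it uses (run this on the VM)."""
    result = subprocess.run(
        ["sysctl", "-n", "kern.sched.name"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def needs_4bsd_rebuild(scheduler_name):
    """True if the VM is not yet running the 4BSD scheduler."""
    return scheduler_name.strip().upper() != "4BSD"


if __name__ == "__main__":
    name = current_scheduler()
    if needs_4bsd_rebuild(name):
        print("Scheduler is %s: rebuild the kernel with SCHED_4BSD" % name)
    else:
        print("Scheduler is 4BSD: nothing to do")
```

Running it on each VM gives a quick list of the ones that still have the default scheduler.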