I’ve been working with telco software for the last 7 years and far and away the most painful part on the server side is getting a reliable database setup. (NAT would take the cake for the most painful thing overall.) This post was inspired by the sipsorcery MySQL service crashing yet again.
I have the service configured to automatically restart so the crash would have hardly been noticeable.
In the last couple of months the previously rock solid MySQL instance on the sipsorcery server has become increasingly flaky and the process is now crashing on average once a week. From the error logs the suspect is something to do with MySQL’s SSL handling, and I did find an issue on the MySQL bug list that “may” be the cause. I need to schedule time to upgrade to the latest MySQL version but that’s not a trivial matter and will probably mean between 30 and 60 minutes of downtime.
The recent problems I’ve had with MySQL highlight a recurring theme I’ve experienced throughout my time working with VoIP/telecommunications software: building and maintaining a database with five nines reliability is either very expensive or very tough.
The database I used for my first VoIP platform was Postgresql. All in all it was pretty reliable but issues did crop up. In one case the data files got corrupted due to a Postgresql bug handling Unicode on Windows. After recovering from that we moved to a Linux platform and suffered a couple of outages due to hardware and operating system issues. That led us to bite the bullet and spend a lot of money (for a small company) on a Solaris SunFire server and storage array. That ended up being disastrous. First a firmware fault in the Fibre Channel controller on the storage array caused some major outages and lots of messing around to replace it. After sorting all that out there were a few kernel panic incidents that I can’t recall the cause of. The problem with Postgresql at the time was that it didn’t have a replication/failover solution. There were side projects which we tested out but they were all immature and some introduced prohibitive performance penalties. We ended up using archive log shipping, where the transaction logs from our main server were copied over to a standby Linux Postgresql server. Unfortunately that caused a couple of outages as well: the disk on the Linux server filled up, the Linux Postgresql instance stopped applying the transaction logs, and that caused the primary Postgresql instance on the Solaris server to shut down to preserve the integrity of the data. Eventually things settled down but it was serious enough at the time that it was considered a threat to the survival of the business.
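As an illustration of the kind of guard that might have caught the full disk on the standby before it took the primary down, here is a minimal Python sketch. It is not what we ran at the time; the function names and the 20% free-space threshold are assumptions made up for the example, and a real log shipping setup would need to hook a check like this into whatever copies the transaction logs across.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report zero size; treat that as no free space.
        return 0.0
    return usage.free / usage.total


def has_headroom(path: str, min_free_fraction: float = 0.2) -> bool:
    """True if the filesystem holding path has at least min_free_fraction free.

    The 0.2 default is an arbitrary illustrative threshold, not a recommendation.
    """
    return free_fraction(path) >= min_free_fraction


if __name__ == "__main__":
    # "." stands in for the (hypothetical) archive directory on the standby server.
    if not has_headroom("."):
        print("WARNING: archive disk is low on space, log shipping is at risk")
```

Run from cron or a monitoring agent, a check like this would raise an alert while there is still time to clear space, rather than after the standby has stopped applying logs.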
The Solaris experience caused us to appreciate even more how important a reliable database was, so we had a chat with Oracle. They promised a multi-server, real-time failover system but the price was exorbitant and way out of our league. We were desperate enough to consider it at the time but in the end common sense prevailed and we soldiered on with Postgresql.
With mysipswitch and sipsorcery there was a chance to try and find a better solution in a less demanding environment. By that I mean if something went wrong it was an inconvenience to people but they were warned in advance that the system was experimental and came with no guarantees. The mysipswitch service actually spent its whole life using the same Postgresql database I mentioned above and it was only when the service morphed into sipsorcery that a different database approach was attempted. The sipsorcery service was initially deployed for a very unhappy year on Amazon’s EC2 infrastructure. At first it was a single server deployment using a local MySQL database. When the EC2 instance started going down every second day the deployment model changed to two servers with an SQL Azure database. I actually thought at the time SQL Azure was finally the perfect solution. It was cheap at $10 a month and all the hard things about running a database were taken care of by Microsoft. However there were a few glitches that caused 5 minutes of downtime here and there, and at the time I put it down to the fact that the service was brand new; this was in Jan 2010 when SQL Azure had only just been opened for service. The small outages were bearable but the real problem came when I finally had to give up on EC2 and move to a dedicated server. At that point the connection requests that were previously spread over two EC2 instances all started arriving at the SQL Azure database from a single dedicated server. It wasn’t a huge number, between 20 and 30, but it was enough to cause the SQL Azure Denial of Service software (called DOSGuard apparently) to drop all connections for up to an hour. SQL Azure support were happy to send me emails back and forth for nearly 3 months about the issue but in the end it was something they couldn’t or wouldn’t fix.
In my opinion it’s a strange limitation and one that probably stems from the fact that SQL Azure is mainly pitched as being a solution for web applications. Apparently the sipsorcery software was getting flagged because the connections were coming from seven different processes.
So after SQL Azure it’s back to a local MySQL instance on the dedicated sipsorcery server. It’s almost tempting to switch it to Postgresql to complete the circle.
At one point I did look at NoSQL options like Amazon’s SimpleDB but the latency was a killer with it taking almost a second for even the simplest queries. I also checked out some of the other similar offerings but they all appeared geared up for web applications where response times of up to 500ms aren’t a problem. For sipsorcery the response times need to be well under 100ms.
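To make the latency requirement concrete, here is a rough Python sketch of how a candidate data store could be timed against a 100ms budget. The `query` callable is a hypothetical stand-in for whatever client call is being evaluated, and judging by the worst observed time is just one simple choice; a percentile would be a reasonable alternative.

```python
import time
from typing import Callable, List


def measure_latencies_ms(query: Callable[[], object], runs: int = 20) -> List[float]:
    """Call query() `runs` times and return each call's duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        query()  # hypothetical: e.g. a single lookup against the data store
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def within_budget(samples_ms: List[float], budget_ms: float = 100.0) -> bool:
    """True if there is at least one sample and the slowest one is within budget."""
    return bool(samples_ms) and max(samples_ms) <= budget_ms


if __name__ == "__main__":
    # A no-op stands in for a real query so the sketch runs on its own.
    samples = measure_latencies_ms(lambda: None, runs=10)
    print("slowest call: %.3f ms, within budget: %s" % (max(samples), within_budget(samples)))
```

Pointing `query` at a real lookup from the same network location as the SIP servers would give numbers comparable to the ones quoted above.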
I also know MySQL has replication and load balancing options and I have explored them. The problem is that getting automatic failover needs something like 6 servers. Without automatic failover there’s not a lot of benefit for sipsorcery in replicating data to a standby node. That means Amazon’s RDS service is also not a great option.
I’d love to hear if anyone knows of any other type of service out there that might be worth looking into?