I’ve been able to successfully configure two sipsorcery servers in a redundant configuration using Microsoft’s SQL Azure service as the database. That’s good news as it means that in the near future it will be possible to switch the main sipsorcery.com service over and remove the risk of the whole service going down when a single Amazon EC2 instance fails.
By far the biggest challenge in making the sipsorcery service (and most other internet based services) reliable and scalable comes back to the database. It costs a lot of money and takes a lot of expertise to run ANY kind of RDBMS in a reliable, scalable manner. It’s easy to get a single database instance up and running but once you need to start replicating, clustering and load balancing the headaches start.
The mysipswitch service used the Blue Face Postgresql database system. That satisfied the above concerns because Blue Face invested in the necessary hardware and employed an engineer to look after it. The sipsorcery service, which commenced in July 2009, deliberately separated itself from Blue Face’s infrastructure for business, legal and other non-technical reasons and instead moved to Amazon’s EC2 cloud computing infrastructure. The sipsorcery service currently uses a single server instance which holds all the SIP application servers AND a MySQL database. That means the database is not redundant, and there was a painful incident at the start of sipsorcery’s existence where a misconfiguration (by me) resulted in all the MySQL data being lost.
MySQL was used for sipsorcery instead of Postgresql because it has better support for replication, clustering and load balancing, none of which Postgresql really supports out of the box or without jumping through a lot of hoops. As it turns out there are quite a few hoops in the MySQL case as well. The sipsorcery service requires a master-master replication strategy so that two server instances can operate independently of each other but still share data. The MySQL recommended option in that case is MySQL cluster, which needs a minimum of 6 servers! Using 6 servers for the sipsorcery database is prohibitive from a cost and admin point of view.
The next idea was to use Amazon’s SimpleDB. It’s not a relational database; it’s more like a big bucket that applications can drop small bits of data into and then request back at a later stage. It does have some rudimentary querying capability but there are big differences between it and a relational database. Since the sipsorcery database requirements are very rudimentary, the Amazon SimpleDB service was largely able to satisfy them and I got to the stage where a development sipsorcery server was operating successfully with SimpleDB as the data store. There were still a few concerns. One was how well it would operate under load, given that all communication with SimpleDB must be HTTP over SSL, which is significantly slower than the normal TCP connections used with a relational database. Another was that I was starting to rely more heavily on the querying capability of the MySQL database to shut down abusive sipsorcery accounts, and switching to SimpleDB would have meant diverting a lot of effort into constructing equivalent detection tools.
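To give a feel for the kind of ad hoc query I mean, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for MySQL. The `cdr` table, its columns, the sample rows and the threshold are all made up for illustration and are not the real sipsorcery schema; the point is only that a relational database answers this sort of question with a single GROUP BY.

```python
import sqlite3

# Hypothetical schema (NOT the real sipsorcery tables): one row per call attempt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdr (owner TEXT, destination TEXT, created INTEGER)")
conn.executemany(
    "INSERT INTO cdr (owner, destination, created) VALUES (?, ?, ?)",
    [
        ("alice", "100", 10),
        ("mallory", "900", 11),
        ("mallory", "901", 12),
        ("mallory", "902", 13),
        ("bob", "200", 14),
    ],
)


def busiest_accounts(conn, since, threshold):
    """Return (owner, call_count) rows for accounts that made at least
    `threshold` call attempts at or after the `since` timestamp."""
    return conn.execute(
        "SELECT owner, COUNT(*) AS calls FROM cdr "
        "WHERE created >= ? "
        "GROUP BY owner HAVING COUNT(*) >= ? "
        "ORDER BY calls DESC",
        (since, threshold),
    ).fetchall()


# Accounts with an unusually high number of call attempts.
suspects = busiest_accounts(conn, since=0, threshold=3)
```

With a bucket-style store the equivalent would most likely mean pulling the items back and doing the counting in application code, which is the extra tooling effort I was hoping to avoid.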
Right at that time Amazon released their Relational Database Service product and when I got the introductory email I thought it was going to be the perfect solution for sipsorcery. However, once I dug into the specifics it turned out the RDS service is no more than a MySQL server running on a single EC2 instance, with the replication, clustering and load balancing supposedly coming in the future.
A 2 day outage of the sipsorcery service prompted me to look at some other cloud solutions, and in doing so I re-visited the Microsoft Azure services. Looking more closely at the SQL Azure service I realised it was claiming to be everything the Amazon RDS service should be. The SQL Azure service is still in a testing phase but is open to developers, so I signed up for a test account and once that was enabled I was able to get a development sipsorcery server up and running with it in no time. At this point I have my fingers crossed that the SQL Azure service will work out and be as reliable and scalable as Microsoft hope, because it really does solve a lot of problems for the sipsorcery service.
Once the data storage needs had been satisfied there was still some development work to make the sipsorcery service work properly when deployed over multiple servers. The original mysipswitch service was actually a single process. At the end of 2007 the memory leaks in the Ruby dialplan processing quickly forced the separation of the different servers into their own processes. Now in 2009 the unreliability of the Amazon EC2 instances has forced the further separation into multiple server agents on different machines. For a few of the services it’s not an issue. For example, the SIP Registrar simply processes any REGISTER requests it receives and updates the database; it does not need to know or care if there are other SIP Registrars operating in parallel. The SIP Registration Agent on the other hand needs a mechanism to ensure that if there are multiple agents operating they aren’t both registering the same SIP Provider accounts.
The most difficult aspect is calls. Specifically, if a SIP account registers through the SIP Proxy on Server1 and then a call on Server2 needs to be forwarded to that account, it must go through Server1’s Proxy and not Server2’s. That’s because the end-user SIP account almost always has a NAT in front of it, and the NAT will drop any packets from a server it hasn’t already had a transmission with. The requirement then is to make Server1 and Server2 aware of each other and configure them to treat calls from each other’s Application Servers appropriately. That’s the chunk of work that I have recently completed and that is now working.
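The call routing rule can be sketched roughly as follows. This is an illustration in Python rather than the actual sipsorcery code, and every name in it (`Binding`, `BindingStore`, `route_call`, the addresses) is hypothetical. It assumes that the shared database records, for each registered account, which proxy the REGISTER arrived through, so that any Application Server can send the call back out through that same proxy.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Binding:
    contact_uri: str   # where the user agent says it can be reached
    proxy_socket: str  # proxy the REGISTER came through (the hole in the NAT)


@dataclass(frozen=True)
class Route:
    proxy_socket: str      # proxy the INVITE must be sent through
    contact_uri: str       # final destination behind the NAT
    crosses_servers: bool  # True if the proxy belongs to the other server


class BindingStore:
    """Stand-in for the shared database that both servers read and write."""

    def __init__(self) -> None:
        self._bindings: Dict[str, Binding] = {}

    def register(self, account: str, contact_uri: str, via_proxy: str) -> None:
        # Remember which proxy the registration arrived through.
        self._bindings[account] = Binding(contact_uri, via_proxy)

    def lookup(self, account: str) -> Optional[Binding]:
        return self._bindings.get(account)


def route_call(store: BindingStore, account: str, local_proxy: str) -> Optional[Route]:
    """Pick the proxy for a call to `account`, or None if it is not registered.

    The call always goes via the proxy stored with the binding, even when
    that is not the proxy on the server handling the call.
    """
    binding = store.lookup(account)
    if binding is None:
        return None
    return Route(
        proxy_socket=binding.proxy_socket,
        contact_uri=binding.contact_uri,
        crosses_servers=(binding.proxy_socket != local_proxy),
    )


# Example: alice registered via Server1's proxy, the call is handled on Server2.
store = BindingStore()
store.register("alice", "sip:alice@192.0.2.10:5060", via_proxy="10.0.0.1:5060")
route = route_call(store, "alice", local_proxy="10.0.0.2:5060")
```

In this example `route.proxy_socket` is Server1’s proxy and `route.crosses_servers` is true, which is the case where Server1 has to be configured to accept the call from Server2’s Application Server.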
I plan on doing a bit more testing as well as watching how the SQL Azure database service performs over a longer period. At the moment, if anyone is interested, for whatever obscure reason, in having a play with the test servers, they are running under the sipwizard.net domain and the two servers are on 220.127.116.11 and 18.104.22.168. I have configured SRV records for sipwizard.net, so if your SIP device supports them you can try things like blocking one or other of the IP addresses on your firewall and making sure calls still get through or your device can still register. Note that the sipwizard.net service is completely separate from sipsorcery.com and you’ll need to set up a new test account at www.sipwizard.net. Also, the servers are only for testing and WILL be taken down at some stage in the next week, as well as being subject to my own testing, so don’t rely on them being around for long.
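For anyone wondering what the firewall test exercises, here is a rough sketch of how a SIP device might fail over between SRV targets. It is a simplification written in Python: the host names, priorities and weights are made up and are not the actual sipwizard.net records, and the ordering is deterministic (lowest priority value first, then heaviest weight) whereas RFC 2782 calls for a weighted random choice among records of equal priority.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative preference among records of equal priority
    port: int
    target: str


def order_targets(records: Iterable[SrvRecord]) -> List[SrvRecord]:
    """Order records by priority, then by descending weight.

    Simplification: a real resolver picks randomly in proportion to weight
    within each priority level instead of always preferring the heaviest.
    """
    return sorted(records, key=lambda r: (r.priority, -r.weight))


def first_reachable(
    records: Iterable[SrvRecord],
    is_reachable: Callable[[SrvRecord], bool],
) -> Optional[SrvRecord]:
    """Return the first target in preference order that answers, else None."""
    for record in order_targets(records):
        if is_reachable(record):
            return record
    return None


# Hypothetical records for two redundant servers.
records = [
    SrvRecord(priority=10, weight=50, port=5060, target="sip1.example.net"),
    SrvRecord(priority=10, weight=40, port=5060, target="sip2.example.net"),
]

# Simulate blocking the first server on the firewall.
blocked = {"sip1.example.net"}
chosen = first_reachable(records, lambda r: r.target not in blocked)
```

With the first server blocked, `chosen` is the second server, which is the behaviour to look for when testing a device against the two test servers.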