One of the questions the sipsorcery project has been aimed at answering is whether it’s possible to run a telco in the cloud. By cloud I mean a public platform offering on-demand resources at an operating system or database level. In July 2009 the sipsorcery service moved from a single server hosted in private infrastructure to a single server in the Amazon EC2 cloud, and from that point on it could be said to have become a cloud hosted telecoms service. For a service like sipsorcery the advantages of operating in the cloud were envisaged primarily as cost, scalability and reliability.
Within about 6 weeks of migrating to the EC2 cloud, and after sorting out a few software issues within the sipsorcery code, the Amazon EC2 failures started and they have continued ever since. The symptom was that the underlying Windows virtual server hosting the sipsorcery software just dropped off the network and stopped responding to any type of network traffic: pings, RDP, HTTP, SIP, everything. Initially I suspected a problem in the sipsorcery software, but after a lot of effort over 3 or 4 months, and from keeping an eye on the EC2 forums for people having similar experiences, I became suspicious that it was more than likely an issue with Amazon’s infrastructure at the network, firewall or host level. I’m now 99% sure the problem is at the host level and that the network driver in the Xen virtualisation software is failing under certain conditions. The strongest evidence for this is that when a sipsorcery instance drops off the network it will miraculously start working again around 3 hours later, presumably after the underlying Xen software clears out its network connections cache or something similar.
Since 2006 I have invested a lot of time and effort in working with Amazon’s EC2 and S3 products, and when the sipsorcery EC2 failures started causing a lot of pain I rationalised that an alternative cloud service provider would most likely have their own issues and it was better to stay with the devil you know. To that end I signed up to EC2’s premium support (at USD $100 per month) so I could log a ticket to resolve the issue. Unfortunately that proved fruitless. The advice I received was to try the latest Windows image provided by Amazon and see if that helped, which was reasonable advice, but the updated images exhibited exactly the same issue. I cancelled the premium support and resorted to vociferous whinging on the EC2 forum. After a month and a half of that approach I eventually got an email from an Amazon engineer who did look into it for me and who, while not so much acknowledging that the issue was Xen related, suggested I try a Windows image backed by Elastic Block Store (EBS) storage because it had some additional updated Xen drivers. So with high hopes I toddled off again and built up the sipsorcery image, which I had down to a fine art by now. It took a few weeks for the EBS backed instance to fail and I’d started to believe it had solved the problem, but eventually it did fail and the Amazon engineer didn’t have any further ideas.
So back to the 3 advantages of operating in the cloud. The reliability of the service certainly diminished a lot after moving from a dedicated server, but what about cost and scalability? To overcome the repeated failures of the sipsorcery EC2 instance a second failover instance was brought in, meaning there were now two sipsorcery EC2 instances required. In addition both instances needed to be upgraded from small to medium to cope with the increase in sipsorcery users. And while troubleshooting the failures it was often necessary to leave one or two failed instances running in case Amazon ever wanted to check them. Suffice to say the average monthly cost for the sipsorcery EC2 servers is 5 or 6 times what had been originally forecast and is 2 or 3 times more than a dedicated server with equivalent specifications.
That leaves scalability. The first scalability problem for sipsorcery was the database. The initial single EC2 instance deployment used a MySQL database sitting on the same instance, and the very first instance failure, combined with a misconfiguration of the MySQL database by me, resulted in the very first sipsorcery users losing their data. To scale the database a better deployment model was needed. About that time two new products came out: one was Amazon’s Relational Database Service (RDS), the other was Microsoft’s SQL Azure. Amazon’s RDS is based on MySQL and since it’s probably in the same data center as the EC2 infrastructure my preference was to use it. However it didn’t take long to realise it was a poor solution. There was no replication and no clustering; the product was simply a single MySQL server sitting on its own instance, which was not much different from what sipsorcery already had. No thanks.
Microsoft’s SQL Azure was unsurprisingly based on Microsoft’s SQL Server and was about 100ms of network away from the sipsorcery EC2 servers, but it was still compelling because it claimed to solve two of the most difficult database problems: handling data replication for fault tolerance, and having a deployment model that allowed it to scale demand across the SQL Azure cloud. But it was a new product, and if the EC2 experience was anything to go by, was it worth the risk? In December of 2009 I did move the sipsorcery data to SQL Azure, and while there have been a few issues that have caused outages to the sipsorcery service, in general it has worked extremely well. Even more importantly, on one of the issues that I experienced and posted about on the SQL Azure forums I got an email from the Microsoft SQL Azure product manager, who got his engineers involved to identify the root cause. And in this case, unlike with Amazon, the Microsoft engineers were able to provide a good explanation and, more importantly, that particular issue hasn’t re-occurred.
Starting from November 2009 I have kept track of all the sipsorcery failures related to the “clouds” and while I won’t list them all a summary is interesting.
- SQL Azure: between 16 Dec 2009 and 25 Mar 2010 there were 22 detected outages totalling 26.9 minutes, with the longest being 2 minutes and 43s,
- On the 28th Mar 2010 an approximately 3 minute outage occurred,
- On the 9th of April 2010 an approximately 3 hour outage occurred, however it is not clear whether this was caused by a network issue or an SQL Azure issue. SQL Azure engineers investigated and found no evidence of a problem, and during the outage I was able to connect to the database from outside the EC2 network.
- EC2: between 1 November 2009 and 13 Apr 2010 there were 28 detected outages totalling over 32 hours, with the longest being 8.5 hours,
- EC2 outages to date have not hit both servers (sip1 and sip2) at the same time, so the outage times apply only to a single server instance and not to the overall service,
- EC2 outages require a server instance to be manually rebooted, which takes a minimum of 15 minutes. I am notified of outages via SMS and email within 30s, but depending on my circumstances remedial action can take anywhere from an instant to 8 hours; on average it’s generally less than 30 minutes.
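To put those totals in perspective they can be turned into rough availability percentages. The Python sketch below is only back-of-the-envelope arithmetic on the figures in the list. It assumes the 26.9 minute total belongs to SQL Azure and the 32 hour total to EC2, it treats "over 32 hours" as exactly 32 (so the EC2 result is an upper bound), and it leaves out the 28 March and 9 April incidents. The function name `availability_percent` is my own and not part of sipsorcery. On those assumptions the figures come out at about 99.98% for SQL Azure and about 99.2% for EC2 instances.

```python
from datetime import date


def availability_percent(start: date, end: date, downtime_hours: float) -> float:
    """Percentage of the window [start, end) that was NOT spent in an outage."""
    window_hours = (end - start).days * 24
    return 100.0 * (1.0 - downtime_hours / window_hours)


# Assumption: the 22 short outages (26.9 minutes in total) are SQL Azure outages.
sql_azure = availability_percent(date(2009, 12, 16), date(2010, 3, 25), 26.9 / 60)

# Assumption: the 28 long outages (at least 32 hours in total) are EC2 instance
# outages. They never hit sip1 and sip2 together, so this describes individual
# server instances rather than the sipsorcery service as a whole.
ec2 = availability_percent(date(2009, 11, 1), date(2010, 4, 13), 32.0)

print(f"SQL Azure availability: {sql_azure:.3f}%")
print(f"EC2 instance availability (upper bound): {ec2:.3f}%")
```

The gap between the two is mostly down to recovery time: the short outages cleared themselves within minutes, while the long ones needed a manual reboot.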
To answer the original question of whether a telco can run in the cloud, my answer is “probably”, but if part of that cloud is Amazon’s EC2 then the answer is “probably not”. I know of another SIP based service that has recently started on Amazon’s EC2. Unlike sipsorcery they are Linux based, so it will be interesting to follow their experience.
As for sipsorcery, I’ve identified a promising alternative to EC2 that I hope to migrate the service to at some stage. One big advantage of this alternative provider is that they have F5 load balancers, which would allow the sipsorcery service to be deployed reliably without having to depend on SIP SRV records. SRV records have been shown to be pretty poor as a failover mechanism for SIP clients: when a sip1 outage occurs approximately two thirds of the sipsorcery clients drop off and don’t fail over to sip2. However there are some funding challenges involved in migrating sipsorcery from EC2 and I need to come up with some way to pay for the new cloud host.
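For anyone wondering what SRV failover is supposed to look like on the client side, here is a minimal Python sketch of the selection logic described in RFC 2782: try the lowest priority value first, pick by weight within a priority, and move on to the next record when a target doesn't respond. It is not sipsorcery code and not how any particular SIP client works. The host names, the `SrvRecord` type and the `try_connect` callback are all hypothetical, and the weighting is simplified (a zero weight record is only picked once the weighted ones in its group are used up).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative share within the same priority
    port: int
    target: str


def order_srv_targets(records, rng=random):
    """Return the records in the order a well behaved client would try them."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) == 0:
                pick = rng.choice(group)  # all weights zero: pick uniformly
            else:
                pick = rng.choices(group, weights=weights, k=1)[0]
            ordered.append(pick)
            group.remove(pick)
    return ordered


def first_reachable(records, try_connect):
    """Walk the ordered targets and return the first one try_connect accepts."""
    for record in order_srv_targets(records):
        if try_connect(record):  # hypothetical probe, e.g. a SIP OPTIONS request
            return record
    return None


# Hypothetical records: sip1 is the primary, sip2 the failover.
records = [
    SrvRecord(priority=20, weight=0, port=5060, target="sip2.example.com"),
    SrvRecord(priority=10, weight=0, port=5060, target="sip1.example.com"),
]
```

A client that only ever resolves the first record and never walks the rest of the list will drop off when sip1 goes down, which would be consistent with the behaviour described above. A load balancer takes that decision out of the client's hands.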