Telco in the cloud?

One of the questions the sipsorcery project has been aimed at answering is whether it’s possible to run a telco in the cloud. By cloud I mean a public platform offering on-demand resources at an operating system or database level. In July 2009 the sipsorcery service moved from a single server hosted on private infrastructure to a single server in the Amazon EC2 cloud, and from that point on it could be said to have become a cloud-hosted telecoms service. For a service like sipsorcery the advantages of operating in the cloud were envisaged primarily as cost, scalability and reliability.

Within about 6 weeks of migrating to the EC2 cloud, and after sorting out a few software issues within the sipsorcery code, the Amazon EC2 failures started and they have continued ever since. The symptom was that the underlying Windows virtual server hosting the sipsorcery software just dropped off the network and stopped responding to any type of network traffic: pings, RDP, HTTP, SIP, everything. Initially I suspected a problem in the sipsorcery software, but after a lot of effort over 3 or 4 months, and from keeping an eye on the EC2 forums for people having similar experiences, I became suspicious that it was more than likely an issue with Amazon’s infrastructure at the network, firewall or host level. I’m now 99% sure the problem is at the host level and that the network driver in the Xen virtualisation software is failing under certain conditions. The strongest evidence for this is that when a sipsorcery instance drops off the network it will miraculously start working again around 3 hours later, presumably after the underlying Xen software clears out its network connections cache or something similar.

Since 2006 I have invested a lot of time and effort in working with Amazon’s EC2 and S3 products, and when the sipsorcery EC2 failures started causing a lot of pain I rationalised that an alternative cloud service provider would most likely have their own issues and it was better to stay with the devil you know. To that end I signed up to EC2’s premium support (at USD$100 per month) so I could log a ticket to resolve the issue. Unfortunately that proved fruitless. The advice I received was to try the latest Windows image provided by Amazon and see if that helped, which was reasonable advice, but the updated images exhibited exactly the same issue. I cancelled the premium support and resorted to vociferous whinging on the EC2 forum. After a month and a half of that approach I eventually got an email from an Amazon engineer who did look into it for me and, while not so much acknowledging that the issue was Xen related, suggested I try a Windows image backed by Elastic Block Store (EBS) storage because it had some additional updated Xen drivers. So with high hopes I toddled off again and built up the sipsorcery image, which I had down to a fine art by now. It took a few weeks for the EBS-backed instance to fail and I’d started to believe it had solved the problem, but eventually it did fail and the Amazon engineer didn’t have any further ideas.

So back to the 3 advantages of operating in the cloud. The reliability of the service certainly diminished a lot after moving from a dedicated server, but what about cost and scalability? To overcome the repeated failures of the sipsorcery EC2 instance a second failover instance was utilised, meaning two sipsorcery EC2 instances were now required. In addition both instances needed to be upgraded from small to medium to cope with the increase in sipsorcery users. And often while troubleshooting the failures it was necessary to leave one or two failed instances running in case Amazon ever wanted to check them. Suffice to say the average monthly cost for the sipsorcery EC2 servers is 5 or 6 times what had been originally forecast and is 2 or 3 times more than a dedicated server with equivalent specifications.

That leaves scalability. The first scalability problem for sipsorcery was the database. The initial single EC2 instance deployment used a MySQL database sitting on the same instance. The very first instance failure, combined with a misconfiguration of the MySQL database by me, resulted in the very first sipsorcery users losing their data. To scale the database a better deployment model was needed. About that time two new products came out: one was Amazon’s Relational Database Service (RDS), the other was Microsoft’s SQL Azure. Amazon’s RDS is based on MySQL and since it’s probably in the same data center as the EC2 infrastructure my preference was to use it. However it didn’t take long to realise it was a poor solution. There was no replication and no clustering; the product was simply a single MySQL server sitting on its own instance, which was not much different from what sipsorcery already had. No thanks.

Microsoft’s SQL Azure was unsurprisingly based on Microsoft’s SQL Server and was about 100ms of network away from the sipsorcery EC2 servers, but it was still compelling because it claimed to solve two of the most difficult database problems: handling data replication for fault tolerance, and having a deployment model that allowed it to scale demand across the SQL Azure cloud. But it was a new product and, if the EC2 experience was anything to go by, was it worth the risk? In December of 2009 I did move the sipsorcery data to SQL Azure, and while there have been a few issues that have caused outages to the sipsorcery service, in general it has worked extremely well. Even more importantly, for one of the issues that I experienced and posted about on the SQL Azure forums I got an email from the Microsoft SQL Azure product manager, who got his engineers involved to identify the root cause. And in this case, unlike with Amazon, the Microsoft engineers were able to provide a good explanation and, more importantly, that particular issue hasn’t re-occurred.

Starting from November 2009 I have kept track of all the sipsorcery failures related to the “clouds” and while I won’t list them all a summary is interesting.

SQL Azure

  • Between 16 Dec 2009 and 25 Mar 2010 there were 22 detected outages totalling 26.9 minutes with the longest being 2 minutes and 43s,
  • On the 28th Mar 2010 an approximately 3 minute outage occurred,
  • On the 9th of April 2010 an approximately 3 hour outage occurred however it is not clear whether this was caused by a network issue or an SQL Azure issue. SQL Azure engineers investigated and found no evidence of a problem and during the outage I was able to connect to the database from outside the EC2 network.

EC2

  • Between 1 November 2009 and 13 Apr 2010 there were 28 detected outages totalling over 32 hours with the longest being 8.5 hours,
  • EC2 outages to date have not occurred simultaneously to both servers (sip1 and sip2) so the outage times apply only to a single server instance and not to the overall service,
  • EC2 outages require a server instance to be manually rebooted, which takes a minimum of 15 minutes. I am notified of outages via SMS and email within 30s, but depending on my circumstances the time to take remedial action can vary between instant and up to 8 hours; on average it’s generally less than 30 minutes.
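
To put the two summaries on the same footing, the totals above can be converted into an availability percentage over each measurement window. The following is only a rough sketch: the `availability` helper is a name made up for illustration, the EC2 figure uses 32 hours as a lower bound on downtime (the total was "over 32 hours"), the EC2 number applies to a single instance rather than to the overall service, and the SQL Azure number covers only the 16 Dec 2009 to 25 Mar 2010 window, so it excludes the two later outages.

```python
from datetime import date


def availability(start: date, end: date, downtime_minutes: float) -> float:
    """Percentage of the window [start, end) during which the service was up.

    Hypothetical helper for illustration only.
    """
    window_minutes = (end - start).days * 24 * 60
    return 100.0 * (1 - downtime_minutes / window_minutes)


# SQL Azure: 22 outages totalling 26.9 minutes between 16 Dec 2009 and 25 Mar 2010.
sql_azure = availability(date(2009, 12, 16), date(2010, 3, 25), 26.9)

# EC2 (per instance): 28 outages totalling over 32 hours between 1 Nov 2009
# and 13 Apr 2010. 32 hours is a lower bound on downtime, so this result is
# an upper bound on availability.
ec2 = availability(date(2009, 11, 1), date(2010, 4, 13), 32 * 60)

print(f"SQL Azure: {sql_azure:.2f}%")   # roughly 99.98%
print(f"EC2:       {ec2:.2f}%")         # roughly 99.18%
```

On these assumptions the difference is roughly between "almost four nines" for the database and a little over "two nines" for a single EC2 instance.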

To answer the original question of whether a telco can run in the cloud, my answer is “probably”, but if part of that cloud is Amazon’s EC2 then the answer is “probably not”. I know of another SIP-based service that has recently started on Amazon’s EC2. Unlike sipsorcery they are Linux based, so it will be interesting to follow their experience.

As for sipsorcery, I’ve identified a promising alternative to EC2 that I hope to migrate the service to at some stage. One big advantage of this alternative provider is that they have F5 load balancers, which would allow the sipsorcery service to be deployed reliably without having to depend on SIP SRV records. SRV records have been shown to be pretty poor as a failover mechanism for SIP clients: when a sip1 outage occurs approximately two thirds of the sipsorcery clients drop off and don’t fail over to sip2. However there are some funding challenges involved in migrating sipsorcery from EC2 and I need to come up with some way to pay for the new cloud host.
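
For context on the SRV failover mentioned above, here is a minimal sketch of what a well-behaved client is expected to do with SRV records under RFC 2782: try the target with the lowest priority value first and move on to the next one if it is unreachable. Everything in it is an assumption for illustration: the `SrvRecord` type, the `pick_server` function, the `example.com` hostnames and the priority values are made up, the reachability check is a stand-in, and weight-based selection among records of equal priority is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SrvRecord:
    """One DNS SRV answer (fields as defined in RFC 2782)."""
    priority: int
    weight: int
    port: int
    target: str


def pick_server(records: List[SrvRecord],
                is_reachable: Callable[[str, int], bool]) -> Optional[SrvRecord]:
    """Return the first reachable target, lowest priority value first.

    Simplification: weights within the same priority are ignored.
    """
    for record in sorted(records, key=lambda r: r.priority):
        if is_reachable(record.target, record.port):
            return record
    return None


# Hypothetical records: sip1 is preferred, sip2 is the failover.
records = [
    SrvRecord(priority=20, weight=0, port=5060, target="sip2.example.com"),
    SrvRecord(priority=10, weight=0, port=5060, target="sip1.example.com"),
]

# Simulate a sip1 outage: only sip2 answers.
chosen = pick_server(records, lambda host, port: host != "sip1.example.com")
print(chosen.target if chosen else "no server reachable")
```

The clients that drop off during a sip1 outage are, in effect, stopping after the first target instead of continuing down the list.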

Aaron

  1. dennis

    Hi, I wish I could help. Since you said rebooting the system always brings it back to normal, maybe just reloading the DHCP client service or some other network service would also bring it back to normal.
    Dennis

    1. shaggy

      “However there are some funding challenges involved in migrating sipsorcery from EC2 and I need to come up with some way to pay for the new cloud host”

      add a paypal donate button to this page.

    2. zaheer002

      Aaron, I have been using MSS and then SS. I understand that these are experimental services and they will fall over sometimes. I am happy to make a donation to improve the reliability of the service. Others will too.

      Zaheer

    3. tom

      i would be very happy to donate. how about migrating SS to a sort of non-profit association funded through annual membership dues?

      when i say non-profit i am not suggesting that the engineer(s) not be compensated for their work. i am thinking more in terms of exemption from taxation as a profit making business thus keeping costs low.

    4. tom

      Aaron,

      just curious. would you mind sharing the name of the other EC2 based SIP service?

      Tom

      1. sipsorcery

        cloudvox.com

      2. Perk

        A quick “thanks” for your efforts. Though I am a low-volume caller, I have been very happy with the speed, reliability and quality of the service as of late.

        I second the suggestions of a PayPal donate button. It may feel tacky but it is a solution worth trying — especially given the low cost in time and effort to implement.

        Thanks also for sharing your experiences with database services and hosting in the cloud. I find this information most interesting. Though I love the C# framework for programming, and cloud servers are a great idea, the “lack of pain” when using linux on dedicated servers for real-time applications is very nice, too.
