For those who came late, one of the goals of mysipswitch and sipsorcery was to try out telecoms in the clouds. Both “telecoms” and “clouds” are very broad classifications and can mean just about anything depending on who you are speaking to. In this case by “telecoms” I mean being able to provide a highly responsive and highly reliable, ideally five nines or greater, SIP service. By “cloud” I mean a publicly available infrastructure service at the operating system or database layer that generally has finer granularity than a typical dedicated server and that can be provisioned in minutes rather than days.
The initial manifestation of sipsorcery was called mysipswitch, and at the start of its life in late 2006 it was hosted as a Windows Service on Blue Face’s web server and used Blue Face’s dedicated PostgreSQL SAN. Initial investigations into moving the service to Amazon’s EC2 started in 2008, but it wasn’t until mid 2009 that the move finally took place. At the same time the database was switched to MySQL and the service was renamed sipsorcery. After around 6 months of operating solely from EC2 the sipsorcery database was moved to Microsoft’s SQL Azure and the web services and web site to Microsoft’s Windows Azure. Finally, in mid May 2010, the sipsorcery SIP services and database were moved to a GoGrid dedicated Windows server and the web site to DiscountASP.net. There have been a lot of trials and tribulations along the way, especially in the case of the EC2 service, and the previous posts in this blog provide an insight into those.
For the remainder of the post I’ll write about the experiences of deploying on Amazon’s EC2 and Microsoft’s Azure services. Those are the only clouds I have had a large amount of experience with, and I would argue that EC2 is the pioneer of the operating-system-layer cloud and SQL Azure the same for database-layer cloud services.
By far the most stressful period of mysipswitch/sipsorcery’s life so far has been the one in which it was hosted on Amazon’s EC2. During the drawn-out testing phase, which lasted over a year, and the first few months after deploying the live service to EC2, things ran as expected. In August of 2009 I whipped up an application so people could place SIP calls using Google Voice. The app was only intended to help a few people out, but a side effect was that the sipsorcery service suddenly became a bit more attractive to a wider audience, and consequently the load on the system increased. I’d already had to unexpectedly upgrade from a small instance to a medium CPU instance shortly after the initial deployment to EC2, which meant any cost saving over a dedicated server was now pretty slim. Still, unlike a dedicated server arrangement, the EC2 infrastructure had the advantage of no contracts and on-demand provisioning should the service need to be upgraded further.
It was around October 2009 that the warning bells started to go off, after a series of outages that I couldn’t trace back to any issue with the sipsorcery software. I was keen enough to get the problem sorted out that, after not getting anywhere on the public EC2 forums, I subscribed to premium support for a couple of months (at $100 per month) so I could log a ticket. After having the sipsorcery server monitored with CloudWatch and trying out the various suggestions from the helpful Amazon tech support, I basically got nowhere, and neither I nor they were any wiser as to why the outages were occurring. I cancelled the premium support but kept agitating on the public forums by keeping a rolling record of the outages. Eventually my persistence paid off and I was contacted by an Amazon engineer who seemed to be a bit more experienced than the technical support reps I had been dealing with previously. He also had a number of suggestions, such as trying the latest Windows AMIs (Amazon Machine Images) and using an EBS (Elastic Block Store) backed image instead of the standard host-based storage. It required a fair bit of effort to test out different AMIs, but I dutifully went ahead with each suggestion, still desperately keen to resolve the issue and avoid the outages that were plaguing sipsorcery. Eventually, however, there was nothing left to try, and while we were able to narrow the issue down as likely being between Windows and the network driver of the hypervisor (which is Xen, I believe), there was no solution and the only option I was left with was to migrate back off EC2.
It was a painful experience, not just because of the outages but because of the time and effort I had invested in the Amazon cloud infrastructure, particularly EC2 and S3. Now that was all going to go to waste because the service was unusable. I’d already deployed a second EC2 instance to cope with the failures, and at this point the running costs were around USD600/month, which it’s fair to say was a lot higher than the USD50 I had envisaged when the service was originally deployed to EC2.
The conclusion from the sipsorcery Amazon EC2 experience is that the EC2 platform is not suitable for software that needs to utilise a large number of network connections on top of a Windows OS. Regardless of whether it’s a problem with the network driver in the Xen software, there is an issue at the network layer that causes Windows instances to lose network connectivity. The sipsorcery experience was not unique: I read of, and had explicit confirmation from, others on the EC2 public forums experiencing the same problem.
Originally the sipsorcery web services were deployed on the same EC2 instance as the SIP services. When the EC2 outages started occurring and a second EC2 instance was brought into service, the web services needed to be moved, otherwise the service could be left in a state where the failover SIP server was up but the web server was down. With SIP, DNS SRV records can be used as a failover mechanism (the SRV record mechanism had its own set of problems, but that’s the topic for another post). HTTP has no equivalent, so the best option was to move the web services to an alternative platform.
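For the curious, the failover order a SIP client is supposed to derive from SRV records (per RFC 2782: lowest priority group first, weighted-random selection within a group) can be sketched roughly as follows. The records and hostnames are made up for illustration, not real sipsorcery DNS data:

```python
import random

# Hypothetical SRV records for a SIP domain, as (priority, weight, target, port)
# tuples. Lower priority is tried first; weight breaks ties probabilistically.
SRV_RECORDS = [
    (10, 60, "sip1.example.com", 5060),
    (10, 40, "sip2.example.com", 5060),
    (20, 0,  "backup.example.com", 5060),
]

def srv_failover_order(records, rng=random):
    """Return (target, port) pairs in the order a client should try them."""
    order = []
    # Process priority groups from lowest (most preferred) to highest.
    for priority in sorted({r[0] for r in records}):
        group = [r for r in records if r[0] == priority]
        # Within a group, repeatedly pick one record weighted-randomly
        # until the group is exhausted.
        while group:
            total = sum(r[1] for r in group)
            pick = rng.uniform(0, total) if total else 0
            running = 0
            for r in group:
                running += r[1]
                if running >= pick:
                    group.remove(r)
                    order.append((r[2], r[3]))
                    break
    return order
```

The key property is that the priority-20 backup host is only ever contacted after both priority-10 hosts have been tried, which is exactly the behaviour a web site fronted by plain A records cannot get.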
At the time Windows Azure was just about to come out of beta, and, always being keen to test out new cloud services, I decided it was a good candidate. The web services survived on Windows Azure for just under 6 months. In its case I didn’t encounter any major technical issues; instead the two big shortcomings were cost and deployment. On cost, the service was running to over USD100 a month once the hosting, network and storage charges were taken into account, which is roughly 4 times what the same thing costs with the new specialist ASP.Net hoster. Apart from cost, the Windows Azure deployment process is incredibly painful for this day and age. To start with you have to create a special project in Visual Studio and then upload two files manually through a browser interface. You also have to decide whether you’d like the service to be unavailable for the duration of the process, which in my case was often well over 10 minutes, or whether to deploy into a staging environment first and then swap staging with production. I always chose the latter, but it meant a few additional steps and some additional costs. I quickly found that hosting a web site on Windows Azure was not a good idea: a single spelling mistake in a web page could result in a 10 minute upgrade. Compare that to the One-Click Publish mechanism supported by IIS7, which the current sipsorcery web hosting provider supports; it means the whole web site can be deployed with a single button click from within Visual Studio, and if the change is a single HTML file the publish will be done in seconds. It beats me why Windows Azure doesn’t support the one-click publish mechanism. I guess they eventually will, but even then the cost of the service is still likely to be unattractive.
The conclusion for Windows Azure is that it’s too expensive and the deployment mechanism is way too clunky for a web application. It also doesn’t support UDP, so it’s not suitable for SIP applications.
The hardest part of scaling an internet application is at the database layer. Replication and load balancing are things that modern-day databases still struggle with. PostgreSQL doesn’t have a great solution (I’ve personally gone through the dramas of using log shipping and it was not fun); MySQL has a good replication solution, but for failover you need 6 servers, which is prohibitive; and Oracle and Microsoft SQL Server have their own high-availability solutions, but the licensing costs are horrendous.
When I first read about and started trying out SQL Azure I thought it was a gift from the gods. Not only does it purportedly take care of all those really painful database challenges, such as replication, load balancing and failover, but it does so under the hood, and for a database of 1GB or less it only costs USD9.99/month (that’s ten dollars, just to avoid possible confusion about typos). Unlike Windows Azure there are no deployment challenges to deal with: SQL Azure can be connected to with the standard SQL Server Management Studio tool, which is freely downloadable from Microsoft, and connecting to it from an application is just the same as for any other SQL database.
The sipsorcery database was migrated to SQL Azure in January 2010. It lasted for less than six months. During that period there were some minor technical issues that did affect the sipsorcery service. However, the SQL Azure SLA only quotes a monthly availability of 99.9%, which allows for up to 43.2 minutes of downtime per month, and the outages were always a lot less than that. In addition the minor issues were not unexpected, both because the service was new, only going into production in January 2010, and given the significant technical challenges the service is dealing with. The hope was that as the months went on the reliability would increase, and that over time the SLA would be lifted to 99.99%, which would get it closer to the five nines generally accepted for a telecoms service.
Unfortunately my hopes were yet again to be dashed. When the sipsorcery service was migrated off EC2 and onto a single dedicated server, it resulted in an increase in the number of database connections being created to SQL Azure from a single IP address. Almost immediately after the move the sipsorcery processes would stop being able to connect to the SQL Azure database for anywhere from 15 to 60 minutes. To keep things running I was forced to turn off a number of the less critical sipsorcery services, such as the Registration Agent, which meant no 3rd party registrations were being carried out and in a lot of cases incoming calls from SIP providers would not work. This time I was hopeful of getting a resolution: it seemed to me the cause was almost certainly a firewall or some kind of security mechanism in front of SQL Azure, and, unlike in the Amazon EC2 case, this time the support engineers worked for Microsoft, the same company that wrote the software for SQL Azure, so there was an escalation path all the way to the engineers cutting the code.
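The usual band-aid on the application side for this kind of transient connection failure is retry logic with exponential backoff around the connect attempt. A minimal sketch of the idea (the function name and delays are my own for illustration, not sipsorcery’s actual code):

```python
import time

def with_retries(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run 'operation' (any callable that raises on failure), retrying
    transient failures with exponentially increasing delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Give up only after the final attempt fails.
            if attempt == attempts - 1:
                raise
            # Back off 0.5s, 1s, 2s, ... before trying again.
            sleep(base_delay * (2 ** attempt))
```

Backoff like this helps ride out short blips, but it is no answer to a 15 to 60 minute block: by the time the server starts accepting connections again, registrations and incoming calls have long since failed.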
In the email to SQL Azure support logging the initial issue on the 24th of May 2010 I wrote:
“It looks very much like a denial of service protection mechanism on the SQL Azure end”
On the 28th of May 2010, after a series of over 20 emails, quite a few of which were repeated suggestions on how to use connection pooling or best practices for SQL Azure, the diagnosis from technical support was:
…that SQL Azure is treating it as a Denial of Service (DOS) attack and resets the connection…
And that the issue would be escalated. A week later I got an email from a new SQL Azure technical representative stating he would be taking over the issue, and the latest response, on the 17th of June 2010 after an exchange of another 8 emails, was:
“I went through our product specifications and found that the DOSGuard disconnects connections if there are repeated failed login attempts from a particular IP address. Currently there is no option to disable DOSGuard for a particular client”
And another link to SQL best practices… As in the Amazon EC2 case, technical support have been very responsive, but I do find it a bit ironic that it’s taken so long to diagnose a problem obvious enough that I could make a correct educated guess at it from the outset. What would actually be useful is to know what the “DOSGuard” rules are, to see if there is any way the sipsorcery connection strings can be configured so as not to fall foul of them. I doubt there is though: the connection block was happening after between 15 and 21 connections were established, and currently the sipsorcery services have just under 50 connections established with the local MySQL database that replaced the sipsorcery SQL Azure database.
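For what it’s worth, the standard way to keep the connection count from a single host bounded, whatever the server-side limit turns out to be, is a fixed-size connection pool. A minimal sketch, using sqlite3 in-memory databases as a stand-in for a remote SQL server (the class and sizes are illustrative, not sipsorcery’s actual code):

```python
import queue
import sqlite3

class ConnectionPool:
    """A fixed-size pool: the server never sees more than 'size'
    connections from this process, no matter how many workers run."""

    def __init__(self, size=5, factory=lambda: sqlite3.connect(":memory:")):
        self._pool = queue.Queue(maxsize=size)
        # Open all connections up front so the count is constant.
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        # Blocks (or raises queue.Empty on timeout) when every connection
        # is checked out, rather than opening a new one.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

A pool capped at, say, 15 connections would presumably stay under DOSGuard’s threshold, but only at the cost of serialising the sipsorcery services’ database access, which is exactly the trade-off the real service couldn’t afford.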
The conclusion as far as SQL Azure goes is that it’s a potentially groundbreaking service that has an awesome pricing structure and solves some really difficult problems. However, it appears to be targeted exclusively, or at least predominantly, at web applications, and an application such as sipsorcery, which is a bit left of field, doesn’t fit into SQL Azure’s operational parameters. I suspect that over time, as SQL Azure matures, issues such as those experienced by sipsorcery will be ironed out, and the DOSGuard rules will get more sophisticated and stop generating false positives at rates as low as 20 connections. Hopefully sipsorcery will be able to go back to SQL Azure or an equivalent product in the future, because running a standalone database, as is currently the case, is far from an ideal solution.
So ends the story of sipsorcery’s travels in the clouds, at least for now. The service went from a single-server deployment to one encompassing 4 different servers on 3 different clouds, and then almost back to where it started from. The moral of the story is that running a telecoms service from a public cloud may presently be a stretch too far. Maybe sticking to Linux-based images on EC2 would make it more feasible, but my feeling is that even in that case it would be a struggle. Public clouds are the way of the future, simply because of cost and flexibility (it took me a month to negotiate the contract for sipsorcery’s current dedicated server), and while they are undoubtedly suitable for a huge range of applications right now, for services that need to be highly responsive and maintain 99.999% uptime they still have some maturing to do.
On the upside, in the month since sipsorcery went back to a dedicated server and a MySQL database there has only been one minor outage, and that was related to an issue with the sipsorcery software rather than the infrastructure. That means a lot fewer tweets and a much better service.