Occasional posts about VoIP, SIP, WebRTC and Bitcoin.
The SIPSorcery server experienced a 90 minute outage on the 15th of August at approximately 1700 PST. The first purpose of this blog post is to apologise to all the people affected by the outage. The second is to provide a summary of the technical information that has been gathered about the cause of the outage.
The server event logs corresponding to the outage contained numerous log messages like the one below.
Event ID: 333
An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in, or write out, or flush, one of the files that contain the system’s image of the Registry.
From lengthy research it appears the error message can be caused by a variety of conditions on a Windows server, but the predominant one is typically related to an exhaustion of resources in the underlying operating system. The resource could be pooled memory, handles, registry size limits or others. So far the only way found to recover from the issue is to reboot the server, which is what happened with this outage.
One question is why this issue would crop up now when the SIPSorcery server has been running in the same configuration for over a year. The answer may lie in the fact that the server hardware was recently upgraded and, at the same time, the MySQL database was updated from version 5.1 to the latest 5.5 release. It could be that the different hardware configuration, the new MySQL software, a combination of the two or something else entirely has caused the issue to surface.
The SIPSorcery server is closely monitored on a daily basis for performance characteristics such as CPU utilisation, memory utilisation, threads, SIP traffic, disk IO and more. However in this case no warning signs were spotted beforehand. At this point the prime suspect is the MySQL service, and more detailed monitoring has now been put in place to track the resource usage of the MySQL process.
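As an illustration of the kind of per-process monitoring now in place, here is a minimal sketch (not the actual monitoring code) that uses the psutil library to periodically log the memory, handle and thread counts of the mysqld process; the log path and sample interval are placeholders.

```python
# Minimal sketch of per-process resource monitoring, assuming the psutil
# library is available and the MySQL daemon runs as "mysqld.exe".
# The log path and interval are illustrative, not the actual setup.
import time
import psutil

LOG_FILE = "mysql_resources.log"   # hypothetical output file
SAMPLE_INTERVAL_SECS = 60

def find_mysqld():
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] and proc.info["name"].lower().startswith("mysqld"):
            return proc
    return None

def sample_forever():
    while True:
        proc = find_mysqld()
        with open(LOG_FILE, "a") as log:
            if proc is None:
                log.write(f"{time.ctime()} mysqld not running\n")
            else:
                mem_mb = proc.memory_info().rss / (1024 * 1024)
                handles = proc.num_handles()   # Windows-only counter
                threads = proc.num_threads()
                log.write(f"{time.ctime()} rss={mem_mb:.1f}MB "
                          f"handles={handles} threads={threads}\n")
        time.sleep(SAMPLE_INTERVAL_SECS)

if __name__ == "__main__":
    sample_forever()
```

A log like this makes it possible to see whether the mysqld handle or memory counts creep upwards in the hours before an incident rather than relying on whole-server graphs.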
The short term goal is to identify the cause of the issue, whether it be related to MySQL or otherwise, and fix it. The medium term goal is to look into adding hardware redundancy to the SIPSorcery service. There will always be issues with server hardware, operating systems etc., and a single server system will always be vulnerable. Up until this point, with the SIPSorcery service operating largely as a free offering, it was not viable to add additional hardware to the platform. Now that the service is generating some revenue from Premium accounts there is scope to look at enhancing the platform. I will keep the blog updated with developments as they arise.
I apologise again to any users affected by today’s outage, and by a similar but shorter one on the 8th of August, and would like to assure users that reliability is the top priority of the SIPSorcery service and the focus of the majority of my efforts. There are also real-time status updates on the availability of the SIPSorcery service on this blog site at Status Graphs. A red line on the monitoring graphs indicates that a simulated call request to a SIPSorcery application server timed out after 15s or did not get an appropriate response. There are now 3 remote monitoring servers sending probes to the SIPSorcery server, and the fourth graph is for a monitoring service that runs on the server itself in order to distinguish between network and service problems. In the event that an outage does occur I always endeavour to issue updates as frequently as possible on the SIPSorcery twitter account.
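For anyone curious what such a probe looks like, here is a hedged sketch of a SIP availability check with a 15 second timeout. The real monitoring uses simulated call requests rather than this exact message, and the host, port and identifiers below are placeholders.

```python
# Minimal sketch of a SIP availability probe, assuming the target accepts
# SIP over UDP. Host, port and identifiers are placeholders; the real
# monitoring uses simulated call requests rather than this exact message.
import socket
import uuid

TARGET_HOST = "sip.example.com"   # placeholder, not the real server
TARGET_PORT = 5060
TIMEOUT_SECS = 15

def probe(host, port):
    call_id = uuid.uuid4().hex
    request = (
        f"OPTIONS sip:probe@{host} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP 0.0.0.0:5060;branch=z9hG4bK{call_id}\r\n"
        f"From: <sip:monitor@invalid>;tag={call_id[:8]}\r\n"
        f"To: <sip:probe@{host}>\r\n"
        f"Call-ID: {call_id}\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Max-Forwards: 70\r\n"
        "Content-Length: 0\r\n\r\n"
    ).encode()

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(TIMEOUT_SECS)
    try:
        sock.sendto(request, (host, port))
        data, _ = sock.recvfrom(4096)            # any SIP response counts as "up"
        return data.split(b"\r\n", 1)[0].decode(errors="replace")
    except socket.timeout:
        return None                              # no response within 15s: flag red
    finally:
        sock.close()

if __name__ == "__main__":
    status = probe(TARGET_HOST, TARGET_PORT)
    print("DOWN (timeout)" if status is None else f"UP: {status}")
```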
I’ve been working with telco software for the last 7 years and far and away the most painful part on the server side is a reliable database set-up. NAT would take the cake for the most painful thing overall. This post was inspired by the sipsorcery MySQL service crashing yet again.
I have the service configured to automatically restart, so the crash would hardly have been noticeable.
In the last couple of months the previously rock solid MySQL instance on the sipsorcery server has become increasingly flaky and the process is now crashing on average once a week. From the error logs the suspect is something to do with MySQL’s SSL handling, and I did find an issue on the MySQL bug list that “may” be the cause. I need to schedule time to upgrade to the latest MySQL version, but that’s not a trivial matter and will probably mean between 30 and 60 minutes of downtime.
The recent problems I’ve had with MySQL highlight a recurring theme I’ve experienced throughout my time working with VoIP/telecommunications software: building and maintaining a five nines reliability database is either very expensive or very tough.
The database I used for my first VoIP platform was Postgresql. All in all it was pretty reliable but issues did crop up. In one case the data files got corrupted due to a Postgresql bug handling Unicode on Windows. After recovering from that we moved to a Linux platform and suffered a couple of outages due to hardware and operating system issues. That led us to bite the bullet and spend a lot of money (for a small company) on a Solaris SunFire server and storage array. That ended up being disastrous: first a firmware fault in the fibre channel controller on the storage array caused some major outages and a lot of messing around to replace, and after sorting all that out there were a few kernel panic incidents that I can’t recall the cause of. The problem with Postgresql at the time was that it didn’t have a built-in replication/failover solution. There were side projects which we tested out, but they were all immature and some introduced prohibitive performance penalties. We ended up using archive log shipping, where the transaction logs from our main server were copied over to a standby Linux Postgresql server. Unfortunately that caused a couple of outages as well: the disk on the Linux server filled up, the Linux Postgresql instance stopped applying the transaction logs, and that caused the primary Postgresql instance on the Solaris server to shut down to preserve the integrity of the data. Eventually things settled down, but it was serious enough at the time that it was considered a threat to the survival of the business.
The Solaris experience caused us to appreciate even more how important a reliable database was so we had a chat to Oracle. They promised a multi-server, real-time fail over system but the price was exorbitant and way out of our league. We were desperate enough to consider it at the time but in the end common sense prevailed and we soldiered on with Postgresql.
With mysipswitch and sipsorcery it was a chance to try and find a better solution in a less demanding environment. By that I mean if something went wrong it was an inconvenience to people, but they were warned in advance that the system was experimental and came with no guarantees. The mysipswitch service actually spent its whole life using the same Postgresql database I mentioned above, and it was only when the service morphed into sipsorcery that a different database approach was attempted. The sipsorcery service was initially deployed for a very unhappy year on Amazon’s EC2 infrastructure. Initially only a single server deployment using a local MySQL database was used. When the EC2 instance started going down every second day, the deployment model changed to two servers with an SQL Azure database. I actually thought at the time SQL Azure was finally the perfect solution. It was cheap at $10 a month and all the hard things about running a database were taken care of by Microsoft. However there were a few glitches that caused 5 minutes of downtime here and there, and at the time I put it down to the fact that the service was brand new; this was in Jan 2010 when SQL Azure had only just been opened for service. The small outages were bearable but the real problem came when I finally had to give up on EC2 and move to a dedicated server. At that point the SQL Azure database I was using started getting the connection requests that were previously spread over two EC2 instances from a single dedicated server. It wasn’t a huge number of connections, between 20 and 30, but it was enough to cause the SQL Azure denial of service software (called DOSGuard apparently) to drop all connections for up to an hour. SQL Azure support were happy to send me emails back and forth for nearly 3 months about the issue, but in the end it was something they couldn’t or wouldn’t fix. In my opinion it’s a strange limitation and one that probably stems from the fact that SQL Azure is mainly pitched as a solution for web applications. Apparently the sipsorcery software was getting flagged because the connections were coming from seven different processes.
So after SQL Azure it’s back to a local MySQL instance on the dedicated sipsorcery server. It’s almost tempting to switch it to Postgresql to complete the circle.
At one point I did look at NoSQL options like Amazon’s SimpleDB, but the latency was a killer, with even the simplest queries taking almost a second. I also checked out some of the other similar offerings but they all appeared geared up for web applications, where response times of up to 500ms aren’t a problem. For sipsorcery the response times need to be well under 100ms.
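For what it’s worth, this sort of comparison boils down to wall-clock timing of a representative lookup against each candidate store. A rough sketch, where lookup_sip_account is a hypothetical stand-in rather than the real data access call:

```python
# Rough sketch of measuring query latency against a data store, to compare
# candidates against the <100ms target. lookup_sip_account() is a hypothetical
# stand-in for whatever query the real data access layer performs.
import statistics
import time

def lookup_sip_account(username):
    # Placeholder: replace with a real query against the store under test,
    # e.g. a MySQL SELECT or a SimpleDB Select call.
    time.sleep(0.005)
    return {"username": username}

def measure(samples=50):
    timings_ms = []
    for i in range(samples):
        start = time.perf_counter()
        lookup_sip_account(f"user{i}")
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings_ms), max(timings_ms)

if __name__ == "__main__":
    median_ms, worst_ms = measure()
    print(f"median={median_ms:.1f}ms worst={worst_ms:.1f}ms (target: well under 100ms)")
```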
I also know MySQL has replication and load balancing options and I have explored them. The problem is that to get automatic failover it needs something like 6 servers. Without automatic failover there’s not a lot of benefit for sipsorcery in replicating data to a standby node. That means Amazon’s RDS service is also not a great option.
I’d love to hear if anyone knows of any other type of service out there that might be worth looking into.
After 25 days of no EC2 issues there have now been 3 failures in the last 24 hours, including one 10 minutes ago. At least one of sip1 or sip2 has been up at all times during this latest batch of outages, so people with SIP clients that support SRV records should not have noticed any issues.
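For anyone wondering how the SRV failover works: a client that supports SRV records resolves a prioritised list of servers for the domain rather than a single address and tries them in order. A minimal sketch of that lookup, assuming the dnspython library and the standard _sip._udp record name:

```python
# Minimal sketch of the SRV lookup a failover-capable SIP client performs,
# assuming the dnspython package (2.x). The record name follows the standard
# _sip._udp.<domain> convention; adjust if the service publishes TCP records.
import dns.resolver

def resolve_sip_srv(domain):
    answers = dns.resolver.resolve(f"_sip._udp.{domain}", "SRV")
    # Lower priority values are tried first; weight breaks ties.
    records = sorted(answers, key=lambda r: (r.priority, -r.weight))
    return [(str(r.target).rstrip("."), r.port, r.priority, r.weight)
            for r in records]

if __name__ == "__main__":
    for host, port, priority, weight in resolve_sip_srv("sipsorcery.com"):
        print(f"{host}:{port} priority={priority} weight={weight}")
```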
When an outage occurs I get an email and SMS alert, and if I’m in the vicinity I will reboot the inaccessible EC2 instance to restore it to service; generally that process takes about 30 minutes. If I’m not in the vicinity it will obviously take longer.
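For reference, the reboot doesn’t have to go through the AWS web console; it can be scripted against the EC2 API. A hedged sketch using the boto3 library (a modern library, not what was available at the time), with placeholder region and instance ID and credentials assumed to be configured in the environment:

```python
# Sketch of rebooting an unresponsive EC2 instance through the API using boto3.
# Region and instance ID are placeholders; AWS credentials are assumed to be
# configured in the environment.
import boto3

REGION = "us-east-1"                 # placeholder
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

def reboot_instance(region, instance_id):
    ec2 = boto3.client("ec2", region_name=region)
    ec2.reboot_instances(InstanceIds=[instance_id])
    # The reboot call is asynchronous; wait until the instance passes its
    # status checks again before declaring it healthy.
    waiter = ec2.get_waiter("instance_status_ok")
    waiter.wait(InstanceIds=[instance_id])

if __name__ == "__main__":
    reboot_instance(REGION, INSTANCE_ID)
    print("Reboot issued and instance passed status checks.")
```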
For anyone that wants to track the actions I take to recover from failures I’ve created a sipsorcery twitter account where I will post messages relating to the service status.
At the moment I’m unable to properly connect to the sipsorcery services or any other Amazon sites I have tested, including www.amazon.com. I’m situated at the bottom of Australia (literally next stop Antarctica; the ice breakers sail past my house on the way out), so whatever network issue is causing the problem is hopefully not affecting too many people elsewhere around the globe. From the looks of the forums no one else has noticed an issue, as normally there’s a big spike in activity whenever sipsorcery has a problem.
On a ping test to the sipsorcery SIP servers the packet loss is about 95%, which is pretty weird but does indicate it’s probably a routing glitch somewhere rather than a cable cut. Hopefully it will clear itself up in a few hours. There seem to have been a few fun and games around the internet in the last 24 hours, including the WordPress downtime, and I wouldn’t be surprised if the issue I’m experiencing has cascaded down from whatever caused that.
I was hunting around on the Amazon EC2 forums regarding an issue I’m having bundling a new AMI and decided I’d do a quick search to see if anyone else was having issues with unresponsive instances caused by DHCP leases. Lo and behold there are quite a few and they seem to be growing. This thread is a fairly typical example: “Instance not responding”. I didn’t think to search the forums previously, which in hindsight was pretty silly, as I was logging the issue with Amazon premium support and assumed they would be a better source of information than the public forum.
It’s been a tough battle to get the sipsorcery server stable with the memory leak issues in the DLR or the IronRuby library (it’s not clear which, despite some extensive investigations), so to finally solve that only to have the underlying server start causing issues is annoying to say the least.
Despite the EC2 issues it’s definitely a worthwhile goal to have two redundant sipsorcery servers, so I’ll keep working towards that, but if the instability on EC2 continues then I may have no choice but to migrate as soon as something better becomes apparent.
Update 16 Dec 2009: Had yet another recurrence of the sipsorcery.com EC2 instance losing its DHCP lease and becoming inaccessible. I’m going to add every incident to this thread on the EC2 forums: “Instance not responding”.
An outage of the sipsorcery service occurred for almost exactly 48 hours between the 11th and 13th of November. The cause of the outage is not exactly known but it is the same as the previous 5 outages of which the most recent was on the 22nd of October. I’m pretty sure the issue is at the operating system level and possibly something to do with the Windows virtualisation configuration being used by Amazon’s EC2 cloud. I’ve had a ticket open on the issue with Amazon since the first instance but they have not been able to identify anything wrong and apparently the issue isn’t occurring for anyone else. The last message in the Windows event log prior to this and the other outages is along the lines of:
11/12/2009 11:27:18 AM: EventLogEntry: Error 11/12/2009 11:27:13 AM Dhcp Your computer has lost the lease to its IP address 10.248.58.129 on the
Network Card with network address 123139023573.
That seems fairly clear-cut, but neither I nor Amazon support have been able to work out why the DHCP lease renewal fails. In addition, since the last incident I have turned on logging for the sipsorcery server’s Windows firewall to see if it could shed any further light on the issue. Looking at the log there is a big gap of over 7 hours where no messages were logged at all, which I would guess means the network subsystem had been shut down altogether, but for the rest of the time there are a lot of connections being established to the DNS server, and it’s a mystery why the sipsorcery SIP and other traffic could not be sent or received.
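The 7 hour gap was spotted by eyeballing the log, but the same check can be scripted. A rough sketch that scans a Windows Firewall log for silent periods, assuming the standard pfirewall.log format where each entry begins with a date and time field; the path and gap threshold are placeholders:

```python
# Rough sketch of scanning a Windows Firewall log (pfirewall.log format,
# where each entry begins with "YYYY-MM-DD HH:MM:SS") for long silent gaps.
# The log path and gap threshold are placeholders.
from datetime import datetime, timedelta

LOG_PATH = r"C:\Windows\system32\LogFiles\Firewall\pfirewall.log"
GAP_THRESHOLD = timedelta(minutes=30)

def find_gaps(path, threshold):
    gaps = []
    previous = None
    with open(path) as log:
        for line in log:
            if line.startswith("#") or not line.strip():
                continue  # skip header/comment lines
            fields = line.split()
            try:
                current = datetime.strptime(f"{fields[0]} {fields[1]}",
                                            "%Y-%m-%d %H:%M:%S")
            except (IndexError, ValueError):
                continue  # malformed line
            if previous and current - previous > threshold:
                gaps.append((previous, current))
            previous = current
    return gaps

if __name__ == "__main__":
    for start, end in find_gaps(LOG_PATH, GAP_THRESHOLD):
        print(f"No firewall log entries between {start} and {end} ({end - start})")
```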
As to why I wasn’t around to fix it: I was on a 3 day break and, more by design than chance, happened to be somewhere with no electricity grid, let alone mobile signal or internet.
[Embedded Google Map: Bruny Island, Tasmania]
I wasn’t expecting an incident in the 3 days I was away, as statistically they have been averaging about one a month and it would be unlucky for that one time to coincide with me being away. Unfortunately that’s exactly what happened.
As to what’s being done about it, the answer is in the previous post about incorporating Amazon SimpleDB as the storage layer. Without repeating that and earlier posts: once that job is done it will be possible to have two redundant sipsorcery servers running, so if an operating system incident like this occurs the other server will still be available. It’s a big job and goes beyond just switching the data access layer software; for example a number of the sipsorcery services, such as the monitoring, need to be aware of the different instances running on each server. I’ve been working on these tasks flat out for over 2 months now and am getting there.
The other question that could be asked is why stick with Amazon’s EC2 if this issue is at the OS layer and Amazon support can’t help identifying it. That is something I have pondered a fair bit as well. The Amazon EC2 instances aren’t that cheap at the end of the day and there are other compute cloud environments out there. However the Amazon EC2 infrastructure is the oldest and therefore most mature of the clouds and also has by far the best strategy with new services being regularly introduced. I also suspect that shifting to another cloud could just as easily involve introducing the same sort of operational issue and given the amount of effort I have already put into working with the Amazon infrastructure it’s definitely a case of “better the devil you know”.
Finally this does really highlight how vulnerable the sipsorcery service is due to having only one developer/administrator. This particular issue is solved by a reboot of the server, but it’s not as simple as giving someone a username and password so they can remotely access and reboot it: anyone with that access can potentially get at all the sipsorcery user information, so it needs to be a suitably trusted person. Ideally what I’m hoping for is a C# developer with an interest in SIP/VoIP to come along and, once a level of trust has been established and they have shown they understand the technology well enough not to go rebooting every time someone posts about an ATA issue, be given admin rights to the sipsorcery server(s). That being said I’m open to any other suggestions about how the sipsorcery service could be run or administered for the benefit of everyone, provided any such suggestion takes into account the need for a high level of trust and security.
Addenda
The sipsorcery server had an outage yesterday. Based on the logs the outage lasted about 5 hours, from 1553 UTC until 2056 UTC.
The cause of the outage was the Amazon EC2 instance that the sipsorcery servers run on seemingly losing network connectivity. This is the 3rd (possibly 4th) time this has happened and I’ll be putting a ticket in to Amazon support to see if there is any more information about it since the last ticket.
While it’s annoying and the frequency of the incidents is way too high, things do go wrong and servers do crash. For the sipsorcery service to become more reliable it will need to be able to cope with losing a server instance. The work I have been doing on incorporating Amazon SimpleDB as sipsorcery’s data repository is with precisely that goal in mind. It will provide a scalable, reliable (hopefully more so than the EC2 instance) and shared data layer that two independent sipsorcery instances can utilise. If one instance drops off for whatever reason the other one will still be available. With the EC2 cloud having expanded into Europe, one sipsorcery instance could run in Amazon’s European data centre and the other in the US one, which would hopefully make it very unlikely that both instances would have an operating system or hardware issue simultaneously.
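As a rough illustration of the shared data layer idea (not the actual sipsorcery data access code), the sketch below writes and reads a SIP account record in a SimpleDB domain using the boto3 library, which post-dates these posts; the domain name and attributes are hypothetical.

```python
# Hedged sketch of using Amazon SimpleDB as a shared data layer that two
# independent server instances could both read and write. Uses boto3's "sdb"
# client; the domain name and attributes are hypothetical, not the real schema.
import boto3

sdb = boto3.client("sdb", region_name="us-east-1")  # placeholder region
DOMAIN = "sipaccounts"                               # hypothetical domain

def save_sip_account(username, password, owner):
    sdb.put_attributes(
        DomainName=DOMAIN,
        ItemName=username,
        Attributes=[
            {"Name": "SIPPassword", "Value": password, "Replace": True},
            {"Name": "Owner", "Value": owner, "Replace": True},
        ],
    )

def load_sip_account(username):
    result = sdb.get_attributes(DomainName=DOMAIN, ItemName=username,
                                ConsistentRead=True)
    return {attr["Name"]: attr["Value"] for attr in result.get("Attributes", [])}

if __name__ == "__main__":
    sdb.create_domain(DomainName=DOMAIN)   # idempotent if it already exists
    save_sip_account("user1", "secret", "user1@example.com")
    print(load_sip_account("user1"))
```

Because the data lives outside either server, an instance in the European data centre and one in the US could both call save_sip_account and load_sip_account against the same domain.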
There was a bit of Murphy’s Law with this outage as well. I do have monitoring set up for the sipsorcery.com server and get sent an SMS if it stops responding. Last night, just as I was going to bed, my phone was giving those annoying beeps to indicate the battery was low, and since I couldn’t be bothered to go and find the charger I turned it off until morning. Of course 3 hours later the sipsorcery instance lost its network connectivity and an SMS was sent to a switched-off phone. Apart from that it’s debatable whether I would, one, hear and, two, get up and check an SMS that arrived at 0300, but going on recent history I probably would. I was up briefly at 0500 to give my daughter back her dummy so I would have spotted it then as well. But as it happened I didn’t become aware of sipsorcery being down until around 0745 when I checked my office phones and saw they weren’t registered. Ten minutes and a reboot later (thankfully EC2 instances can be rebooted through a web browser, as there was no other way to communicate with the sipsorcery one) all was back to normal.
The above paragraph is not what you want to read when considering the support arrangements of your VoIP service but that’s why it’s free :).
It’s now been 3 weeks since the Isolated Process dial plan processing mechanism was put in place on the sipsorcery service. The news on it is good: while there were a few tweaks required in the first couple of weeks, which were more down to preventing some users from initiating 20+ simultaneous executions of their dialplans, in the last week there have been no software updates or restarts required. During that time the sipsorcery application server, which processes the dial plan executions and has been the trouble spot, operated smoothly with no issues.
As discussed ad nauseam in the past, the root cause of the reliability issue on the services is a memory leak either in the Dynamic Language Runtime (DLR) or in the integration between sipsorcery and the DLR. The solution has been to isolate the processing of the dialplans in separate processes and periodically recycle those processes.
I now feel pretty comfortable about the reliability of the sipsorcery application server and am reasonably confident that a solution to the instability issue that has plagued mysipswitch and sipsorcery has been found, at least for sipsorcery. As also mentioned previously, the mysipswitch service cannot be easily updated anymore since the code has diverged significantly since its last upgrade in November of last year. I would now recommend that people migrate from mysipswitch to sipsorcery for greater reliability. There were two cases where the mysipswitch service needed to be restarted in the last week, due to the “Long Running Dialplan” issue and a failed automated restart. On average the mysipswitch does need one restart a week. If the restart happens to coincide with times when I or Guillaume are able to access the server, which is when we are not asleep and, in my case, at work, it’s fine. If it’s outside those times it can be up to 8 hours.
Update: Of course no sooner had I posted about stability than there was a problem. Approximately 5 hours after posting the above, the dial plan processing on the Primary App Server Worker failed, with calls receiving the “Long Running Dialplan” log message. The memory utilisation of the App Server was low, around 120MB, and the process was responding normally; if it had not been, the Call Dispatcher process would have killed and recycled it. The thing that was failing was script execution by the DLR. This provides some new information and it now looks like there are two separate issues with dialplan processing. One is a memory leak when a process continuously executes DLR scripts. The second is a bug in the DLR that causes it to stop processing scripts altogether, possibly the result of an exception/stack overflow in a script. The memory leak issue has been resolved by recycling the App Server Workers when they reach 150MB. An additional mechanism is now needed to recycle the process if script executions fail.
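To make the recycling logic concrete, here is a hedged sketch of a dispatcher-style watchdog that restarts a worker process when its memory passes a threshold or when a periodic health check fails. It uses the psutil library and a hypothetical worker command and health check; it is not the actual Call Dispatcher code, which is C# and uses a real dialplan execution as its probe.

```python
# Hedged sketch of a watchdog that recycles a worker process when its memory
# exceeds a threshold or a periodic health check fails. Uses psutil and a
# hypothetical worker command/health check; not the actual Call Dispatcher.
import subprocess
import time
import psutil

WORKER_CMD = ["python", "app_server_worker.py"]   # hypothetical worker
MEMORY_LIMIT_MB = 150
CHECK_INTERVAL_SECS = 10

def health_check(proc):
    # Placeholder: the real check would submit a trivial dialplan script and
    # confirm it executes; here we only confirm the process is still alive.
    return proc.poll() is None

def recycle(proc):
    proc.terminate()
    try:
        proc.wait(timeout=30)
    except subprocess.TimeoutExpired:
        proc.kill()
    return subprocess.Popen(WORKER_CMD)

def run_watchdog():
    proc = subprocess.Popen(WORKER_CMD)
    while True:
        time.sleep(CHECK_INTERVAL_SECS)
        if proc.poll() is not None:
            # Worker died on its own: start a fresh one.
            proc = subprocess.Popen(WORKER_CMD)
            continue
        mem_mb = psutil.Process(proc.pid).memory_info().rss / (1024 * 1024)
        if mem_mb > MEMORY_LIMIT_MB or not health_check(proc):
            proc = recycle(proc)

if __name__ == "__main__":
    run_watchdog()
```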
Approximately an hour ago the Amazon instance hosting the sipsorcery service froze/crashed. It happened as I was attempting to clean up the cdr database table by deleting the older CDR records. There were nearly half a million call records in the table, which was hampering the ability to scroll through them in the Silverlight interface. It was a fairly innocuous operation but unfortunately it has had some significant consequences.
The bad news is that at this point it looks like all the data from the sipsorcery service has been lost. That includes all user accounts, SIP providers, dial plans etc. The reason it has been lost is that I made a mistake when setting up the MySQL database and the data directory was sitting on the C: drive and not the F: drive. The F: drive in this case is an Amazon Elastic Block Store (EBS) volume, which survives across instance reboots. Somehow when I installed MySQL I configured it so that only the directory for the MySQL binaries was on the F: volume while the data directory was on the C: drive.
The good news is the service is back up and running (now with the MySQL database on the F: drive). It’s a relatively quick operation to recover in the event of a crash like this since the IP address and volume can be re-attached to a new instance very easily.
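As a sanity check after a move like this, the data directory location can be confirmed programmatically rather than by inspecting the MySQL configuration by hand. A hedged sketch using the pymysql library, with placeholder credentials and the expected drive letter as an assumption:

```python
# Hedged sketch of verifying which drive the MySQL data directory lives on,
# using the pymysql library. Credentials and the expected drive letter are
# placeholders for illustration.
import pymysql

EXPECTED_PREFIX = "F:"   # the EBS-backed volume

def check_datadir():
    conn = pymysql.connect(host="localhost", user="root", password="secret")
    try:
        with conn.cursor() as cursor:
            cursor.execute("SHOW VARIABLES LIKE 'datadir'")
            _, datadir = cursor.fetchone()
    finally:
        conn.close()
    ok = datadir.upper().startswith(EXPECTED_PREFIX)
    print(f"datadir={datadir} {'OK' if ok else 'WARNING: not on the EBS volume!'}")
    return ok

if __name__ == "__main__":
    check_datadir()
```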
Unfortunately anyone wanting to use the sipsorcery service will need to re-create their account and dialplans etc. I realise it’s a real pain; I lost all my settings as well, so I’m suffering too.
Sorry,
Aaron