Outage 22nd Oct 2009

The sipsorcery server had an outage yesterday. Based on the logs, the outage lasted about 5 hours, from 1553 UTC until 2056 UTC.

The cause of the outage appears to be that the Amazon EC2 instance the sipsorcery servers run on lost network connectivity. This is the 3rd (possibly 4th) time this has happened and I’ll be putting a ticket in to Amazon support to see if there is any more information about it since the last ticket.

While it’s annoying and the frequency of the incidents is way too high, things do go wrong and servers do crash. For the sipsorcery service to become more reliable it will need to be able to cope with losing a server instance. The work I have been doing on incorporating Amazon SimpleDB as sipsorcery’s data repository is with precisely that goal in mind. It will provide a scalable, reliable (hopefully more so than the EC2 instance) and shared data layer that two independent sipsorcery instances can both use. If one instance drops off for whatever reason, the other one would still be available. With the EC2 cloud having expanded into Europe, one sipsorcery instance could run in Amazon’s European data centre and the other in the US one, which would hopefully make it very unlikely that both instances would have an operating system or hardware issue simultaneously.
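
As a rough illustration of why a second independent instance helps, here is a back-of-the-envelope sketch in Python. The 99.8% per-instance availability is an assumed figure for illustration only, not a measurement of the sipsorcery server, and the calculation assumes the two instances fail independently, which a shared software bug or a SimpleDB outage would violate.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(per_instance: float, instances: int = 2) -> float:
    """Probability that at least one instance is up, assuming independent failures."""
    return 1.0 - (1.0 - per_instance) ** instances


# Assumed figure for illustration only: each instance is up 99.8% of the time.
single = 0.998
pair = combined_availability(single, 2)

print(f"one instance:  {(1 - single) * HOURS_PER_YEAR:.1f} hours down per year")
print(f"two instances: {(1 - pair) * HOURS_PER_YEAR * 60:.1f} minutes down per year")
```

Under those assumptions the expected downtime drops from hours per year to minutes per year, which is the reasoning behind running one instance in Europe and one in the US.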

There was a bit of Murphy’s Law with this outage as well. I do have monitoring set up for the sipsorcery.com server and get sent an SMS if it stops responding. Last night, just as I was going to bed, my phone was giving those annoying beeps to indicate the battery was low, and since I couldn’t be bothered to go and find the recharger I turned it off until morning. Of course 3 hours later the sipsorcery instance lost its network connectivity and an SMS was sent to a phone that was switched off. Apart from that, it’s debatable whether I would, one, hear and, two, get up and check an SMS that arrived at 0300, but going on recent history I probably would. I was up briefly at 0500 to give my daughter back her dummy so I would have spotted it then as well. But as it happened I didn’t become aware of sipsorcery being down until around 0745 when I checked my office phones and saw they weren’t registered. Ten minutes and a reboot later all was back to normal (thankfully EC2 instances can be rebooted through a web browser, as there was no other way to communicate with the sipsorcery one).
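
The weak point above is a single notification channel. A minimal sketch of the alternative, assuming a monitor that tries several channels in turn, might look like the following. The function names, the `probe` callable and the stand-in channels are all hypothetical; nothing here reflects how the actual sipsorcery monitoring is implemented.

```python
from typing import Callable, Iterable, List


def check_and_alert(
    probe: Callable[[], bool],
    notifiers: Iterable[Callable[[str], bool]],
    attempts: int = 3,
) -> List[int]:
    """Probe the server; if every attempt fails, try each notifier in turn.

    Returns the indexes of the notifiers that reported a successful delivery.
    """
    for _ in range(attempts):
        if probe():
            return []  # the server answered, so there is nothing to send

    delivered = []
    for index, notify in enumerate(notifiers):
        try:
            if notify("sipsorcery server is not responding"):
                delivered.append(index)
        except Exception:
            # A broken channel (phone off, mail server down) must not stop the others.
            continue
    return delivered


# Hypothetical stand-ins for a real probe and real channels.
def server_probe() -> bool:
    return False  # pretend the server is unreachable


def sms(message: str) -> bool:
    return False  # pretend the phone is switched off


def email(message: str) -> bool:
    return True


print(check_and_alert(server_probe, [sms, email]))  # prints [1]
```

Returning which channels succeeded lets the caller decide whether to keep retrying; a real monitor would also need to avoid sending the same alert on every polling cycle.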

The above paragraph is not what you want to read when considering the support arrangements of your VoIP service but that’s why it’s free :).

  1. gunnelsunder’s avatar

    It’s a great service and a very impressive piece of work, Aaron, despite occasional incidents. It’s not clear that paid-for services are always more reliable, either. Having said that, the outage did set me thinking about some fault-tolerance in my setup – SIP Sorcery had apparently become a single point of failure…
    Thanks for the update, and keep up the good work!

  2. bpere’s avatar

    I agree with the comment above, i.e. despite the occasional glitches, I am impressed by the service level, which IMHO is higher than many other similar services that I actually paid for in the past. The only suggestion I could make is if there was a way to set up automatic email or SMS notifications for those users that would like to have real time information on service availability thus enabling them to act accordingly.

    1. sipsorcery’s avatar

      The automatic notification is a tricky one. I’d rather not put in a system that “pushes” notifications as that could snowball. One thing which just sprang to mind is Twitter. I’ve never used it, but I have seen around the traps that it has an API, which may mean the sipsorcery monitoring service could send a Twitter message when something goes wrong.

      At the moment I’m still focused on the SimpleDB implementation. It’s a major goal to get the sipsorcery service to five nines reliability. The Twitter notification would be a perfect task for a new sipsorcery contributor. If you know any C# developers at a loose end send them sipsorcery’s way :).

      1. UK_101’s avatar

        I happened to be near my router when SS went away (16:35 BST) and heard its relay click off, so I had a sort of automatic notification. An attempt to log on to SS failed, so I fired up my softphone and logged on directly to all my VSPs. Perhaps those who use SS for serious telephonic communication could simply monitor their own VoIP phone/router display and use alternatives?

        For casual and personal use it’s not too much of an issue if SS dies. If there’s no client logged on all my VSPs will take a message and send an email, so I would know soon enough if someone had been trying to call me. Obviously outgoing calls via SS would fail.

      2. sipsorcery’s avatar

        The solution is really to improve the service to five nines reliability. That’s not particularly easy to do but it’s also not rocket science; the blueface.ie SIP services I was involved in setting up currently exceed it. What makes it particularly difficult with sipsorcery is:

        1. Budget. Having a secondary EC2 instance ready solely for failover is too expensive. A load balancing instance is feasible since that means the current instance can be switched back to a lesser capacity one. The SimpleDB work is a prerequisite for the load balanced instance,

        2. Cutting edge software. The biggest issue with reliability over the last 15 months has been the memory leaks and other failures in the IronRuby engine. The IronRuby software has not even had an alpha release so those types of issues are to be expected. It’s taken the 15 months to come up with a solution that allows the sipsorcery application server to accommodate the IronRuby failures,

        3. No 24 x 7 support. As pointed out in the blog post the sipsorcery monitoring solution consists of an SMS to my mobile which may or may not be on and which I may or may not hear.

        Despite those challenges I still believe the sipsorcery software and deployment can be made good enough to overcome them. It gives me something to aim for.
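
To put the five nines target in perspective, here is a small worked calculation in Python. It uses only the outage window reported in the post (1553 UTC to 2056 UTC) and the usual definition of five nines as 99.999% availability; the function name is made up for the example.

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Outage window from the post: 1553 UTC until 2056 UTC on the same day.
outage = timedelta(hours=20, minutes=56) - timedelta(hours=15, minutes=53)
outage_minutes = outage.total_seconds() / 60

budget = downtime_budget_minutes(0.99999)
years_of_budget = outage_minutes / budget

print(f"five nines allows about {budget:.2f} minutes of downtime per year")
print(f"this outage lasted {outage_minutes:.0f} minutes")
print(f"that is roughly {years_of_budget:.0f} years of five nines budget")
```

Five nines leaves a little over five minutes of downtime per year, so a single outage of this length uses up decades of that allowance. That is why an unattended SMS alert followed by a manual reboot cannot get there on its own.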

      3. XP1’s avatar

        Is sipsorcery still down? I can’t connect using my ATA device. Yesterday was the first time I set up sipsorcery, so I am unsure if it was a configuration problem or server problem.

        1. sipsorcery’s avatar

          No it’s up and has been uninterrupted since the reboot at UTC 0800 22 Oct 2009.

          1. XP1’s avatar

            Thanks, it must be a configuration problem on my side.

          2. Wilson’s avatar

            Shit happens….C’est la vie….Keep up the good work!
            Thanks.

          3. UK_101’s avatar

            Not sure if this is an appropriate place for this.

            05:30 Noticed my router’s VoIP LED was off, checked with sipsorcery and it showed both lines were bound OK. Checked the router log and saw “401 unauthorised”. Started a softphone which registered OK.
            06:00 Was just about to reboot router when it registered. Was something reset?

            1. XP1’s avatar

              SIPSorcery was down for me yesterday, but I just tried again, and now it works.

              Thanks for letting me know.
