What went wrong this time

An outage of the sipsorcery service occurred for almost exactly 48 hours between the 11th and 13th of November. The root cause is still not known, but the symptoms are the same as in the previous 5 outages, the most recent of which was on the 22nd of October. I’m pretty sure the issue is at the operating system level and possibly has something to do with the Windows virtualisation configuration being used by Amazon’s EC2 cloud. I’ve had a ticket open on the issue with Amazon since the first instance, but they have not been able to identify anything wrong and apparently the issue isn’t occurring for anyone else. The last message in the Windows event log prior to this and the other outages is along the lines of:

11/12/2009 11:27:18 AM: EventLogEntry: Error 11/12/2009 11:27:13 AM Dhcp Your computer has lost the lease to its IP address 10.248.58.129 on the
Network Card with network address 123139023573.

That seems fairly clear-cut, but neither I nor Amazon support have been able to work out why the DHCP lease attempt fails. In addition, since the last incident I have turned on logging for the sipsorcery server’s Windows firewall to see if it could shed any further light on the problem. The log shows a gap of over 7 hours with no messages at all, which I would guess means the network subsystem had been shut down altogether. For the rest of the outage there are a lot of connections being established to the DNS server, so it’s a mystery why the sipsorcery SIP and other traffic could not be sent or received.

As to why I wasn’t around to fix it: I was on a 3 day break and, more by design than chance, happened to be somewhere with no electricity grid, let alone mobile signal or internet.

[googlemaps http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=bruny+island,+tasmania&sll=-43.493904,147.141352&sspn=0.27148,0.545197&ie=UTF8&rq=1&ev=zi&radius=13.67&hq=bruny+island,+tasmania&hnear=&ll=-43.493904,147.141352&spn=0.27148,0.545197&output=embed&w=425&h=350]

I wasn’t expecting an incident in the 3 days I was away. Statistically they have been averaging about one a month, so it would be unlucky for one to coincide with me being away. Unfortunately that’s what happened.

As to what’s being done about it, the answer is in the previous post about incorporating Amazon SimpleDB as the storage layer. Without repeating that and earlier posts: once that job is done it will be possible to have two redundant sipsorcery servers running, so if an operating system incident like this occurs the other server will still be available. It’s a big job and goes beyond just switching the data access layer software. For example, a number of the sipsorcery services, such as the monitoring, need to be aware of the different instances running on each server. I’ve been working on these tasks flat out for over 2 months now and am getting there.

The other question that could be asked is why stick with Amazon’s EC2 if this issue is at the OS layer and Amazon support can’t help identify it. That is something I have pondered a fair bit as well. The Amazon EC2 instances aren’t that cheap at the end of the day and there are other compute cloud environments out there. However, the Amazon EC2 infrastructure is the oldest and therefore most mature of the clouds and also has by far the best strategy, with new services being regularly introduced. I also suspect that shifting to another cloud could just as easily introduce the same sort of operational issue, and given the amount of effort I have already put into working with the Amazon infrastructure it’s definitely a case of “better the devil you know”.

Finally, this does really highlight how vulnerable the sipsorcery service is due to having only one developer/administrator. This particular issue is solved by a reboot of the server, but it’s not as simple as giving someone a username and password so they can remotely access and reboot it. Anyone with that access can potentially gain access to all the sipsorcery user information, so it needs to be a suitably trusted person. Ideally what I’m hoping for is a C# developer with an interest in SIP/VoIP to come along. Once a level of trust has been established, and they have shown they understand the technology well enough that they don’t go rebooting every time someone posts about an ATA issue, that person would be given admin rights to the sipsorcery server(s). That being said, I’m open to any other suggestions about how the sipsorcery service could be run or administered for the benefit of everyone, provided any such suggestion takes into account the need for a high level of trust and security.

Addendums

    Monitoring, heartbeat etc: The sipsorcery server is externally monitored by a completely separate virtual server running in Dublin, Ireland (it’s an extra job on the blueface.ie monitoring server). I get an SMS and email whenever the server does not respond to 10 consecutive SIP OPTIONS requests, which are sent to it every 5 seconds. Most of the time I will then investigate to check whether the issue is transient, network related or some other anomaly and then, if needed, reboot the server. Automatic server reboots based on external conditions are a BAD idea. One, it means the issue will be left unresolved since it’s easier to just let the reboot handle it, and two, the server can end up in an endless reboot cycle if an unforeseen combination of circumstances occurs.
    DHCP: The Amazon EC2 (Elastic Compute Cloud) allocates dynamic IP addresses via DHCP to all virtual hosts. There is no way to circumvent DHCP with EC2. In the past static IPs were available, and it was actually a bit of a headache for sipsorcery to be modified to work behind the Amazon NAT since the SIP protocol is very inept at dealing with it. There is also no way to check network cards, cables etc, at least not by anyone except Amazon’s data centre staff. The server sipsorcery runs on is a virtual instance that shares the underlying physical hardware with other virtual instances on Amazon’s EC2. According to the support ticket I logged with Amazon, the physical hardware has been checked and is operating correctly. As to why the same DHCP issue keeps cropping up, neither they nor I know, but my bet would be that it’s software rather than hardware related.
    3rd party registrations disabled: A number of people have noted that when the sipsorcery server came back up, a number of their 3rd party registrations had been disabled with an error message that the provider host could not be resolved in DNS. This behaviour is by design and is necessary. I still find it amazing what ends up in certain fields for provider information, and invalid and non-existent hostnames can result in a lot of unnecessary work by the sipsorcery registration agent. In this case the disabled providers had genuine host names, but because of the networking issue on the Amazon EC2 instance, DNS resolutions appear to have been sporadically failing and providing false results to the sipsorcery registration agent.
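The alerting rule in the monitoring addendum (10 consecutive missed SIP OPTIONS probes at 5 second intervals, so roughly 50 seconds of silence before an alert) can be sketched as below. This is only an illustration and not the code running on the monitoring server: the class and function names are hypothetical, and the probe itself is left out so that only the counting logic is shown.

```python
from typing import Iterable, List

PROBE_INTERVAL_SECONDS = 5   # from the post: one OPTIONS request every 5 seconds
FAILURE_THRESHOLD = 10       # from the post: alert after 10 consecutive misses


class ConsecutiveFailureMonitor:
    """Hypothetical helper: counts consecutive failed probes."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, probe_succeeded: bool) -> bool:
        """Record one probe result; return True only when an alert should be sent."""
        if probe_succeeded:
            # Any successful response resets the run of failures.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Alert exactly once, on the probe that reaches the threshold.
        return self.consecutive_failures == self.threshold


def alert_indices(results: Iterable[bool],
                  threshold: int = FAILURE_THRESHOLD) -> List[int]:
    """Return the positions in a sequence of probe results where an alert fires."""
    monitor = ConsecutiveFailureMonitor(threshold)
    return [i for i, ok in enumerate(results) if monitor.record(ok)]
```

Note that the sketch only raises an alert. In line with the addendum, the decision to reboot stays with a person.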
  1. deeknow’s avatar

    Aaron

    thanks for your efforts; it’s a pity this outage coincided with your time away, but that’s life.

    Happy to have you back, happy to have SS up and running again, and I support your decision to continue with Amazon until you have more information about the cause of the recent outage.

    cheers

  2. hongkongpom’s avatar

    Hi Aaron,

    Could a heartbeat service be installed on the server that would enable a remote reboot?

    Thanks for all the blood, sweat & tears over supporting this great system. I can see that for the most part it’s a thankless task, but on behalf of all of us, THANKYOU VERY MUCH INDEED FOR YOUR HARDWORK!

  3. benifa’s avatar

    Thank you Aaron. 🙂

  4. hongkongpom’s avatar

    Seems there are still some teething problems with the SIP signalling. I tried a test call back to my landline, and it took about a minute for my phone to start ringing, so it seems there is delay in sending the SIP invites to my VSP.

    Speaking of VSPs, I noticed that two of my VSPs had dropped their registration so I had to re-register them.

  5. Gleb’s avatar

    Aaron, maybe you need to turn off DHCP and use a static address? For example, my ASUS SV-1 Skype videophone can’t cope with DHCP: it loses its connection after several hours when it is configured as a DHCP client. After configuring it with a static IP address the problem was gone.

    1. sipsorcery’s avatar

      Unfortunately static IPs are not available on Amazon’s EC2 cloud; it’s dynamic only.

    2. ptheys’s avatar

      Hi Aaron,

      Thanks for all your effort man, I’m very happy with all work you have been doing. I know that you’re not planning to move out from Amazon EC2, but sometime ago I was looking for a place to host a customer’s application and we choose GoGrid services. We’re hosting that application for 6 months without an unexpected outage.

      Abraço!

      1. sipsorcery’s avatar

        6 months with no issues, that’s tempting… I had a quick look at GoGrid. Do you know if they give full access to the Windows OS, can Windows Services be installed? Did you install a database and was it redundant? I’ll find the answers myself tomorrow if needs be just being a bit lazy :).

      2. hongkongpom’s avatar

        Ok, so this is a DHCP issue.

        Some suggestions:

        Do other PCs on the same network lose their IP address?

        What is the DHCP renew time. Does the server go offline after the first DHCP expiry or does it make it through a few times?

        Can you try another NIC or could it be the drivers?

        Can you configure it with a static IP as servers shouldn’t use DHCP in order that another box isn’t relied on to get network connectivity.

      3. hongkongpom’s avatar

        The SIP invites are going quicker now. Must have been the server bogged down with user agents all registering at once as well as new calls being put through.

        1. sipsorcery’s avatar

          It wasn’t the server; its average utilisation is currently under 30%. I observed some issues connecting as well though. Running a ping from my home connection in Australia showed a bit of packet loss and the remote desktop connection was dropping off a bit. Running a ping from Ireland (via a remote server I have access to there) had no packet loss. So it seems the connectivity problem was somehow network related, the question being whether it was the internet or Amazon’s network. Something else I need to keep an eye on.

        2. UK_101’s avatar

          Welcome back from your break. We all need to take time out once in a while.

          With reference to the open ticket with Amazon, as they are aware of the issue, is it possible that they could be persuaded to monitor the SS server and reboot it?

        3. arshad’s avatar

          I’ve been using MSS and later SS for some time. Very happy with the feature and functionality provided. Thanks for all the good work!

          Re: recent outage, based on the error message, it is definitely an issue with the network connectivity. From a quick search on the error, there are suggestions to check the network cable, connection, network card and drivers. Did it crash with a blue screen?

          My work place has similar setup where we’ve 100 of websites running on VM. Things do go wrong from time to time, we use heart beat monitoring to reset the server if in error state.

          I would also like to offer my help to reset the server if there is an error condition. I don’t have background in C#, but had programming experience with C and C++. Been experimenting with SIP/VoIP for few years now.

          Again, thanks for all the effort to bring this free product to the world!

        4. OTech’s avatar

          Should we need a service run on windows box and monitor the SS?
          If, for example user registration drop rapidly or dial-plan execution takes too long or no internet, then the windows box will auto restart!

        5. frenchfry’s avatar

          Thanks for the service. I like the idea of the backup server. Good luck on implementing it. That still will not keep up time to 100%, but it will help a lot.

        6. hongkongpom’s avatar

          Good news…the lag to connect the call is gone 🙂

        7. Mike Telis’s avatar

          Regarding disabling of 3rd party registrations when the agent can’t resolve SIP proxy name, wouldn’t it be wiser to keep trying (say, every 10 minutes) for about an hour?

          The algorithm could be like this:

          Check if the name is in the list of unresolved domains. If yes, check the time of last check and if it’s less than 10 minutes ago, exit. If name is not in the list, try resolving it. If success, remove name from the list and proceed with registration. If DNS entry not found, add to the list and/or update time the name was checked.

          If it’s been more than, say, 1 hour since the first check, disable registration and remove name from the list.

          If you implemented something like this, it wouldn’t require much changes to the rest of registration agent code. All requests to the same SIP proxy server name will be postponed for 10 minutes after 1st unsuccessful attempt.

          Sincerely,

          Mike

          1. sipsorcery’s avatar

            I replied to your post on the forums re this topic. You are right that it would be easy to adjust the registration agent to operate the way you have described; in fact it would mean changing just two constants. However, I prefer the current mechanism of retrying every 3 minutes and disabling the registration if 6 consecutive DNS resolution failures occur. Most DNS failures are transient and resolve themselves in under a minute, so 10 minutes between retries is too long.
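As a rough illustration of the retry policy described in this reply, the sketch below tracks consecutive DNS failures per host and signals when a registration should be disabled. It is not the registration agent's actual code: the class name, method names and data layout are hypothetical, and it assumes the "two constants" are the retry interval and the consecutive-failure limit.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

RETRY_INTERVAL_SECONDS = 3 * 60   # from the reply: retry every 3 minutes
MAX_CONSECUTIVE_FAILURES = 6      # from the reply: disable after 6 consecutive failures


@dataclass
class DnsFailureTracker:
    """Hypothetical tracker for hosts whose DNS lookups are failing."""
    retry_interval: float = RETRY_INTERVAL_SECONDS
    max_failures: int = MAX_CONSECUTIVE_FAILURES
    # host -> (consecutive failure count, time of the last failed attempt)
    _failures: Dict[str, Tuple[int, float]] = field(default_factory=dict)

    def should_attempt(self, host: str, now: float) -> bool:
        """True if the host has no recorded failure or the retry interval has elapsed."""
        if host not in self._failures:
            return True
        _, last_failure = self._failures[host]
        return now - last_failure >= self.retry_interval

    def record(self, host: str, resolved: bool, now: float) -> bool:
        """Record a lookup result; return True if the registration should be disabled."""
        if resolved:
            # A successful lookup clears the failure history for the host.
            self._failures.pop(host, None)
            return False
        count = self._failures.get(host, (0, now))[0] + 1
        if count >= self.max_failures:
            self._failures.pop(host, None)
            return True
        self._failures[host] = (count, now)
        return False
```

With the defaults, a host that never resolves is disabled on the sixth failed attempt, 15 minutes after the first. Constructing the tracker with `retry_interval=600` would approximate the 10 minute variant proposed in the parent comment.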

            Aaron
