An outage of the sipsorcery service occurred for almost exactly 48 hours between the 11th and 13th of November. The exact cause of the outage is not known, but it appears to be the same one behind the previous 5 outages, the most recent of which was on the 22nd of October. I’m pretty sure the issue is at the operating system level and possibly something to do with the Windows virtualisation configuration being used by Amazon’s EC2 cloud. I’ve had a ticket open on the issue with Amazon since the first instance, but they have not been able to identify anything wrong and apparently the issue isn’t occurring for anyone else. The last message in the Windows event log prior to this and the other outages is along the lines of:
11/12/2009 11:27:18 AM: EventLogEntry: Error 11/12/2009 11:27:13 AM Dhcp Your computer has lost the lease to its IP address 10.248.58.129 on the
Network Card with network address 123139023573.
That is seemingly fairly clear-cut, but neither I nor Amazon support have been able to work out why the DHCP lease attempt fails. In addition, since the last incident I have turned on logging for the sipsorcery server’s Windows firewall to see if it could shed any further light on the problem. The log shows a big gap of over 7 hours with no messages at all, which I would guess means the network subsystem had been shut down altogether. The rest of the time there are a lot of connections being established to the DNS server, so it’s a mystery why the sipsorcery SIP and other traffic could not be sent or received.
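For anyone wanting to look for the same kind of silence in their own firewall log, a minimal sketch in Python follows. It assumes the usual Windows firewall text log layout, where each entry starts with a `date time` pair such as `2009-01-01 08:00:00` and header lines start with `#`. The function name `find_gaps`, the one-hour threshold and the sample entries are all made up for illustration; they are not taken from the sipsorcery server.

```python
from datetime import datetime, timedelta

# Made-up sample entries, NOT from the real sipsorcery log.
SAMPLE_LOG = """#Version: 1.5
#Fields: date time action protocol src-ip dst-ip src-port dst-port
2009-01-01 08:00:00 ALLOW UDP 10.0.0.5 10.0.0.2 1050 53
2009-01-01 08:00:05 ALLOW UDP 10.0.0.5 10.0.0.2 1051 53
2009-01-01 15:30:00 ALLOW UDP 10.0.0.5 10.0.0.2 1052 53
2009-01-01 15:30:02 ALLOW UDP 10.0.0.5 10.0.0.2 1053 53
"""


def find_gaps(lines, min_gap=timedelta(hours=1)):
    """Return (last_seen, next_seen, duration) for every silence longer than min_gap."""
    gaps = []
    previous = None
    for line in lines:
        # Skip blank lines and the '#' header lines at the top of the log.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            stamp = datetime.strptime(fields[0] + " " + fields[1], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # Not an entry in the assumed format; ignore it.
        if previous is not None and stamp - previous > min_gap:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


for start, end, duration in find_gaps(SAMPLE_LOG.splitlines()):
    print(f"no entries between {start} and {end} ({duration})")
```

A gap only shows that nothing was logged, not why, so it is a hint about when the network stack went quiet rather than a diagnosis.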
As to why I wasn’t around to fix it: I was on a 3 day break and, more by design than chance, happened to be somewhere where there was no electricity grid, let alone mobile signal or internet.
I wasn’t expecting an incident in the 3 days I was away, as statistically they have been averaging about one a month and it would be unlucky for that one time to coincide with me being away. Unfortunately that’s what happened.
As to what’s being done about it, the answer is in the previous post about incorporating Amazon SimpleDB as the storage layer. Without repeating that and earlier posts: once that job is done it will be possible to have two redundant sipsorcery servers running, so if an operating system incident like this occurs on one server then the other will still be available. It’s a big job and goes beyond just switching the data access layer software. For example, a number of the sipsorcery services, such as the monitoring, need to be aware of the different instances running on each server. I’ve been working on these tasks flat out for over 2 months now and am getting there.
The other question that could be asked is why stick with Amazon’s EC2 if this issue is at the OS layer and Amazon support can’t help identify it. That is something I have pondered a fair bit as well. The Amazon EC2 instances aren’t that cheap at the end of the day, and there are other compute cloud environments out there. However, the Amazon EC2 infrastructure is the oldest and therefore most mature of the clouds, and it also has by far the best strategy, with new services being regularly introduced. I also suspect that shifting to another cloud could just as easily introduce the same sort of operational issue, and given the amount of effort I have already put into working with the Amazon infrastructure it’s definitely a case of “better the devil you know”.
Finally, this does really highlight how vulnerable the sipsorcery service is due to having only one developer/administrator. This particular issue is solved by a reboot of the server, but it’s not as simple as giving someone a username and password so they can remotely access and reboot it. Anyone with that access can potentially gain access to all the sipsorcery user information, so it needs to be a suitably trusted person. Ideally what I’m hoping for is a C# developer with an interest in SIP/VoIP to come along. Once a level of trust has been established, and they have shown they understand the technology well enough that they don’t go rebooting every time someone posts about an ATA issue, that person would be given admin rights to the sipsorcery server(s). That being said, I’m open to any other suggestions about how the sipsorcery service could be run or administered for the benefit of everyone, provided any such suggestion takes into account the need for a high level of trust and security.
- Monitoring, heartbeat etc: The sipsorcery server is externally monitored by a completely separate virtual server running in Dublin, Ireland (it’s an extra job on the blueface.ie monitoring server). SIP OPTIONS requests are sent to the server every 5 seconds, and I get an SMS and email whenever it does not respond to 10 consecutive requests. Most of the time I will then investigate the issue to check if it’s transient, network related or some other anomaly and then, if needed, reboot the server. Automatic server reboots based on external conditions are a BAD idea. One, it means the issue will be left unresolved since it’s easier to just let the reboot handle it, and two, the server can end up in an endless reboot cycle if an unforeseen combination of circumstances occurs.
- DHCP: The Amazon EC2 (Elastic Compute Cloud) allocates dynamic IP addresses via DHCP to all virtual hosts. There is no way to circumvent DHCP with EC2. In the past static IPs were available, and it was actually a bit of a headache to modify sipsorcery to work behind the Amazon NAT since the SIP protocol is very inept at dealing with NAT. There is also no way to check network cards, cables etc, at least not by anyone except Amazon’s data centre staff. The server sipsorcery runs on is a virtual instance that shares the underlying physical hardware with other virtual instances on Amazon’s EC2. According to the support ticket I logged with Amazon, the physical hardware has been checked and it is operating correctly. As to why the same DHCP issue keeps cropping up, neither they nor I know, but my bet would be that it’s software not hardware related.
- 3rd party registrations disabled: A number of people have noted that when the sipsorcery server came back up a number of their 3rd party registrations had been disabled, with an error message that the provider host could not be resolved in DNS. This behaviour is by design and is necessary. I still find it amazing what ends up in certain fields for provider information, and invalid and non-existent hostnames can result in a lot of unnecessary work by the sipsorcery registration agent. In this case the disabled providers had genuine host names, but because of the networking issue on the Amazon EC2 instance, DNS resolutions appear to have been sporadically failing and providing false results to the sipsorcery registration agent.
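To make the alerting rule from the monitoring note above concrete (a probe every 5 seconds, an alert after 10 consecutive missed responses), here is a minimal sketch in Python. It is not the code running on the monitoring server: the names `ConsecutiveFailureAlarm` and `monitor` are hypothetical, and `probe` and `notify` are placeholder callables standing in for "send a SIP OPTIONS request and wait for a reply" and "send the SMS and email". Note that the alarm only notifies a person and never reboots anything.

```python
import time
from typing import Callable, Optional


class ConsecutiveFailureAlarm:
    """Fires once when `threshold` probes in a row have failed (hypothetical helper)."""

    def __init__(self, threshold: int = 10) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True only on the probe that reaches the threshold."""
        if ok:
            self.failures = 0  # Any successful response resets the count.
            return False
        self.failures += 1
        return self.failures == self.threshold


def monitor(
    probe: Callable[[], bool],       # Placeholder: True if the server answered the probe.
    notify: Callable[[str], None],   # Placeholder: alert a human, never reboot.
    interval_seconds: float = 5.0,
    threshold: int = 10,
    max_probes: Optional[int] = None,            # None means run forever.
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    alarm = ConsecutiveFailureAlarm(threshold)
    sent = 0
    while max_probes is None or sent < max_probes:
        if alarm.record(probe()):
            notify(f"no response to {threshold} consecutive probes")
        sent += 1
        sleep(interval_seconds)
```

With these numbers an alert goes out a little under a minute after the server stops answering, and because the counter must reach the threshold again after a successful probe, a continuing outage produces one alert rather than one every 5 seconds.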