An outage of the sipsorcery service occurred for almost exactly 48 hours between the 11th and 13th of November. The exact cause is not known, but it is the same as in the previous 5 outages, the most recent of which was on the 22nd of October. I’m pretty sure the issue is at the operating system level and possibly has something to do with the Windows virtualisation configuration used by Amazon’s EC2 cloud. I’ve had a ticket open with Amazon since the first occurrence, but they have not been able to identify anything wrong and apparently the issue isn’t occurring for anyone else. The last message in the Windows event log prior to this and the other outages is along the lines of:
11/12/2009 11:27:18 AM: EventLogEntry: Error 11/12/2009 11:27:13 AM Dhcp Your computer has lost the lease to its IP address 10.248.58.129 on the
Network Card with network address 123139023573.
That seems fairly clear-cut, but neither I nor Amazon support have been able to work out why the DHCP lease attempt fails. Since the last incident I have also turned on logging for the sipsorcery server’s Windows firewall to see if it could shed any further light on the problem. The log shows a gap of over 7 hours in which no messages were recorded, which I would guess means the network subsystem had been shut down altogether. For the rest of the time, though, a lot of connections were being established to the DNS server, so it’s a mystery why the sipsorcery SIP and other traffic could not be sent or received.
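For anyone wanting to look for the same kind of silent period in their own firewall log, here is a minimal sketch in Python. It is not the tool I used. It assumes each log entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp and that header lines start with `#`. The function name `find_log_gaps`, the one hour threshold and the sample lines are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, min_gap=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive entries are more than min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        line = line.strip()
        # Skip blank lines and header lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        try:
            # Assumption: the first 19 characters are "YYYY-MM-DD HH:MM:SS".
            current = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # Not a timestamped entry, ignore it.
        if previous is not None and current - previous > min_gap:
            gaps.append((previous, current))
        previous = current
    return gaps


if __name__ == "__main__":
    # Made-up sample entries, not taken from the real sipsorcery log.
    sample = [
        "#Fields: date time action protocol src-ip dst-ip",
        "2009-11-12 11:20:00 ALLOW UDP 10.0.0.1 10.0.0.2",
        "2009-11-12 11:27:18 ALLOW UDP 10.0.0.1 10.0.0.2",
        "2009-11-12 18:40:02 ALLOW UDP 10.0.0.1 10.0.0.2",
    ]
    for start, end in find_log_gaps(sample):
        print(f"No entries between {start} and {end} ({end - start})")
```

A gap like this only shows that nothing was logged. It does not say why, so it is a starting point for matching against the event log rather than a diagnosis.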
As to why I wasn’t around to fix it: I was on a 3 day break and, more by design than chance, happened to be somewhere with no electricity grid, let alone mobile signal or internet.
I wasn’t expecting an incident in the 3 days I was away. Statistically they have been averaging about one a month, so it would be unlucky for one to coincide with me being away, but unfortunately that’s what happened.
As to what’s being done about it, the answer is in the previous post about incorporating Amazon SimpleDB as the storage layer. Without repeating that and earlier posts: once that job is done it will be possible to run two redundant sipsorcery servers, so if an operating system incident like this occurs on one, the other will still be available. It’s a big job and goes beyond just switching the data access layer software. For example, a number of the sipsorcery services, such as the monitoring, need to be aware of the different instances running on each server. I’ve been working on these tasks flat out for over 2 months now and am getting there.
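To illustrate what “aware of the different instances” means for monitoring, here is a rough Python sketch. It is not the actual sipsorcery monitoring code. The function names, the host names and the `probe` callback (which would hold whatever real health check is used) are hypothetical.

```python
def check_instances(servers, probe):
    """Probe every server and return a dict of server -> True/False."""
    status = {}
    for server in servers:
        try:
            status[server] = bool(probe(server))
        except OSError:
            # A network error while probing counts as that instance being down.
            status[server] = False
    return status


def service_is_up(status):
    """The service as a whole is available if at least one instance responds."""
    return any(status.values())


if __name__ == "__main__":
    # Hypothetical host names and a stand-in probe that pretends one server is down.
    servers = ["sip1.example.com", "sip2.example.com"]
    status = check_instances(servers, lambda s: s != "sip1.example.com")
    print(status, "service up:", service_is_up(status))
```

The point of the sketch is that the monitor has to report per instance as well as overall: with two servers the service can be up while one server still needs a reboot.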
The other question that could be asked is why stick with Amazon’s EC2 if this issue is at the OS layer and Amazon support can’t help identify it. That is something I have pondered a fair bit as well. The Amazon EC2 instances aren’t that cheap at the end of the day, and there are other compute cloud environments out there. However, the Amazon EC2 infrastructure is the oldest and therefore the most mature of the clouds, and it also has by far the best strategy, with new services being regularly introduced. I also suspect that shifting to another cloud could just as easily introduce the same sort of operational issue, and given the amount of effort I have already put into working with the Amazon infrastructure, it’s definitely a case of “better the devil you know”.
Finally, this does really highlight how vulnerable the sipsorcery service is due to having only one developer/administrator. This particular issue is solved by a reboot of the server, but it’s not as simple as giving someone a username and password so they can remotely access and reboot it. Anyone with that access can potentially get at all the sipsorcery user information, so it needs to be a suitably trusted person. Ideally, what I’m hoping for is a C# developer with an interest in SIP/VoIP to come along. Once a level of trust has been established, and they have shown they understand the technology well enough that they won’t go rebooting every time someone posts about an ATA issue, that person would be given admin rights to the sipsorcery server(s). That being said, I’m open to any other suggestions about how the sipsorcery service could be run or administered for the benefit of everyone, provided any such suggestion takes into account the need for a high level of trust and security.