November 2009


As a consequence of being stuck inside on a miserable Dublin December day and being away from my development machine I ended up spending the last few hours looking over some of the proposals that are around for Peer-to-Peer SIP (P2PSIP). My interest is mainly academic, I don’t have any work planned for sipsorcery in the area, but it does partly derive from contemplating how Goozimo will go about it in order to compete with Skype. One thing’s for sure: they won’t be wasting too much time with the proposed P2PSIP enhancements floating around in the IETF space, which take a bad situation, the existing bloated SIP standard, and make it much, much worse!

What’s my problem with the P2PSIP efforts? They’re building a house on foundations of sand, relying on a bloated set of SIP standards coupled with hacks, and even hacks on top of hacks, to overcome the original SIP standard’s shortcomings in dealing with NAT.

The core SIP standard document is 269 pages long and deals with six request types: ACK, BYE, CANCEL, INVITE, OPTIONS and REGISTER. There are additional standards to deal with what is considered core SIP functionality: REFER (aka transfers, 23 pages), INFO (9 pages), NOTIFY (38 pages) and more. Then there are the enhancement standards to fix the things the original SIP standard stuffed up: rport (13 pages) and PRACK (14 pages). And only then do the extensions and options, including P2PSIP, come in. The size and complexity of the core SIP standard and the excess of add-on SIP standards translate into big problems for implementors. A classic example is the SIP stack in the most popular VoIP server around, Asterisk: it has taken a massive, monolithic 27,000+ line C file to implement and has had some serious difficulties with even the core features; the best example is the 3+ years it took to write the SIP TCP implementation.

The P2PSIP efforts are taking a bad situation, the existing SIP standards and implementation difficulties, and building on top of it to make things worse. It’s not only server implementations like sipsorcery and Asterisk that are implementing SIP stacks but also the hundreds of SIP phone, ATA and softphone manufacturers. In order to make SIP features work a majority of implementors must put in the effort to implement and test them. As the effort required continues to snowball two things are likely to happen: alternative standards will be developed (Skype’s proprietary protocol is an example), and SIP device manufacturers will decide it’s all too hard and restrict themselves to the core standard and/or cherry-pick SIP add-on standards, creating big interoperability problems.

The other problem is that even with the reams of SIP and associated standards documents the NAT problem has not been solved for even the simplest call scenario; a very good and succinct explanation of the problem can be found here. As a consequence a further set of standards has sprung up to help SIP (or more correctly the media streams being initiated by SIP) cope with NAT: STUN, TURN and ICE being the most popular ones. The paradox is that in the most prevalent type of SIP call on the internet today, the call between an end-user and a SIP Provider, the NAT mechanisms are often completely ignored and instead a SIP Provider will reflect the media stream back to the socket the client sends from; not a particularly secure mechanism but a pretty robust one, and certainly far superior to those offered by STUN, TURN and ICE. So all these NAT standards really do is add to the implementation effort for SIP device manufacturers, and while they help in some cases they don’t definitively solve the problem and therefore end up creating more confusion for poor users trying to ascertain why their calls have one-way or no audio. A side effect I have observed of the failure of the NAT coping mechanisms is that public forums dealing with SIP services have people suggesting STUN settings for completely unrelated problems such as caller ID.
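
As an aside, the “reflect the media back to where it came from” approach (sometimes called symmetric RTP or latching) is conceptually very simple. The sketch below is purely illustrative, in C# since that’s what sipsorcery is written in, and is not sipsorcery’s actual code: the idea is to ignore the address advertised in the SDP and instead latch onto the socket the first RTP packet actually arrives from.

using System.Net;
using System.Net.Sockets;

// Rough sketch of symmetric RTP ("latching"): ignore the address advertised in the
// SDP and instead remember the socket the first RTP packet actually arrives from,
// then send all return media back to that socket.
class RtpLatcher
{
    private readonly UdpClient _socket;
    private readonly IPEndPoint _sdpAdvertised; // what the SDP claimed (often a private address)
    private IPEndPoint _remote;                 // latched remote media socket, unknown until the first packet

    public RtpLatcher(int localPort, IPEndPoint sdpAdvertised)
    {
        _socket = new UdpClient(localPort);
        _sdpAdvertised = sdpAdvertised;
    }

    public byte[] Receive()
    {
        IPEndPoint from = new IPEndPoint(IPAddress.Any, 0);
        byte[] packet = _socket.Receive(ref from);
        _remote = from; // latch onto whatever public socket the sender's NAT mapped it to
        return packet;
    }

    public void Send(byte[] payload)
    {
        // Prefer the latched socket; fall back to the SDP address until a packet has arrived.
        IPEndPoint dest = _remote ?? _sdpAdvertised;
        _socket.Send(payload, payload.Length, dest);
    }
}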

The foundations of sand consist firstly of an overbloated set of SIP standards that are difficult and error-prone to implement, and secondly of a set of standards to deal with NAT that are not required in the majority of cases and fail to work in a large number of the remaining ones.

Those are the foundations the P2PSIP efforts are proposing to build on. P2P networks are difficult to design; the biggest problem is how to bootstrap peers into the network. To overcome that problem most P2P designs are actually hybrid P2P networks that rely on a central server for a number of critical functions. Napster is the original example of the hybrid P2P network, and of its failure. Napster facilitated mp3 file sharing, largely illegal sharing as it turned out, between peers in the network, and because it relied on a central server to allow peers to join the network the authorities were able to easily shut it down. The networks that followed on from Napster, the likes of Gnutella, operated without a central server. Without a central server peer-to-peer networks are very difficult to shut down, which is why these types of networks are still around today and largely immune to the authorities.

The P2PSIP documents propose a new type of hybrid P2P network that relies on a central server for bootstrapping. The P2P network in P2PSIP is primarily a storage layer utilising a Distributed Hash Table (DHT) approach. The DHT replaces the function of a SIP Registrar and SIP Proxy in a traditional “client-server” SIP network (the inverted commas are because SIP is not a client-server protocol but user agents assume client or server roles for certain operations) and the DHT is used to store the contact details of peers that have joined the network. In theory it’s not a bad idea: SIP registrations are a big burden on traditional SIP networks and offloading them to a peer-to-peer network would seem to have merit. It comes down to a trade-off between the server load for SIP registrations and the complexity of implementing yet another new SIP standard in hundreds of SIP devices. If the P2PSIP proposals were restricted to SIP softphones, which have the advantage of operating on flexible general purpose hardware, then it would be a debate with merit, but the strength of SIP is its universality, and unless a new proposal is practical for IP phones, ATAs, mobile device softphones etc. then it should not even be considered. Another point: is a DHT even the best way to scale the storage of SIP location information? A standard called ENUM already exists that utilises DNS, a proven scalable storage service, for location information. SIP user agents already need more sophisticated than normal DNS stacks in order to process the SRV records that yet another supplementary SIP standard, A DNS RR for specifying the location of services (DNS SRV), relies on, and in this case it’s one that is already well supported by existing implementations.
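
For what it’s worth, the SRV handling a SIP user agent already needs isn’t onerous. The sketch below shows the RFC 2782 selection rule, lowest priority group first and then a weighted random pick within that group; the resolver call in the usage comment is hypothetical, the ordering logic is the point.

using System;
using System.Collections.Generic;
using System.Linq;

// An SRV record as returned for a query like "_sip._udp.sipwizard.net".
record SrvRecord(int Priority, int Weight, string Target, int Port);

static class SrvSelector
{
    // Pick the record to contact first: the lowest priority group wins, and records within
    // that group are chosen randomly in proportion to their weight (per RFC 2782).
    public static SrvRecord Select(IEnumerable<SrvRecord> records, Random rng)
    {
        var candidates = records.GroupBy(r => r.Priority)
                                .OrderBy(g => g.Key)
                                .First()
                                .ToList();

        int totalWeight = candidates.Sum(r => r.Weight);
        if (totalWeight == 0) return candidates[rng.Next(candidates.Count)];

        int pick = rng.Next(totalWeight);
        foreach (var rec in candidates)
        {
            pick -= rec.Weight;
            if (pick < 0) return rec;
        }
        return candidates[candidates.Count - 1];
    }
}

// Usage with a hypothetical resolver:
//   var records = dns.LookupSrv("_sip._udp.sipwizard.net"); // hypothetical call
//   var target = SrvSelector.Select(records, new Random());
//   // then connect to target.Target on target.Port.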

Going back to the original question pondered about how Goozimo will implement a peer-to-peer SIP mechanism in order to compete with Skype, my guess is that they won’t go near the current P2PSIP efforts with a barge pole. Scaling server side is something Google are experts at so they’ll handle the SIP registrations using the existing mechanism. How they will deal with media and NAT is the big question. In fact it’s always the question when it comes to SIP and NAT. The paradox in this case is that the solution won’t be found by only considering SIP and NAT: NATs are already deployed everywhere and have to be considered a fixture, and as discussed above SIP is already complicated enough. Instead the media layer, in SIP’s case usually RTP, has to get smarter. The solution is not TURN, which involves proxying the media, or even a Skype-like mechanism that uses a more scalable approach with super nodes doing the proxying. The media streams are only getting larger with video and conferences and it’s not scalable to proxy them; it’s also not desirable, as every extra hop the media goes through adds latency and potential degradation.

The solution is to introduce a mechanism into the media carrying protocols that makes them able to cope with NAT instead of ignoring it. It may mean the media protocol has to become aware of the signalling protocol, or at least some services offered by it, something which is undesirable from a design point of view with a clean separation of layers between protocols. However if it’s a means to fix the problem where all previous attempts have failed then violating a design principle is worth it. Apart from the time it has taken to write this paragraph I have not put any thought into what such a solution would even look like; perhaps some kind of broker service offered by the signalling layer where the media protocol could send a single rendezvous packet to the signalling server so that the public media socket is known and can then be used in the call request. Perhaps the signalling and media protocols could be multiplexed over a single socket, although that would be a big change and I suspect there would be a portion of NATs that would fail to cope properly with a single private socket mapping to multiple public sockets. Fingers crossed the engineers at Goozimo will come up with not just a solution but a good solution and then use Google’s muscle power to press it on the industry and make up for the abysmal vision of the SIP designers.
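
Purely to make that hand-waving a bit more concrete, here is a rough sketch of the single rendezvous packet idea: the media socket fires one packet at a broker assumed to sit alongside the signalling server, the broker echoes back the public IP and port it saw, and that socket is what goes into the call request. Everything here is invented for the illustration; it is not a proposal for an actual protocol.

using System.Net;
using System.Net.Sockets;
using System.Text;

// Sketch of a media-layer rendezvous: before a call is placed the media socket sends one
// packet to a broker assumed to sit alongside the signalling server; the broker replies
// with the public IP:port the packet arrived from, and that socket is what gets put in
// the call request instead of the private one.
class MediaRendezvous
{
    public static IPEndPoint DiscoverPublicSocket(UdpClient mediaSocket, IPEndPoint broker)
    {
        byte[] probe = Encoding.ASCII.GetBytes("RNDZ"); // made-up rendezvous packet
        mediaSocket.Send(probe, probe.Length, broker);

        IPEndPoint from = new IPEndPoint(IPAddress.Any, 0);
        byte[] reply = mediaSocket.Receive(ref from); // broker echoes back "ip:port" as it saw it

        string[] parts = Encoding.ASCII.GetString(reply).Split(':');
        return new IPEndPoint(IPAddress.Parse(parts[0]), int.Parse(parts[1]));
    }
}

// Because the same UdpClient is then used for the RTP, the NAT mapping the broker observed
// is the one the remote party's media will arrive on.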

I’ve been able to successfully configure two sipsorcery servers in a redundant configuration using Microsoft’s SQL Azure service as the database. That’s good news as it means in the near future it will be possible to switch the main sipsorcery.com service over and remove the risk of the service failing when a single Amazon EC2 instance fails.

By far the biggest challenge in making sipsorcery (and most other internet based services) reliable and scalable comes back to the database. It costs a lot of money and takes a lot of expertise to run ANY RDBMS in a reliable, scalable manner. It’s easy to get a single database instance up and running but once you need to start replicating, clustering and load balancing the headaches start.

The mysipswitch service used the Blue Face Postgresql database system. That satisfied the above concerns because Blue Face invested in the necessary hardware and employed an engineer to look after it. The sipsorcery service, which commenced in July 2009, deliberately separated itself from Blue Face’s infrastructure due to business, legal and other non-technical reasons and instead moved to Amazon’s EC2 cloud computing infrastructure. The sipsorcery service currently uses a single server instance which holds all the SIP application servers AND a MySQL database. That means the database is not redundant, and there was a painful incident at the start of sipsorcery’s existence where a misconfiguration (by me) resulted in all the MySQL data being lost.

MySQL was used for sipsorcery instead of Postgresql because it has better support for replication, clustering and load balancing; none of which Postgresql really supports out of the box or without jumping through a lot of hoops. As it turns out there are quite a few hoops in the MySQL case as well. The sipsorcery service requires a master-master replication strategy so that two server instances can operate independently of each other but still share data. The MySQL recommended option in that case is MySQL Cluster, which needs a minimum of 6 servers! Using 6 servers for the sipsorcery database is prohibitive from a cost and admin point of view.

The next idea was to use Amazon’s SimpleDB. It’s not a relational database and is instead more like a big bucket that applications can drop small bits of data into and then request back at a later stage. It does have some rudimentary querying capability but there are big differences between it and a relational database. Since the sipsorcery database requirements are very rudimentary the Amazon SimpleDB service was largely able to satisfy them, and I was able to get to a stage where a development sipsorcery server was able to successfully operate using SimpleDB as the data store. There were still a few concerns, one being how well it would operate under load given that all communications with SimpleDB must be HTTP over SSL, which is significantly slower than the normal TCP connections used with a relational database. Another was that I was starting to rely more heavily on the querying capability of the MySQL database to shut down abusive sipsorcery accounts, and switching to SimpleDB would have meant diverting a lot of effort to constructing equivalent detection tools.

Right at that time Amazon released their Relational Database Service product, and when I got the introductory email I thought it was going to be the perfect solution for sipsorcery. However once I dug into the specifics it turned out the RDS service is no more than a MySQL server running on a single EC2 instance, and the replication, clustering and load balancing are supposedly coming in the future.

Through looking at some other cloud solutions as a consequence of a 2 day outage of the sipsorcery service I re-visited the Microsoft Azure services, and looking more closely at the SQL Azure service I realised it was claiming to be everything the Amazon RDS service should be. The SQL Azure service is still in a testing phase but is open to developers, so I signed up for a test account and once that was enabled I was able to get a development sipsorcery server up and running with it in no time. At this point I have my fingers crossed that the SQL Azure service will work out and be as reliable and scalable as Microsoft hope, because it really does solve a lot of problems for the sipsorcery service.

Once the data storage needs had been satisfied there was still some development work to make the sipsorcery service work properly when deployed over multiple servers. The original mysipswitch service was actually a single process. At the end of 2007 the memory leaks in the Ruby dialplan processing quickly forced the separation of the different servers into their own processes. Now in 2009 the unreliability of the Amazon EC2 instances has forced the further separation into multiple server agents on different machines. For a few of the services it’s not an issue; for example the SIP Registrar simply processes any REGISTER requests it receives and updates the database, it does not need to know or care if there are other SIP Registrars operating in parallel. The SIP Registration Agent on the other hand needs a mechanism to ensure that if there are multiple agents operating they aren’t both registering the same SIP Provider accounts. The most difficult aspect is calls: specifically, if a SIP account registers through the SIP Proxy on Server1 and a call on Server2 then needs to be forwarded to that account, it must go out through Server1’s Proxy and not Server2’s. That’s because the end-user SIP account almost always has a NAT in front of it and the NAT will drop any packets from a server it hasn’t already had a transmission with. The requirement then is to make Server1 and Server2 aware of each other and configure them to treat calls from each other’s Application Servers appropriately. That’s the chunk of work that I have recently completed and that is now working.
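
The routing rule itself is easy to state, and the little sketch below is only meant to illustrate it; the class and field names are invented for the example and are not sipsorcery’s real object model.

// Illustrative only: when forwarding a call to a registered SIP account, the decision
// is based on which proxy the binding was registered through, because the user agent's
// NAT will drop packets arriving from any other source address.
class SIPRegistrarBinding
{
    public string ContactURI;        // e.g. "sip:100@192.168.1.5:5060" (private address behind NAT)
    public string RegisteredAtProxy; // socket the REGISTER arrived on, e.g. "174.129.234.254:5060"
}

class CallForwarder
{
    private readonly string _localProxySocket;

    public CallForwarder(string localProxySocket)
    {
        _localProxySocket = localProxySocket;
    }

    public bool MustRelayViaOtherServer(SIPRegistrarBinding binding)
    {
        // If the binding registered through a different server's proxy, the INVITE has to be
        // handed to that server so it leaves from the socket the NAT already has a mapping for.
        return binding.RegisteredAtProxy != _localProxySocket;
    }
}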

I plan on doing a bit more testing as well as watching how the SQL Azure database service performs over a longer period. At the moment, if anyone is interested, for whatever obscure reason, in having a play with the test servers they are running under the sipwizard.net domain and the two servers are on 174.129.234.254 and 174.129.3.26. I have configured SRV records for sipwizard.net so if your SIP device supports them you can try things like blocking one or other of the IP addresses on your firewall and making sure calls still get through or your device can still register. Note that the sipwizard.net service is completely separate from sipsorcery.com and you’ll need to set up a new test account at www.sipwizard.net. Also the servers are only for testing and WILL be taken down at some stage in the next week, as well as being subject to my own testing, so don’t rely on them being around for long.

I’ve just updated the IronRuby and DLR (Microsoft Dynamic Language Runtime) libraries on the sipsorcery.com Application Servers. I’m interested to see if there is any change in the memory leak behaviour. As far as other changes go I hope there aren’t any that break anybody’s dialplans, and I’m not expecting there to be. One benefit I noticed was that the conversion between .Net and Ruby types is now seamless so there’s no more need for to_s and to_i when moving between them.

As an example, the snippet below, which didn’t previously work, now does.

if req.URI.User =~ /300/
  sys.Log("The monkeys are on the way.")
end

I was hunting around on the Amazon EC2 forums regarding an issue I’m having bundling a new AMI and decided I’d do a quick search to see if anyone else was having issues with unresponsive instances caused by DHCP leases. Lo and behold there are quite a few and they seem to be growing. This thread is a fairly typical example: Instance not responding. I didn’t think to search the forums previously, which in hindsight was pretty silly, as I was logging the issue with Amazon premium support and assumed they would be a better source of information than the public forum.

It’s been a tough battle to get the sipsorcery server stable with the memory leak issues in the DLR or IronRuby library, it’s not clear which despite some extensive investigations, so to finally solve that only to then have the underlying server start causing issues is annoying to say the least.

Despite the EC2 issues it’s definitely a worthwhile goal to have two redundant sipsorcery servers so I’ll keep working towards that, but if the instability on EC2 continues then I may have no choice but to migrate as soon as something better becomes apparent.

Update 16 Dec 2009: Had yet another recurrence of the sipsorcery.com EC2 instance losing its DHCP lease and becoming inaccessible. I’m going to add every incident to this thread on the EC2 forums: Instance not responding.

Now that Google have bought Gizmo it’s my guess that there will be a whole lot of work commencing at Goozimo (Google+Gizmo) around peer-to-peer SIP. The reason is that Goozimo will want to compete head on with Skype to become the communications solution of the masses and the only way to effectively compete is to switch from a centralised SIP model to a hybrid peer-to-peer SIP model.

Why? A centralised SIP model starts to get very expensive and very cumbersome when you have to proxy media. VoIP providers can just get away with it now when voice is the media they carry, but what happens once that shifts to video and then high definition video and then multi-party, high definition video and … well you get the idea, the media payload sizes are just going to keep on growing. Even a company the size of Google, which probably has per Mbps bandwidth costs in the sub USD1 range, will struggle to absorb that level of traffic. And apart from the cost it’s just not a good architectural solution to take something which is inherently peer-to-peer, two end users talking to each other, and turn it into peer-to-server-to-peer. VoIP providers at the moment don’t worry too much about it because unlike Skype the majority of their traffic is from end users to their PSTN gateways and the percentage of end user to end user calls is negligible. Like the media payload sizes, that too will change in the coming years.

The problem for Goozimo is that SIP is poorly designed for peer-to-peer communications. I’ve harped on about this before but it’s always worth reiterating that the SIP designers must have been out on the booze the night before they were due to draw up the methods for dealing with NAT because they just left it out completely, a cardinal sin for any internet protocol. The only reason SIP has prospered with such a massive deficiency is that the only real competing protocol H.323 was designed by PSTN engineers and is even worse.

What can Goozimo do about it? That’s the question. Ideally they’d like to throw SIP away and do things properly and use a proper internet protocol such as XMPP. However that’s not really an option because of the size of the deployed SIP user base coupled with the manufacturing momentum behind it. It’s the same reason Google needed Gizmo rather than just expanding their GTalk service, the Google engineers know XMPP not SIP. My bet is that Goozimo will add to the plethora of SIP “enhancement” standards which must already be nearing triple figures and add a set of features that will make it easier to deal with NAT.

Will peer-to-peer SIP work? Yes, it has to. It’s either that or start from scratch with an alternative protocol and we’re already too far gone for that to happen: IPv4 and IPv6 being a case in point.

How will it work? The saving grace in the whole mess is that the solution probably isn’t that difficult. SIP developers have now had a lot of experience in dealing with mangling the private IP addresses in SDP payloads and can let each end of the SIP call know the socket the media should be sent to. The missing piece is the NAT in front of each SIP user agent allowing the media through. Most NATs will only permit an incoming packet through if there has already been an outgoing packet sent to the originating socket. That can be a problem when both ends of a SIP call are behind NAT and the port on one or both of the user agents has been re-mapped by NAT. Essentially that’s the problem that needs to be solved for peer-to-peer SIP to start working. There are different ways it could be done, but it’s not so much about the technical solution used as about a company the size of Google getting behind it and encouraging manufacturers to roll it out.
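
To give a feel for the mangling half of the story, the simplified sketch below rewrites the private connection address in an SDP payload with the public address the SIP request actually arrived from. It is deliberately bare bones and is not sipsorcery’s real mangling code.

using System.Text.RegularExpressions;

// Simplified sketch of SDP mangling: replace the private connection address a user agent
// put in its SDP with the public address the SIP request was actually received from.
static class SdpMangler
{
    public static string Mangle(string sdp, string publicIPAddress)
    {
        // Rewrites the connection line, e.g. "c=IN IP4 192.168.1.5" -> "c=IN IP4 203.0.113.10".
        // A real implementation also has to handle media-level c= lines and, where the NAT
        // has re-mapped the port, the port in the m= line as well.
        return Regex.Replace(sdp, @"c=IN IP4 \d+\.\d+\.\d+\.\d+", "c=IN IP4 " + publicIPAddress);
    }
}

// Example:
//   string mangled = SdpMangler.Mangle(originalSdp, "203.0.113.10");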

Interestingly a lot of NATs do not re-map ports by default, at least not until there is a conflict and they are forced to, and therefore SIP P2P calls are already quite feasible. The reason they are not more widespread comes back to the motivation of the commercial SIP Providers which is to get billable calls into their PSTN gateways. Supporting P2P SIP calls is not going to generate any revenue for them.

The sipsorcery service on the other hand is very interested in P2P SIP calls since, being a free service, it cannot afford the extra cost of proxying media. In fact every single call that has ever been placed through the mysipswitch/sipsorcery services has been a P2P one. Generally the calls are between an end user and a SIP Provider, but as far as the sipsorcery server is concerned the SIP Provider is just another end point and it treats the call exactly the same as if it were between two end users. A sipsorcery call between two end users will still have an issue if the SDP ports have been re-mapped by the NAT at either end, but in practice that seems to be a small percentage of calls.

What does a P2P SIP call mean for an end user? The answer is not much for voice, it’s still better to stick those through a 3rd party provider to take advantage of the NAT handling in their gateway, but for video it’s a different story. Most people I know that use video calls use Skype and all report that the video often drops or is choppy. The reason is that Skype’s P2P overlay network still relies on super nodes, which are just other Skype users with good bandwidth in close network vicinity, to proxy the media. When the Skype network gets busy there will be increased contention on the overlay network and the media will suffer. The ideal situation is for the media to travel directly between the two end user agents and not be proxied by anyone. In a SIP network that provides the added advantage that the end user agents can find the best matching media capability between themselves, rather than, as is currently the case, the best matching capability between the two end user agents and the SIP Provider’s server.

As a practical example I have tested video calls with Counterpath’s Bria softphone through the sipsorcery.com service and it works very well. The video capability in the Bria is better than Skype’s, and while there can still be chop and break-up on the video at least now it’s down only to the internet connections at either end of the call rather than also those of the Skype supernodes.

There is one trick to getting the Brias to work with sipsorcery and that is to ensure the call is made as a video call initially and not as a voice call followed by an attempt to start a video one. In the latter case the re-INVITEs can end up with the wrong IP addresses as the sipsorcery server does not mangle in-dialogue requests. If the call is placed as a video one straight away the sipsorcery server will mangle the initial SDP and the RTP carrying the video has the best chance of getting established. The diagram below shows the “Video Call” button that appears when the Bria is switched to video mode; it is the one that should be used to place calls through sipsorcery.

Because of the sipsorcery.com outage last week, a result of the Amazon EC2 instance losing its network connection, I got motivated to look around at the other cloud computing environments, and while I didn’t find anything that matches Amazon’s flexibility one thing I did look into in depth was Microsoft’s SQL Azure offering.

SQL Azure is a hosted instance of Microsoft’s SQL 2008 database with all the really difficult issues like clustering, replication and load balancing supposedly taken care of. That means from an application point of view the database can be treated as a nice big bucket to get things to and from. Compared to Amazon’s Relational Database Service SQL Azure is far superior, because the Amazon RDS is simply a single instance of a relational database with none of the aforementioned clustering, replication and load balancing. The Amazon RDS would not offer a lot above sipsorcery’s current database approach where a MySQL instance is running on the sipsorcery.com server.

I have discussed this problem in the past and in fact have spent quite a lot of time over the last 3 months working on a way to allow Amazon’s SimpleDB to be used as sipsorcery’s data store. While I did achieve that goal there are just enough shortcomings in the SimpleDB offering to keep me a bit hesitant about switching over to it. Its querying capability is very limited compared to a relational database, but the biggest worry is that all communications with it are over HTTPS, which introduces quite a bit of latency. Now it’s probably not enough to be a showstopper, in my initial testing I observed a delay of between 500 and 800ms when making a call over and above the current situation, but enough to be slightly concerned about.

Out of interest I decided to try and use SQL Azure as the data store for a sipsorcery instance and see what additional latency that would introduce given that the database communications would now need to travel between Amazon’s and Microsoft’s data centres. The results were very heartening and were in the order of 100 to 200ms for call set up, which is only just perceptible for a user making the call. That, coupled with the fact that SQL Azure would still provide the powerful querying capability of a relational database, has placed it firmly in place as the favoured sipsorcery data store option. The development work to use SQL Azure was less than a day, which was because all data access in the sipsorcery servers now uses LINQ and Microsoft include Linq-to-SQL as a core library in the .Net framework, another big advantage.
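
To give a feel for why the switch was so quick, the sketch below shows the general shape of Linq-to-SQL data access: the query code doesn’t change, only the connection string pointing at SQL Azure does. The entity, table name and connection details are invented for the example and are not sipsorcery’s real schema.

using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.Linq;

// Invented example entity; the real sipsorcery entities differ.
[Table(Name = "SIPAccounts")]
public class SIPAccount
{
    [Column(IsPrimaryKey = true)] public string Username;
    [Column] public string Domain;
}

class Example
{
    static void Main()
    {
        // Swapping the data store is (in principle) just a connection string change;
        // the server name and credentials below are placeholders, not real values.
        string sqlAzureConnStr =
            "Server=tcp:myserver.database.windows.net;Database=sipsorcery;" +
            "User ID=myuser@myserver;Password=...;Encrypt=True;";

        using (var db = new DataContext(sqlAzureConnStr))
        {
            var accounts = db.GetTable<SIPAccount>();
            int count = accounts.Count(a => a.Domain == "sipwizard.net");
        }
    }
}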

Now that the SSH console is largely complete the next job I am going to tackle is to try and get two sipsorcery servers running side by side in a redundant mode so that if an incident like last week’s does occur and an instance becomes inaccessible the other one will still be available. Up until last week I was going to do that by using Amazon’s SimpleDB but if things continue to go smoothly with SQL Azure it will eventually become the new sipsorcery data store.

Access to the sipsorcery.com log messages is now available using SSH. To access them simply ssh to sipsorcery.com and log in using the same username and password you log in to the Silverlight GUI with.

The reason it’s taken a while to get SSH integrated is that yet another bleeding edge open source project has been used, in this case NSsh (many thanks to Luke Quinane the project founder), and there have been a few teething issues to overcome. Specifically the NSsh server needed to have a few extra access control mechanisms added to it in order to be able to survive on the internet. SSH being a well known service attracts a lot of attention from script kiddies trying all sorts of exploits such as buffer overflows, malformed packets and denial of service.

One consequence of the SSH server being so new is that I have limited the number of simultaneous clients it will accept to 20, and no more than 2 from any one IP address. So while it’s now open for connections to anyone that wants to monitor their sipsorcery messages, if you get an immediate disconnect when you attempt to reach it that will most likely be because it’s busy. Once it’s proven itself and I have a better idea of the load it generates I’ll hopefully be able to lift the limits.
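
For the curious, enforcing that sort of limit only takes a handful of lines; the sketch below is illustrative only (the numbers match the limits above but this is not the actual NSsh access control code).

using System.Collections.Generic;
using System.Net;

// Sketch of the connection limits described above: at most 20 clients in total
// and no more than 2 from any single IP address. Thread safety via a simple lock.
class ConnectionLimiter
{
    private const int MaxTotal = 20;
    private const int MaxPerIP = 2;

    private readonly object _sync = new object();
    private readonly Dictionary<IPAddress, int> _perIP = new Dictionary<IPAddress, int>();
    private int _total;

    public bool TryAccept(IPAddress client)
    {
        lock (_sync)
        {
            _perIP.TryGetValue(client, out int existing);
            if (_total >= MaxTotal || existing >= MaxPerIP) return false;

            _perIP[client] = existing + 1;
            _total++;
            return true;
        }
    }

    public void Release(IPAddress client)
    {
        lock (_sync)
        {
            if (_perIP.TryGetValue(client, out int existing) && existing > 0)
            {
                _perIP[client] = existing - 1;
                _total--;
            }
        }
    }
}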

Update: I neglected to mention that I have only tested the server with Cygwin (openssh) and Putty clients and public key authentication is not supported.

An outage of the sipsorcery service occurred for almost exactly 48 hours between the 11th and 13th of November. The cause of the outage is not exactly known but it is the same as the previous 5 outages, the most recent of which was on the 22nd of October. I’m pretty sure the issue is at the operating system level and possibly something to do with the Windows virtualisation configuration being used by Amazon’s EC2 cloud. I’ve had a ticket open on the issue with Amazon since the first occurrence but they have not been able to identify anything wrong and apparently the issue isn’t occurring for anyone else. The last message in the Windows event log prior to this and the other outages is along the lines of:

11/12/2009 11:27:18 AM: EventLogEntry: Error 11/12/2009 11:27:13 AM Dhcp Your computer has lost the lease to its IP address 10.248.58.129 on the
Network Card with network address 123139023573.

That seems fairly clear-cut, but neither I nor Amazon support have been able to work out why the DHCP lease attempt fails. In addition, since the last incident I have turned on firewall logging for the sipsorcery server’s Windows firewall to see if it could shed any further light on it. From looking at it there is a big gap of over 7 hours where there are no messages logged, which I would guess means the network subsystem was shut down altogether, but the rest of the time there are a lot of connections being established to the DNS server and it’s a mystery why the sipsorcery SIP and other traffic could not be sent or received.

As to why I wasn’t around to fix it I was on a 3 day break and more by design than chance happened to be somewhere where there was no electricity grid let alone mobile signal or internet.

[Embedded Google Map: Bruny Island, Tasmania]

I wasn’t expecting an incident in the 3 days I was away as statistically they have been averaging about one a month and it would be unlucky for that one time to coincide with me being away; however, unfortunately that’s what happened.

As to what’s being done about it, the answer is in the previous post about incorporating Amazon SimpleDB as the storage layer. Without repeating that and earlier posts: once that job is done it will be possible to have two redundant sipsorcery servers running, so if an operating system incident like this occurs then the other server will still be available. It’s a big job and goes beyond just switching the data access layer software; for example a number of the sipsorcery services, such as the monitoring, need to be aware of the different instances running on each server. I’ve been working on these tasks flat out for over 2 months now and am getting there.

The other question that could be asked is why stick with Amazon’s EC2 if this issue is at the OS layer and Amazon support can’t help identifying it. That is something I have pondered a fair bit as well. The Amazon EC2 instances aren’t that cheap at the end of the day and there are other compute cloud environments out there. However the Amazon EC2 infrastructure is the oldest and therefore most mature of the clouds and also has by far the best strategy with new services being regularly introduced. I also suspect that shifting to another cloud could just as easily involve introducing the same sort of operational issue and given the amount of effort I have already put into working with the Amazon infrastructure it’s definitely a case of “better the devil you know”.

Finally, this does really highlight how vulnerable the sipsorcery service is due to having only one developer/administrator. This particular issue is solved by a reboot of the server, but it’s not as simple as giving someone a username and password so they can remotely access and reboot it. Anyone with that access can potentially gain access to all the sipsorcery user information, so it needs to be a suitably trusted person. Ideally what I’m hoping for is for a C# developer with an interest in SIP/VoIP to come along and, once a level of trust has been established and they have shown they understand the technology (so that they don’t go rebooting every time someone posts about an ATA issue), for that person to be given admin rights to the sipsorcery server(s). That being said I’m open to any other suggestions about how the sipsorcery service could be run or administered for the benefit of everyone, provided any such suggestion takes into account the need for a high level of trust and security.

Addendums

    Monitoring, heartbeat etc: The sipsorcery server is externally monitored by a completely separate virtual server running in Dublin, Ireland (it’s an extra job on the blueface.ie monitoring server). I get an SMS and email whenever the server does not respond to 10 consecutive SIP OPTIONS requests, which are sent to it every 5 seconds (a rough sketch of this kind of heartbeat monitor follows this list). Most of the time I will then investigate the issue to check if it’s transient, network related or some other anomaly, and then if needed reboot the server. Automatic server reboots based on an external condition(s) are a BAD idea. One, it means the issue will be left unresolved since it’s easier to just let the reboot handle it, and two, the server can end up in an endless reboot cycle if an unforeseen combination of circumstances occurs.
    DHCP: The Amazon EC2 (Elastic Compute Cloud) allocates dynamic IP addresses via DHCP to all virtual hosts. There is no way to circumvent DHCP with EC2. In the past static IPs were available, and it was actually a bit of a headache for sipsorcery to be modified to work behind the Amazon NAT since the SIP protocol is very inept at dealing with it. There is also no way to check network cards, cables etc, at least not by anyone except Amazon’s data centre staff. The server sipsorcery runs on is a virtual instance that shares the underlying physical hardware with other virtual instances on Amazon’s EC2. According to the support ticket I logged with Amazon the physical hardware has been checked and it is operating correctly. As to why the same DHCP issue keeps cropping up neither they nor I know, but my bet would be that it’s software not hardware related.
    3rd party registrations disabled: A number of people have noted that when the sipsorcery server came back up a number of their 3rd party registrations had been disabled with an error message that the provider host could not be resolved in DNS. This behaviour is by design and is necessary. I still find it amazing what ends up in certain fields for provider information, and invalid or non-existent hostnames can result in a lot of unnecessary work for the sipsorcery registration agent. In this case the providers disabled had genuine host names, but because of the networking issue on the Amazon EC2 instance DNS resolutions appear to have been sporadically failing and providing false results to the sipsorcery registration agent.
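
For anyone curious, the heartbeat monitoring mentioned in the first addendum can be as simple as the sketch below: send a SIP OPTIONS request every 5 seconds and alert after 10 consecutive missed responses. The target address, timings and the bare-bones OPTIONS request are illustrative; this is not the actual monitoring code running on the blueface.ie server.

using System;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;

// Illustrative heartbeat monitor: send a SIP OPTIONS request every 5 seconds and
// raise an alert after 10 consecutive missed responses. Not the real monitoring code.
class SipHeartbeat
{
    static void Main()
    {
        var target = new IPEndPoint(IPAddress.Parse("174.129.234.254"), 5060); // example address
        int consecutiveFailures = 0;

        using (var udp = new UdpClient())
        {
            udp.Client.ReceiveTimeout = 2000;

            while (true)
            {
                string options =
                    "OPTIONS sip:sipsorcery.com SIP/2.0\r\n" +
                    "Via: SIP/2.0/UDP 0.0.0.0;branch=z9hG4bK" + Guid.NewGuid().ToString("N") + "\r\n" +
                    "From: <sip:monitor@invalid>;tag=mon\r\n" +
                    "To: <sip:sipsorcery.com>\r\n" +
                    "Call-ID: " + Guid.NewGuid() + "\r\n" +
                    "CSeq: 1 OPTIONS\r\n" +
                    "Max-Forwards: 70\r\nContent-Length: 0\r\n\r\n";

                byte[] buffer = Encoding.UTF8.GetBytes(options);
                udp.Send(buffer, buffer.Length, target);

                try
                {
                    IPEndPoint from = null;
                    udp.Receive(ref from); // any response at all counts as alive
                    consecutiveFailures = 0;
                }
                catch (SocketException)
                {
                    if (++consecutiveFailures >= 10)
                    {
                        Console.WriteLine("Server unresponsive, send the SMS/email alert here.");
                        consecutiveFailures = 0;
                    }
                }

                Thread.Sleep(5000);
            }
        }
    }
}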

On Saturday (two days ago) I decided to upgrade the sipsorcery service to the latest version of my code. Prior to that apart from a few minor updates around dialplan processing the sipsorcery software hadn’t been updated for nearly two months. The impetus for Saturday’s update was to take another step closer to getting the service ready to migrate from MySQL to Amazon’s SimpleDB.

There were two major parts to the upgrade:

    1. The data access layer code has been rewritten to add SimpleDB as a persistence option in addition to an SQL RDBMS and XML. As part of that rewrite the DbLinq library was removed and a custom, very simplified Linq-to-SQL library was written. The main reason to replace DbLinq was performance. I’d already had to revert to a mechanism of using raw SQL for certain high volume queries on the live sipsorcery site, and given that the sipsorcery service only requires extremely simple SQL queries I decided to see if I could come up with a smaller, simpler and hopefully faster Linq-to-SQL implementation.
    2. The timestamp fields in the MySQL database needed to be converted to varchar so that when the time comes the data can be migrated to SimpleDB. With SimpleDB, select queries are all string based, and as sipsorcery relies heavily on timestamp fields for a lot of its operations the format they are stored in had to be changed.

I was a little bit worried about the update as anytime database schemas need to change on a running system it can be a bit hairy, and replacing the data access layer software in-situ is also not for the faint hearted. However everything went surprisingly well. There were no complications and the only noticeable effect on the system was that calls were not processed for about a minute.

Subsequent to the upgrade a few minor bugs cropped up. One that appeared to cause a few ATAs issues was that the datetime format in the SIP header fields was accidentally changed to the round-trip format, which is what the database now uses and looks like 2009-06-15T13:45:30.0900000, instead of the format mandated by the SIP standard which looks like Sun, 8 Nov 2009 12:12:21 GMT. Apparently that can cause some ATAs to reject responses and have other weird and inexplicable consequences.
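
For reference, in .Net the two formats come from different format specifiers, which is what makes the mix-up easy to make; a small illustrative example:

using System;
using System.Globalization;

class DateFormats
{
    static void Main()
    {
        DateTime now = DateTime.UtcNow;

        // Round-trip ("o") format: string-sortable, which is why it suits the varchar
        // timestamp columns destined for SimpleDB's string-based selects.
        string dbFormat = now.ToString("o");   // e.g. 2009-11-08T12:12:21.0900000Z

        // RFC 1123 ("r") format: what SIP Date headers expect.
        string sipFormat = now.ToString("r");  // e.g. Sun, 08 Nov 2009 12:12:21 GMT

        Console.WriteLine(dbFormat);
        Console.WriteLine(sipFormat);

        // Parsing a round-trip string back preserves the exact value.
        DateTime parsed = DateTime.Parse(dbFormat, null, DateTimeStyles.RoundtripKind);
    }
}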

A good side effect to come from the upgrade was that the new simplified Linq-to-SQL implementation has reduced CPU consumption by nearly half compared to what the system was getting using DbLinq. That’s good news because the sipsorcery server was starting to creep back up over 80% average utilisation as more people have started using it. Now it’s back down to an average utilisation in the low 40s.

[Chart: SIP Sorcery CPU utilisation with new Linq-to-SQL implementation]

So a big step has been taken towards using SimpleDB as the data store for sipsorcery, which will result in the ability to run two redundant sipsorcery servers sharing the same data. The next step is to get the SSH and HTTP duplex monitoring sorted out, add in encryption for sensitive database fields and then do a bit more testing to make sure the extra latency of SimpleDB requests is being handled appropriately.

When mysipswitch was first conceived it was one of those things that could easily have ended up only lasting a few months, so the software was written very much as a prototype and wherever a shortcut could be taken it was. One of those shortcuts was to allow access to the mysipswitch real-time log messages via telnet. The problem with telnet is that it transmits data in cleartext, meaning anyone able to capture the authentication packets can obtain usernames and passwords. The choice for mysipswitch at the time was to provide the log messages over telnet or not provide them at all, since incorporating a more secure mechanism would have taken more time and effort than was available.

In practice it’s actually quite difficult for an arbitrary attacker to capture packets on the internet. Sure, it’s easy on a compromised PC, on a LAN or for a network admin at an ISP, but apart from that it’s a lot of effort. An attacker who has gone to that level of effort and has compromised a core router or similar is likely to be looking for a return on investment and is going to be after credit card details, online banking passwords etc. and very unlikely to be interested in usernames and passwords for an esoteric SIP aggregator service.

Now that I have a little bit more time on my hands, over the last two or so months I have been incorporating an SSH daemon into sipsorcery to replace the telnet one. In addition the console monitoring available in the Silverlight client has been modified to allow it to work over a secure channel. Previously it used a plain text TCP socket connection, and while no usernames and passwords were transmitted over it a time limited authorisation token was, along with the log messages from the sipsorcery server. To fix that a new HTTP duplex mechanism has been added so that the login and log messages to and from the Silverlight client will now be transmitted over SSL. It’s proven somewhat tricky to implement the HTTP duplex mechanism as HTTP is a request/response protocol and doesn’t lend itself to pushing information to the client. At the moment it works in spurts but then IIS will give up the ghost, decide it can’t talk to the sipsorcery monitoring service and refuse to process any requests. I’ve disabled the server side of the logging while I continue to work on it but am hopeful it will be sorted out soon.
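
To illustrate the general shape of the problem, the sketch below is a generic long-polling client: because HTTP is request/response, the client keeps a request outstanding that the server only answers when it has a log message to push (or a timeout expires). It is not the actual Silverlight/IIS duplex plumbing, and the URL is made up.

using System;
using System.IO;
using System.Net;

// Generic long-polling sketch: the client repeatedly issues an HTTPS request that the
// server holds open until a log message is available (or a timeout expires), giving
// the appearance of the server "pushing" messages. The URL below is invented.
class LogPoller
{
    static void Main()
    {
        while (true)
        {
            var request = (HttpWebRequest)WebRequest.Create("https://sipsorcery.example/monitor/poll");
            request.Timeout = 60000; // the server can hold the request open for up to a minute

            try
            {
                using (var response = (HttpWebResponse)request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    string logMessage = reader.ReadToEnd();
                    if (logMessage.Length > 0) Console.WriteLine(logMessage);
                }
            }
            catch (WebException)
            {
                // Timeout or transient error; back off briefly and poll again.
                System.Threading.Thread.Sleep(1000);
            }
        }
    }
}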

The SSH monitoring is working, but it too has been restricted as there is an issue where, when it is probed by the script kiddies doing their never-ending trawls across the internet on port 22, the sipsorcery SSH daemon freezes up. It’s new software so that’s not entirely surprising, but to avoid the whole sipsorcery service being affected I’ve limited access to the SSH port. Again I’m hopeful to have the problem sorted out in the near future.

While this work is ongoing the telnet monitoring option has been turned off and, barring any unforeseen complications, will be gone for good.