July 2009

You are currently browsing the monthly archive for July 2009.

When I placed my sipsorcery test call this morning I got the fast busy tone. It wasn’t a shock given the issues that still exist with the dialplan processing and I assumed it must have been a memory or thread leak that got out of hand overnight.

When I logged into the sipsorcery server and checked the App Server process it’s resource utilisation was low which was a suprise. Checking the logs revealed an error message connecting to the MySQL database stating that the database already had too many connections.

That’s a new one. The issue was fixed by restarting the App Server so it was definitely that process that had a large number of connections open. The question is what opened them all? There is a new dialplan application that allows database queries but it should close each connection after use. This is another issue that will require some more investigation.

Update: Re-occurrence of this issue at around 25 Jul 2009 2200 UTC. I had left the MySQL admin console open so I could check which connections were open however it seems to have had its connection knocked off in the meantime so still none the wiser about the cause. Will now try restricting the number of connections on each of the different connection strings.

To date every attempt to solve or identify where the memory leak is coming from when processing Ruby dialplans has met with failure. That’s despite a concerted effort analysing and tracing the dialplan processing including the purchase of a relatively expensive memory profiling tool. The latest attempts seemed to be going somewhere initially by removing the Thread.Aborts – which seemed to be leaving something in the scripting engine in a bad state leading to the leak – and instead interrupting the execution of the engine by setting a variable.

The interrupt execution approach seemed to work fine both in testing and in the first few days on sipsorcery.com but then the leak re-appeared and even worse some weird things started happening where  log messages from one dialplan execution appeared in another dialplan’s execution. That’s bad and my guess is that within the internals of the DLR (Dynamic Language Runtime) the script interpreters are being cached and interrupting the execution of the interpreters left some internal states awry.

It’s somewhat difficult to troubleshoot and test the dialplan execution because once the script starts getting interpreted then there is no easy way to debug through it. The lines of script get converted into lambda methods and executed as a series of instructions on the interpreter. It’s a bit like debugging a C application in assembly mode, i.e. next to impossible for all but the most trivial application.

I’ve put together a timeline of the issue so I can one day look back on it with fond memories when the issue is eventually solved.

  • May/Jun 2008 Memory leak first appeared on mysipswitch.com, mysipswitch forums: Instability Jun 2008
  • 6 Jul 2008 Isolated leak to the dialplan processing and IronRuby engine, IronRuby forums: Memory Leak with Certain Script Exceptions
  • 31 Jul 2008 Stage 1 of software upgrade to separate mysipswitch server agents mysipswitch forums: Software Update 31 Jul 2008
  • 6 Sep 2008 Stage 2 of software upgrade to separate mysipswitch server agents mysipswitch forums: SIPSwitch Upgrade – 6 Sep 2008 and mysipswitch blog: Pear Skidding
  • Sep 2008 to Jun 2009 sipsorcery.com upgrade from mysipswith.com under heavy development. No further investigation into the dialplan processing leak undertaken with the hope being that later versions of IronRuby, which is also under heavy development, would not have the same behaviour when interacting with sipsorcery,
  • 24 Jun 2009 sipsorcery.com went live,
  • 7 Jul 2009 call volume increases on sipsorcery.com and memory leak behaviour in dialplan processing observed. Undertook investigation of the leak this time debugging into the IronRuby and DLR libraries,
  • 9 Jul 2009 Hypothesis that Thread.Abort call to halt completed dialplans was causing the leak. Delved into DLR design and discovered “interrupt execution” which is the theoretical design solution to the problem IronRuby forums: Interrupt Execution and DLR forums: Interrupt Execution,
  • 11 Jul 2009 Upgraded sipsorcery.com to use an interrupt execution apporach in the DLR library and removed the Thread.Abort calls,
  • 15 Jul 2009 Memory leak behaviour on sipsorcery.com coupled with crossover of dialplan log messages,
  • 16 Jul 2009 Removed the interrupt execution changes made to the Microsoft.Scripting (DLR) assembly which means dialplans will not be terminated and will be left to run to completion,
  • 17 Jul 2009 Memory leak behaviour caused app server to become unresponsive and not process calls,
  • 18 Jul 2009 Initial implementation work for call-dispatcher/isolated-process approach for dialplan processing.
  • 23 Aug 2009 Dial plan processing failed (“Long Running Dialplan” message).

One of the new features of sipsorcery is the ability to create multiple dialplans. This means it’s now a lot easier to do a bit of testing if you’re struggling with a bit of Ruby script.

As an example say you have an issue with a regular expression match (a topical subject on the forums at the moment) you could create a new dialplan and put some minimal regular expression matching code in it.

sys.Log("webcallback dialplan starting...callback number=#{req.URI.User}")
case req.URI.User
   when /*1234$/ then sys.Log("#{req.URI.User} matched /*1234$/")
   else sys.Log("No match")
end

In this case I have used my webcallback dialplan so that I can do a test by invoking the callback URL
instead of having to call into my dialplan.

To invoke the above dialplan I used the following URL, you will need to put in your own sipsorcery username for it to work for you.

http://www.sipsorcery.com/callmanager.svc/webcallback?user=username&number=*1234

I got the dialplan trace emailed to me and unsuprisingly the log messages were:

DialPlan=> Dialplan trace commenced at 16 Jul 2009 18:36:08:180.
DialPlan=> webcallback dialplan starting…callback number=*1234
DialPlan=> *1234 matched /*1234$/

I just did my customary sipsorcery test call after getting up this morning and it failed, damn. Terminal serviced into the sipsorcery server and the App Server was running with around 530MB of memory (way too much indicating a leak), 55 threads and 22% CPU. The memory and the threads are nothing new and they used to go a lot higher. However the CPU has never flat lined like that before.

I saved the logs and will trawl through them after work. My suspicion is that having removed the mechanisms to interrupt the execution of dialplan scripts to try and get rid of the memory leak and inconsistent dialplan processing a dialplan has been used that was long running and did need to be cancelled. Probably it was someone inadvertently putting their script into an infinite loop. To be honest I expected it to happen but was hoping for it to be later rather than sooner.

It’s back to the drawing board over this weekend to try and come up with a different solution for dial plan execution interrupt.

Sigh, sigh, sigh….

v1.0 release

I’ve just uploaded a v1.0 release of sipsorcery that’s designed for local installs. The release is available at sipsorcery v1.0.

For over a year now the mysipswitch service has suffered from instability due to an issue isolated to the processing of Ruby dialplan scripts.

At the time the problem first came to my attention it took the best part of a month with lots of memory profiling to nail the issue down and then put in improvements to contain it. The symptoms of the problem were large memory consumption followed ultimately by the script processing failing to initialise and therefore going nowhere before being cancelled by the monitoring thread with the dreaded “Long running dialplan script was terminated” message.

The issue was mitigated against by separating out the mysipswitch functions into different processes and scheduling automatic restarts of the Application Server process that was responsible for the dialplans. This bought another 6 or so months of reliable use of mysipswitch but eventually as usage increased a daily auto-restart was not enough and more frequent restarts were required. Unfortunately that did not work well either as sometimes the Windows Service would fail to start properly and the restarts had to be left at every 24 hours.

The hope was that the problem in the IronRuby library would be fixed in the interim and would solve the issue on mysipswitch when it evolved to sipsorcery.

When the sipsorcery service went live the latest IronRuby source was downloaded from svn, built and deployed. It was a sinking feeling when the same memory leak was observed on the sipsorcery Application Server Process.

At that point I decided I’d have to bite the bullet and dive into the IronRuby code and see if I could find the bug responsbile for the leak. I’d avoided this in the past as I prefer to focus on adding new features or sorting out bugs in the sipsorcery code base rather than getting distracted on length forays into the 3rd party libraries.

The dive into IronRuby was launched and after 3 days of testing and scouring the code it ended up looking like the problem was not a memory leak being caused by a bug but instead was being caused by the use of Thread.Abort to cancel the dialplan scripts once a call was answered. This was a good discovery to make and at the time I thought the solution must be just around the corner, I was wrong.

Since Thread.Abort was causing the problem I had to look for an alternative way to ending the dialplan scripts. IronRuby runs on top of the Microsoft Dynamic Language Runtime so I started looking over the documentation for that and found a reference to “Interrupt Execution” which is exactly what was needed to halt the answered or long running scripts. This was another false dawn as that feature is not yet implemented in the DLR and is not on the current roadmap.

Being optimistic I thought I’d now delve into the DLR code and see if I could implement Interrupt Execution myself. Unsuprisingly it ended up being a lot more difficult than I thought but eventually I came up with a very crude mechanism where I would keep track of which script interpreters belonged to which dialplan and would then halt their execution once a call was answered. Initially that seemed to work well but then I noticed the interpreters seemed to be getting shared between multiple dialplans and halting the execution of one dialplan could end up halting another one that was running at the same time. This would also explain some behaviour that people were posting on the mysipswitch forums.

So that approach had to go and I have removed the code that was shutting down the interpreters. And…there is no And… that’s where things are right now. Currently there is no nice tidy mechanism to halt a dialplan after it has been answered. I have put a check into the sys.Dial command so that it will not run after the call has been answered but subsequent script commands will continue to be run until the end of the script.

That’s not the end of the World but it’s not ideal and also doesn’t cover the case where a user accidentally puts their script into an infinite loop. So it’s back to the drawing board and in the meantime you should check that your dialplan is ok to execute the commands after a call gets answered, most should be.

Sigh…

Aaron

One of the biggest enhancements in the sipsorcery service compared to the previous mysipswitch service is the number of different ways of processing incoming calls. The drivers for the enhancements were requests by various mysipswitch users for alternative or more flexible ways to process incoming calls. At this stage I believe pretty much every request in the area has been accommodated but unfortunately a side effect has been a number of sipsorcery users having difficulty getting their calls working as they expect. The intention of this post is to clarify the mechanisms available for incoming calls and how they work.

Before delving into the incoming call processing mechanisms one point that needs to be clarified is the mysipswitch concept of SIP accounts and Extensions. An extension with mysipswitch was a SIP URI (e.g. sip:joe.bloggs@mysipswitch.com) that could be handed out and that was stored in a separate database table to the SIP accounts and that was looked up if no matching SIP account was found. Extensions were very much a bolt on to mysipswitch to cope with the fact that only a single SIP account could be created oer mysipswitch user. With sipsorcery the single SIP account restriction is gone and with it the need to have a separate concept of an extension. Instead there is now the option to create a SIP account that can be specified as incoming only and which will act the same as the mysipwitch Extensions.

Add incoming only SIP account

Add incoming only SIP account

Now that that’s clarified onto the different options for processing incoming calls.

  • Direct to registered bindings,
  • In Dial Plan,
  • Suffix matching.

Direct

The Direct option is the default option and is the simplest. It works by forwarding any incoming calls to a SIP account to any currently registered bindings for that SIP account. To use this option the only thing required is that the In Dial Plan setting on the SIP account is left blank. As an example if I were to create a SIP account called joe.bloggs@sipsorcery.comand set up one of my SIP Providers to forward calls to sip:joe.bloggs@sipsorcery.com then to receive those calls all I need to do is register my SIP device with sipsorcery.com with a username of joe.bloggs. Up to 10 SIP devices can register with the same username and when an incoming call comes in they will all ring with the first one that picks up getting the call.

In Dial Plan

The In Dial plan is as the name suggests and the incoming call will be forwarded to the dialplan specified in the In Dial Plan setting for processing. Within the dialplan the call can be processed in any manner desired. An example of an in dialplan that accomplishes exactly the same as the previous Direct option for incoming call processing is shown below.

sys.Log("dialplan defaultin starting...")
sys.Log("call from #{req.Header.From.FromURI.ToString()} to #{req.URI.User}.")
sys.Dial("myextension@local")

In the above dialplan an incoming call to sip:myextension@sipsorcery.com will be sent to the defaultin dialplan for processing. Once there it will print the two log messages and then the call will be forwarded to all the current bindings registered for the myextension@sipsorcery.com SIP Account, i.e. exactly the same behaviour as the Direct option.

Suffix Matching

Another mechanism that was cobbled on, in the same way extensions were, to mysipswitch was the ability to process calls that had an arbitrary suffix with a prefix matching an existing SIP account username. By popular demand the same behaviour has been added to sipsorcery.

As an example of how it works if I owned a SIP account called myextension@sipsorcery.comthen incoming calls to the following SIP URIs will also be accepted and processed in exactly the same way as myextension:

  • sip:x.myextension@sipsorcery.com
  • sip:1234.myextension@sipsorcery.com
  • sip:1234.5678.abcd.myextension@sipsorcery.com

The crucial factor is all the URIs end with .myextension@sipsorcery.com note the ‘.’ it is the separator between the suffix and the SIP Acount name.

Calls to other sipsorcery accounts from within a dialplan

With sipsorcery if you placed a call to another sipsorcery SIP Account within a dialplan then that call will honour the In Dial Plan setting on the called SIP Account. For example if I use the forward below in my dialplan:

sys.Dial("myextension@local")

If the myextension SIP account has an In Dial Plan set then the call will now be sent to that dialplan for processing rather than directly to myextension’s registered bindings. If the In Dial Plan is not set then the Dial command will result in the call being forwarded directly to myextension’s registered bindings.

This is a powerful and potentially dangerous feature since a chain of calls to local accounts or two accounts dialling each could result in a large number of dialplans being simultaneously processed. It is requested that suitable precautions be taken to avoid that happening for anyone that wished to make use of this feature.

One additional point when calling local accounts within a dialplan is that if a forward is attempted to a SIP account from the same dialplan that is configured on its In Dial Plan setting then the dialplan will not be used, doing so would create an infinite loop, and instead the registered bindings will be used.

And Finally…

it’s not directly related to the incoming call processing mechanisms but if you find that calls are not reliably getting through to your SIP device from the sipsorcery server then more than likely your NAT is timing out the connection. The best way around that is to check your device for a NAT keep-alive option and activate it. NAT keep-alives work by sending a small packet every 15 to 30s from a SIP user agent to the server to keep the connection alive through the router. If there is no option on your user agent then you can select the option on the SIP Account at the sipsorcery end. This does the same thing in reverse. It’s not quite as good as NATs will respect traffic originating from the private side better than traffic originating from the public side but it should still help.

In addition the expiry interval on your user agent can be reduced. The sipsorcery SIP registrar uses a default register expiry of 113s, sometimes reducing that to 60s can improve the durability of the connection between your user agent and the sipsorcery server through your NAT.

And that’s it, please feel free to post comments with questions for any clarifications on the subject of incoming call processing but I would request thatyou not post comments with general support questions.