Occassional posts about VoIP, SIP, WebRTC and Bitcoin.
For over a year now the mysipswitch service has suffered from instability due to an issue isolated to the processing of Ruby dialplan scripts.
At the time the problem first came to my attention it took the best part of a month with lots of memory profiling to nail the issue down and then put in improvements to contain it. The symptoms of the problem were large memory consumption followed ultimately by the script processing failing to initialise and therefore going nowhere before being cancelled by the monitoring thread with the dreaded “Long running dialplan script was terminated” message.
The issue was mitigated against by separating out the mysipswitch functions into different processes and scheduling automatic restarts of the Application Server process that was responsible for the dialplans. This bought another 6 or so months of reliable use of mysipswitch but eventually as usage increased a daily auto-restart was not enough and more frequent restarts were required. Unfortunately that did not work well either as sometimes the Windows Service would fail to start properly and the restarts had to be left at every 24 hours.
The hope was that the problem in the IronRuby library would be fixed in the interim and would solve the issue on mysipswitch when it evolved to sipsorcery.
When the sipsorcery service went live the latest IronRuby source was downloaded from svn, built and deployed. It was a sinking feeling when the same memory leak was observed on the sipsorcery Application Server Process.
At that point I decided I’d have to bite the bullet and dive into the IronRuby code and see if I could find the bug responsbile for the leak. I’d avoided this in the past as I prefer to focus on adding new features or sorting out bugs in the sipsorcery code base rather than getting distracted on length forays into the 3rd party libraries.
The dive into IronRuby was launched and after 3 days of testing and scouring the code it ended up looking like the problem was not a memory leak being caused by a bug but instead was being caused by the use of Thread.Abort to cancel the dialplan scripts once a call was answered. This was a good discovery to make and at the time I thought the solution must be just around the corner, I was wrong.
Since Thread.Abort was causing the problem I had to look for an alternative way to ending the dialplan scripts. IronRuby runs on top of the Microsoft Dynamic Language Runtime so I started looking over the documentation for that and found a reference to “Interrupt Execution” which is exactly what was needed to halt the answered or long running scripts. This was another false dawn as that feature is not yet implemented in the DLR and is not on the current roadmap.
Being optimistic I thought I’d now delve into the DLR code and see if I could implement Interrupt Execution myself. Unsuprisingly it ended up being a lot more difficult than I thought but eventually I came up with a very crude mechanism where I would keep track of which script interpreters belonged to which dialplan and would then halt their execution once a call was answered. Initially that seemed to work well but then I noticed the interpreters seemed to be getting shared between multiple dialplans and halting the execution of one dialplan could end up halting another one that was running at the same time. This would also explain some behaviour that people were posting on the mysipswitch forums.
So that approach had to go and I have removed the code that was shutting down the interpreters. And…there is no And… that’s where things are right now. Currently there is no nice tidy mechanism to halt a dialplan after it has been answered. I have put a check into the sys.Dial command so that it will not run after the call has been answered but subsequent script commands will continue to be run until the end of the script.
That’s not the end of the World but it’s not ideal and also doesn’t cover the case where a user accidentally puts their script into an infinite loop. So it’s back to the drawing board and in the meantime you should check that your dialplan is ok to execute the commands after a call gets answered, most should be.