Occasional posts about VoIP, SIP, WebRTC and Bitcoin.
It’s now been 3 weeks since the Isolated Process dial plan processing mechanism was put in place on the sipsorcery service. The news on it is good: a few tweaks were required in the first couple of weeks, mostly to stop some users initiating 20+ simultaneous executions of their dialplans, but in the last week no software updates or restarts have been needed. During that time the sipsorcery application server, which processes the dial plan executions and has been the trouble spot, operated smoothly with no issues.
As discussed ad nauseam in the past, the root cause of the reliability issue on the services is a memory leak either in the Dynamic Language Runtime (DLR) or in the integration between sipsorcery and the DLR. The solution has been to isolate the processing of the dialplans in separate processes and periodically recycle those processes.
I now feel pretty comfortable about the reliability of the sipsorcery application server and am reasonably confident that a solution to the instability issue that has plagued mysipswitch and sipsorcery has been found, at least for sipsorcery. As mentioned previously, the mysipswitch service cannot easily be updated anymore since the code has diverged significantly since its last upgrade in November of last year. I would now recommend that people migrate from mysipswitch to sipsorcery for greater reliability. There were two cases where the mysipswitch service needed to be restarted in the last week, one due to the “Long Running Dialplan” issue and one due to a failed automated restart. On average mysipswitch needs one restart a week. If the restart happens to coincide with a time when I or Guillaume can access the server, which is when we are not asleep and, in my case, not at work, it’s fine. If it’s outside those times it can be up to 8 hours before the service comes back.
Update: Of course no sooner had I posted about stability than there was a problem. Approximately 5 hours after posting the above, dial plan processing on the Primary App Server Worker failed, with calls receiving the “Long Running Dialplan” log message. The memory utilisation of the App Server was low, around 120MB, and the process was responding normally; if it had not been, the Call Dispatcher process would have killed and recycled it. What was failing was script execution by the DLR. This provides some new information, and it now looks like there are two separate issues with dialplan processing. One is a memory leak when a process continuously executes DLR scripts. The second is a bug in the DLR that causes it to stop processing scripts altogether, possibly the result of an exception/stack overflow in a script. The memory leak issue has been resolved by recycling the App Server Workers when they reach 150MB. An additional mechanism is now needed to recycle the process if script executions fail.
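The two recycling triggers can be sketched as a single health check. This is a minimal illustration only, in Ruby for readability; the names (`WorkerMonitor`, `probe_script_ok?`) are hypothetical and the real dispatcher is part of the sipsorcery C# code base.

```ruby
# Illustrative watchdog for an App Server Worker. A worker gets recycled
# either when it crosses the memory threshold or when a probe script
# execution fails, since the process can look healthy (low memory,
# responsive) while the DLR has stopped executing scripts altogether.
MEMORY_LIMIT_MB = 150

class WorkerMonitor
  def initialize(worker)
    @worker = worker
  end

  # Returns a symbol naming why the worker should be recycled, or nil if healthy.
  def recycle_reason
    return :memory_limit if @worker.memory_mb >= MEMORY_LIMIT_MB
    return :script_failure unless @worker.probe_script_ok?
    nil
  end
end
```

The point of the second check is exactly the failure mode described above: low memory and a responsive process, yet no scripts executing.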
As some people have noticed, there have been a few “improvements” going on with the sipsorcery service in the last few days. The main thrust of the improvements has been to move closer to a solution to the “long running dialplan/memory leak” issue that has plagued the mysipswitch and, more recently, sipsorcery services. The cause (and consequences) of the problem are discussed widely on the Forums site and also briefly in this Blog Post.
At this point I’m hopeful that the latest changes will finally solve the issue, however I’ll wait for at least a week of stable behaviour before jumping to any substantial conclusions. Very briefly, the new approach has been to give up on attempting to isolate the cause of the memory leak somewhere in the interaction between sipsorcery, the DLR and IronRuby, and instead accept the leak but confine it to a new process, recycling that process once it hits a certain memory utilisation.
In theory the idea doesn’t sound overly complex, but it meant another round of pulling apart functions that previously relied on being all within the same process. I’m getting fairly used to that at this stage though. The original mysipswitch service was all wrapped up nice and tightly in a single process. It was when the memory leak first cropped up that the extraction of different mysipswitch functions into separate processes started, and the single-process application has now evolved into the system shown below.
Larger deployment diagram available here.
The trickiest thing ended up being processing calls forwarded to another sipsorcery user that need to use the called user’s dialplan, i.e. one call generating two or more dialplan instances. To be honest, coding this up seriously hurt my brain. It starts off ok: one call arrives and drops into the dialplan, that dialplan calls a second sipsorcery user who has specified that incoming calls go via one of their dialplans, and the second dialplan calls out to a 3rd party SIP Provider. So far so good. But in order to generate the correct call detail records (CDRs), so that both the caller and called users have an accurate record, the call between the two dialplans has to generate an extra two CDRs, and to do that it means creating two additional SIP transactions. So that’s now 2 dialplan instances, 4 SIP calls and 4 SIP transactions. But then, instead of calling an external provider, the second dialplan could call another sipsorcery user who also uses an incoming dialplan. AHHHH -> Brain Pain. The flowchart of the whole thing gives a bit of an idea of the complexity.
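The counting above generalises: every extra chained incoming dialplan adds another inter-dialplan hop, and each hop adds a pair of calls, transactions and CDRs. A tiny back-of-the-envelope sketch, assuming (as the example above suggests) one CDR per SIP call leg:

```ruby
# Tally SIP calls/transactions/CDRs for a chain of dialplan instances,
# following the counts in the post: the caller's incoming leg plus the
# final outgoing leg make 2 calls, and every hop between two sipsorcery
# dialplans adds an extra pair of calls/transactions/CDRs.
def chained_call_counts(dialplan_instances)
  hops = dialplan_instances - 1
  legs = 2 + 2 * hops
  { dialplan_instances: dialplan_instances, sip_calls: legs,
    sip_transactions: legs, cdrs: legs }
end
```

For 2 dialplan instances this gives the 4 SIP calls and 4 SIP transactions described above; a third chained incoming dialplan would push it to 6 of each.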
Larger state diagram available here.
So the upshot of all that is that the new isolated process mechanism is now in place and, if all goes according to plan, there will be no more outages caused by the dialplan processing memory leak (that’s a “hopefully” no more outages, not a guarantee, and only for the memory leak issue). At the moment the dialplan worker process is set to recycle once it hits a working memory set of 150MB, at which point the secondary worker process takes over until the primary one has completed the recycle. So far it’s working very well. There were a couple of minor hiccups today when the update went in. One caused the “Dial plan script engine was overloaded” error message, but that was quickly resolved.
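The primary/secondary takeover during a recycle can be sketched as a simple selection rule. A minimal illustration under assumed names (the real implementation is in the sipsorcery C# code base, and worker state is of course richer than a hash):

```ruby
# Decide which worker should receive the next call. A primary that has
# crossed the memory threshold is marked as recycling and calls are sent
# to the secondary until the primary's recycle completes.
MEMORY_RECYCLE_LIMIT_MB = 150

def pick_worker(primary, secondary)
  primary[:recycling] = true if primary[:memory_mb] >= MEMORY_RECYCLE_LIMIT_MB
  primary[:recycling] ? secondary : primary
end
```

The design point is that recycling is never done in the path of a live call: the threshold only flips which process new calls are dispatched to.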
Apart from that a few other minor changes and miscellaneous points that have sprung to mind are:
Phew! That will do for now.
When I placed my sipsorcery test call this morning I got the fast busy tone. It wasn’t a shock given the issues that still exist with the dialplan processing and I assumed it must have been a memory or thread leak that got out of hand overnight.
When I logged into the sipsorcery server and checked the App Server process, its resource utilisation was low, which was a surprise. Checking the logs revealed an error message connecting to the MySQL database stating that the database already had too many connections.
That’s a new one. The issue was fixed by restarting the App Server so it was definitely that process that had a large number of connections open. The question is what opened them all? There is a new dialplan application that allows database queries but it should close each connection after use. This is another issue that will require some more investigation.
Update: Recurrence of this issue at around 25 Jul 2009 2200 UTC. I had left the MySQL admin console open so I could check which connections were open, however it seems to have had its own connection knocked off in the meantime, so I’m still none the wiser about the cause. I will now try restricting the number of connections allowed on each of the different connection strings.
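For the dialplan database application mentioned above, the discipline it is meant to follow is open, query, and always close, even when the query raises. A small sketch of that pattern (illustrative only; `connector` stands in for whatever the real driver’s connect call is):

```ruby
# Run a block with a freshly opened database connection and guarantee
# the connection is closed afterwards, even if the block raises. An
# ensure block is what stops an exception in a dialplan query from
# leaking a connection and eventually exhausting the server's limit.
def with_db_connection(connector)
  conn = connector.call
  yield conn
ensure
  conn.close if conn
end
```

A leak like the one above typically means some code path returns or raises between open and close; funnelling every query through a helper like this removes that class of bug.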
To date every attempt to solve, or even identify, where the memory leak is coming from when processing Ruby dialplans has met with failure. That’s despite a concerted effort analysing and tracing the dialplan processing, including the purchase of a relatively expensive memory profiling tool. The latest attempt initially seemed to be going somewhere: removing the Thread.Aborts, which seemed to be leaving something in the scripting engine in a bad state and leading to the leak, and instead interrupting the execution of the engine by setting a variable.
The interrupt execution approach seemed to work fine both in testing and in the first few days on sipsorcery.com, but then the leak re-appeared and, even worse, some weird things started happening where log messages from one dialplan execution appeared in another dialplan’s execution. That’s bad, and my guess is that within the internals of the DLR (Dynamic Language Runtime) the script interpreters are being cached, and interrupting their execution left some internal state awry.
It’s somewhat difficult to troubleshoot and test the dialplan execution because once the script starts getting interpreted there is no easy way to debug through it. The lines of script get converted into lambda methods and executed as a series of instructions on the interpreter. It’s a bit like debugging a C application in assembly mode, i.e. next to impossible for all but the most trivial application.
I’ve put together a timeline of the issue so I can one day look back on it with fond memories when the issue is eventually solved.
I just did my customary sipsorcery test call after getting up this morning and it failed, damn. I Terminal Serviced into the sipsorcery server and the App Server was running with around 530MB of memory (way too much, indicating a leak), 55 threads and 22% CPU. The memory and the threads are nothing new, and they used to go a lot higher. However the CPU has never flatlined like that before.
I saved the logs and will trawl through them after work. My suspicion is that, having removed the mechanisms that interrupt the execution of dialplan scripts (to try and get rid of the memory leak and the inconsistent dialplan processing), a long running dialplan came along that did need to be cancelled. Probably it was someone inadvertently putting their script into an infinite loop. To be honest I expected it to happen, but was hoping for it to be later rather than sooner.
It’s back to the drawing board over this weekend to try and come up with a different solution for dial plan execution interrupt.
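One candidate for that different solution is cooperative cancellation: instead of aborting the thread from outside, the monitor sets a flag (or a deadline expires) and the engine checks it between script instructions, raising cleanly inside the script’s own thread. A minimal sketch under assumed names; this is not the mechanism sipsorcery ended up with, just one shape the interrupt could take:

```ruby
# Cooperative alternative to Thread.Abort for stopping a dialplan script.
class DialPlanRun
  def initialize(budget_seconds)
    @cancelled = false
    @deadline = Time.now + budget_seconds
  end

  # Called by the monitoring thread, e.g. once the call is answered.
  def cancel!
    @cancelled = true
  end

  # Called by the engine before executing each script instruction, so an
  # infinite loop is caught at the next instruction boundary rather than
  # needing the thread to be killed from outside.
  def check!
    if @cancelled || Time.now > @deadline
      raise "Long running dialplan script was terminated"
    end
  end
end
```

The trade-off is that the check has to be woven into instruction dispatch, which for the DLR means touching the interpreter internals again, but it avoids the corrupted-engine state that Thread.Abort appeared to cause.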
Sigh, sigh, sigh….
For over a year now the mysipswitch service has suffered from instability due to an issue isolated to the processing of Ruby dialplan scripts.
At the time the problem first came to my attention it took the best part of a month with lots of memory profiling to nail the issue down and then put in improvements to contain it. The symptoms of the problem were large memory consumption followed ultimately by the script processing failing to initialise and therefore going nowhere before being cancelled by the monitoring thread with the dreaded “Long running dialplan script was terminated” message.
The issue was mitigated by separating out the mysipswitch functions into different processes and scheduling automatic restarts of the Application Server process responsible for the dialplans. This bought another 6 or so months of reliable use of mysipswitch, but eventually, as usage increased, a daily auto-restart was not enough and more frequent restarts were required. Unfortunately that did not work well either, as sometimes the Windows Service would fail to start properly, so the restarts had to be left at every 24 hours.
The hope was that the problem in the IronRuby library would be fixed in the interim and would solve the issue on mysipswitch when it evolved to sipsorcery.
When the sipsorcery service went live the latest IronRuby source was downloaded from svn, built and deployed. It was a sinking feeling when the same memory leak was observed on the sipsorcery Application Server Process.
At that point I decided I’d have to bite the bullet and dive into the IronRuby code to see if I could find the bug responsible for the leak. I’d avoided this in the past as I prefer to focus on adding new features or sorting out bugs in the sipsorcery code base rather than getting distracted on lengthy forays into 3rd party libraries.
The dive into IronRuby was launched, and after 3 days of testing and scouring the code it ended up looking like the problem was not a memory leak caused by a bug, but was instead being caused by the use of Thread.Abort to cancel the dialplan scripts once a call was answered. This was a good discovery to make and at the time I thought the solution must be just around the corner. I was wrong.
Since Thread.Abort was causing the problem I had to look for an alternative way to end the dialplan scripts. IronRuby runs on top of the Microsoft Dynamic Language Runtime, so I started looking over the documentation for that and found a reference to “Interrupt Execution”, which is exactly what was needed to halt the answered or long running scripts. This was another false dawn, as that feature is not yet implemented in the DLR and is not on the current roadmap.
Being optimistic, I thought I’d delve into the DLR code and see if I could implement Interrupt Execution myself. Unsurprisingly it ended up being a lot more difficult than I thought, but eventually I came up with a very crude mechanism where I would keep track of which script interpreters belonged to which dialplan and would then halt their execution once a call was answered. Initially that seemed to work well, but then I noticed the interpreters seemed to be getting shared between multiple dialplans, and halting the execution of one dialplan could end up halting another one that was running at the same time. This would also explain some behaviour that people were posting about on the mysipswitch forums.
So that approach had to go and I have removed the code that was shutting down the interpreters. And…there is no And… that’s where things are right now. Currently there is no nice tidy mechanism to halt a dialplan after it has been answered. I have put a check into the sys.Dial command so that it will not run after the call has been answered but subsequent script commands will continue to be run until the end of the script.
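The sys.Dial guard described above boils down to a one-line check on an answered flag. A sketch, in Ruby for readability; the names here are illustrative and the real sys.Dial lives in the C# dialplan API:

```ruby
# Once the call has been answered, further dial attempts become no-ops
# while the rest of the script continues to run to the end.
class DialPlanContext
  def initialize
    @answered = false
  end

  def dial(destination)
    return :skipped if @answered   # post-answer guard
    @answered = true               # for the sketch, assume this attempt answers
    :dialled
  end
end
```

This matches the current behaviour: the guard stops new calls being placed after answer, but it does nothing about non-Dial commands or an infinite loop, which is why an interrupt mechanism is still needed.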
That’s not the end of the world, but it’s not ideal and also doesn’t cover the case where a user accidentally puts their script into an infinite loop. So it’s back to the drawing board, and in the meantime you should check that your dialplan is ok to execute the commands after a call gets answered; most should be.