It’s now been 3 weeks since the Isolated Process dial plan processing mechanism was put in place on the sipsorcery service. The news is good: a few tweaks were required in the first couple of weeks, mostly to prevent some users initiating 20+ simultaneous executions of their dialplans, but in the last week no software updates or restarts have been required. During that time the sipsorcery application server, which processes the dial plan executions and has been the trouble spot, operated smoothly with no issues.
As discussed ad nauseam in the past, the root cause of the reliability issue on the services is a memory leak either in the Dynamic Language Runtime (DLR) or in the integration between sipsorcery and the DLR. The solution has been to isolate the processing of the dialplans in separate processes and periodically recycle those processes.
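To illustrate the isolation idea, here is a minimal Python sketch; it is not the sipsorcery implementation. It simplifies the approach by starting a fresh child process for every script instead of periodically recycling long-lived workers, and the `run_isolated` name and the timeout value are assumptions made for the example.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(script_source: str, timeout_s: float = 5.0) -> Tuple[bool, str]:
    """Run a script in a separate interpreter process (hypothetical helper).

    Anything the script leaks or corrupts dies with the child process,
    so the parent process is unaffected.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script_source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False, "timed out"
    return proc.returncode == 0, proc.stdout.strip()
```

A process per execution is the bluntest form of recycling. Keeping a worker alive across many executions and replacing it periodically trades some of that safety for lower start-up cost.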
I now feel pretty comfortable about the reliability of the sipsorcery application server and am reasonably confident that a solution to the instability issue that has plagued mysipswitch and sipsorcery has been found, at least for sipsorcery. As also mentioned previously, the mysipswitch service cannot be easily updated anymore since the code has diverged significantly since its last upgrade in November of last year. I would now recommend that people migrate from mysipswitch to sipsorcery for greater reliability. There were two cases where the mysipswitch service needed to be restarted in the last week due to the “Long Running Dialplan” issue and a failed automated restart. On average the mysipswitch service needs one restart a week. If the restart happens to coincide with times when Guillaume or I are able to access the server, which is when we are not asleep and in my case at work, it’s fine. If it’s outside those times it can take up to 8 hours before the service is restarted.
Update: Of course no sooner had I posted about stability than there was a problem. Approximately 5 hours after posting the above, the dial plan processing on the Primary App Server Worker failed, with calls receiving the “Long Running Dialplan” log message. The memory utilisation of the App Server was low, around 120MB, and the process was responding normally; had it not been, the Call Dispatcher process would have killed and recycled it. The thing that was failing was script executions by the DLR. This provides some new information and it now looks like there are two separate issues with dialplan processing. One is a memory leak when a process continuously executes DLR scripts. The second is a bug in the DLR that causes it to stop processing scripts altogether, possibly as the result of an exception or stack overflow in a script. The memory leak issue has been resolved by recycling the App Server Workers when they reach 150MB. An additional mechanism is now needed to recycle the process if script executions fail.
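The two recycle triggers could be combined along the following lines. This is a Python sketch under stated assumptions, not the actual Call Dispatcher logic: the 150MB limit comes from the paragraph above, while `WorkerHealth`, `recycle_reason` and the limit of three consecutive failures are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MEMORY_LIMIT_MB = 150         # recycle threshold mentioned in the post
MAX_CONSECUTIVE_FAILURES = 3  # assumed value, chosen for illustration only


@dataclass
class WorkerHealth:
    """Health figures a dispatcher might track per worker (hypothetical)."""
    memory_mb: float
    consecutive_script_failures: int = 0

    def record_script_result(self, succeeded: bool) -> None:
        # A single success shows the script engine is still processing.
        if succeeded:
            self.consecutive_script_failures = 0
        else:
            self.consecutive_script_failures += 1


def recycle_reason(health: WorkerHealth) -> Optional[str]:
    """Return why the worker should be recycled, or None if it looks healthy."""
    if health.memory_mb >= MEMORY_LIMIT_MB:
        return "memory"
    if health.consecutive_script_failures >= MAX_CONSECUTIVE_FAILURES:
        return "script failures"
    return None
```

Counting consecutive failures rather than total failures means one bad user script does not take a worker down, while a worker whose script engine has stopped altogether is caught after a few calls, even when its memory use looks normal as it did at 120MB.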