Call-Dispatcher / Isolated-Process Dialplan Processing

To date, every attempt to identify or fix the source of the memory leak that occurs when processing Ruby dialplans has met with failure. That’s despite a concerted effort analysing and tracing the dialplan processing, including the purchase of a relatively expensive memory profiling tool. The latest attempt initially seemed to be going somewhere: I removed the Thread.Abort calls, which seemed to be leaving something in the scripting engine in a bad state and leading to the leak, and instead interrupted the execution of the engine by setting a variable.
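
As a rough illustration of the "set a variable" idea, here is a minimal Python sketch of cooperative interruption. It is not the DLR's interrupt mechanism: `run_dialplan`, `stop_requested` and the step functions are hypothetical names, and the only point is that the executing loop checks a flag between steps instead of having its thread aborted from outside.

```python
import threading

def run_dialplan(steps, stop_requested):
    """Run steps in order, checking the stop flag before each one."""
    completed = []
    for step in steps:
        if stop_requested.is_set():
            break  # leave the loop cleanly instead of being aborted mid-step
        completed.append(step())
    return completed

stop = threading.Event()  # the "variable" that requests an interrupt

def dial():
    return "dial"

def answered():
    stop.set()  # call answered: ask the loop to stop before the next step
    return "answered"

def fallback():
    return "fallback"

result = run_dialplan([dial, answered, fallback], stop)
```

The `fallback` step never runs because the flag is set before the loop reaches it, and the loop always stops at a step boundary.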

The interrupt execution approach seemed to work fine both in testing and for the first few days on sipsorcery.com, but then the leak re-appeared and, even worse, some weird things started happening where log messages from one dialplan execution appeared in another dialplan’s execution. That’s bad, and my guess is that within the internals of the DLR (Dynamic Language Runtime) the script interpreters are being cached and interrupting their execution left some internal state awry.

It’s somewhat difficult to troubleshoot and test the dialplan execution because once the script starts getting interpreted then there is no easy way to debug through it. The lines of script get converted into lambda methods and executed as a series of instructions on the interpreter. It’s a bit like debugging a C application in assembly mode, i.e. next to impossible for all but the most trivial application.

I’ve put together a timeline of the issue so I can one day look back on it with fond memories when the issue is eventually solved.

  • May/Jun 2008: Memory leak first appeared on mysipswitch.com (mysipswitch forums: Instability Jun 2008).
  • 6 Jul 2008: Isolated the leak to the dialplan processing and IronRuby engine (IronRuby forums: Memory Leak with Certain Script Exceptions).
  • 31 Jul 2008: Stage 1 of software upgrade to separate mysipswitch server agents (mysipswitch forums: Software Update 31 Jul 2008).
  • 6 Sep 2008: Stage 2 of software upgrade to separate mysipswitch server agents (mysipswitch forums: SIPSwitch Upgrade – 6 Sep 2008 and mysipswitch blog: Pear Skidding).
  • Sep 2008 to Jun 2009: sipsorcery.com upgrade from mysipswitch.com under heavy development. No further investigation into the dialplan processing leak was undertaken, in the hope that later versions of IronRuby, which is also under heavy development, would not have the same behaviour when interacting with sipsorcery.
  • 24 Jun 2009: sipsorcery.com went live.
  • 7 Jul 2009: Call volume increased on sipsorcery.com and memory leak behaviour in dialplan processing was observed. Undertook investigation of the leak, this time debugging into the IronRuby and DLR libraries.
  • 9 Jul 2009: Hypothesis that the Thread.Abort call used to halt completed dialplans was causing the leak. Delved into the DLR design and discovered “interrupt execution”, which is the theoretical design solution to the problem (IronRuby forums: Interrupt Execution and DLR forums: Interrupt Execution).
  • 11 Jul 2009: Upgraded sipsorcery.com to use an interrupt execution approach in the DLR library and removed the Thread.Abort calls.
  • 15 Jul 2009: Memory leak behaviour re-appeared on sipsorcery.com, coupled with crossover of dialplan log messages.
  • 16 Jul 2009: Removed the interrupt execution changes made to the Microsoft.Scripting (DLR) assembly, which means dialplans will not be terminated and will be left to run to completion.
  • 17 Jul 2009: Memory leak behaviour caused the app server to become unresponsive and stop processing calls.
  • 18 Jul 2009: Initial implementation work for the call-dispatcher/isolated-process approach to dialplan processing.
  • 23 Aug 2009: Dialplan processing failed (“Long Running Dialplan” message).
  1. Tuketu

    If I had to say, based on your descriptions, it sounds like a situation where non-thread-safe objects are being created on one thread and then manipulated on a different thread. These kinds of situations can be hard to track down and will manifest themselves in odd ways. But hey, I could be way off base.

  2. sipsorcery

    That’s certainly possible but I haven’t been able to find any evidence of those types of objects. I’ve hooked up three different memory profilers at various points and there are no exotic objects in the profile. The largest memory consumers are strings and byte arrays, both managed.

    I don’t know how well the memory profilers will work with the DLR though, as it does some pretty funky stuff when interpreting/compiling the scripts.

    In addition, I have now observed one case where the leak manifested even though the Thread.Abort mechanism wasn’t being used. So that puts a spanner in the works of the Thread.Abort leak theory.

  3. Tuketu

    Not sure if it’s at all practical (and it’s probably not), since I’m not versed with your software architecture, but if you could replace threads with processes (even as a temporary testing/debugging situation), you would gain two things: 1) Any inappropriate object access would be immediately apparent, since access to objects on different threads (now processes) would only work if appropriate marshalling/unmarshalling was in place. 2) Memory clean-up would be handled for you as part of process clean-up. Thus if a dialplan execution occurs on a new process, and if that process hangs, killing the process will obviate any memory leaks.

    Of course, this architecture is much more heavy weight, but if cpu (and working set size, not counting the leaks) is not a current limiting factor, what’s the concern?

    Again, probably not practical given architectural tie-ins to threading, but even as a thought-process, it can sometimes help focus on areas where inappropriate (incorrect) cross-thread memory/object references are occurring.
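
The process-per-execution idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not sipsorcery's C# implementation: `run_isolated` is a hypothetical helper, and a Python one-liner stands in for a Ruby dialplan. The script runs in its own interpreter process, and if it exceeds the timeout the process is killed, so the operating system reclaims whatever memory it held.

```python
import subprocess
import sys

def run_isolated(script, timeout_seconds):
    """Run a script in a separate interpreter process; kill it on timeout."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and waits for the child before re-raising,
        # so all memory the child held is released with the process.
        return ("killed", "")
    status = "ok" if completed.returncode == 0 else "failed"
    return (status, completed.stdout.strip())

normal = run_isolated("print('answered')", 30)
runaway = run_isolated("while True: pass", 1)
```

The trade-off is the one noted in the comment: starting a process per execution costs far more than starting a thread.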

  4. sipsorcery

    That’s exactly what I’m doing :). For the time being I’ve given up on tracking down the leak in the DLR and am going to instead create separate processes to do the dialplan processing. That’s where my “Call Dispatcher/Isolated Process” comes in. I’ll use a new dispatcher function to allocate out new calls to separate processes and they will do the dialplan processing. I’ve already got that working. What I still need to come up with is a nice way of monitoring the health of a dialplan processing process and to kill it off and start a new one when it starts leaking.
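
A dispatcher of this kind might look roughly like the Python sketch below. All names here (`Worker`, `CallDispatcher`, the `healthy` flag) are hypothetical, and in-memory objects stand in for real worker processes. Calls are handed out round-robin, and any worker flagged as unhealthy is replaced with a fresh one before the next call is allocated.

```python
import itertools

class Worker:
    """Stand-in for a dialplan-processing worker process (hypothetical)."""
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.healthy = True
        self.calls_handled = 0

    def process(self, call_id):
        self.calls_handled += 1
        return f"call {call_id} handled by worker {self.worker_id}"

class CallDispatcher:
    """Allocates calls round-robin and recycles unhealthy workers."""
    def __init__(self, worker_count):
        self._ids = itertools.count(1)
        self.workers = [Worker(next(self._ids)) for _ in range(worker_count)]
        self._next = 0

    def dispatch(self, call_id):
        # Replace any worker flagged as unhealthy (e.g. leaking) with a new one.
        self.workers = [
            w if w.healthy else Worker(next(self._ids)) for w in self.workers
        ]
        worker = self.workers[self._next % len(self.workers)]
        self._next += 1
        return worker.process(call_id)

dispatcher = CallDispatcher(2)
first = dispatcher.dispatch("a")
second = dispatcher.dispatch("b")
dispatcher.workers[0].healthy = False  # a health monitor would set this
third = dispatcher.dispatch("c")
```

How the `healthy` flag gets set is the open question raised in the comment; the sketch assumes something external decides that.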

  5. Tuketu

    Ah, I didn’t get that from your blog account.

    Is each new phonecall a new process, or do processes handle more than one phonecall? Health monitoring: will a timeout value work, or is dialplan processing active for the entire length of a phonecall (which, of course, is of indeterminate length)? If you have access to the dialplan processing code on some level, could it periodically communicate a wellness condition (similar to Win32 service start health monitoring) so that you terminate the process if you have not received communication within a certain period of time? Could you replace some of the suspect Ruby script calls with a new function that communicates wellness and then invokes the original Ruby code?
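
The "communicate a wellness condition" suggestion amounts to a heartbeat watchdog. A minimal Python sketch follows, with hypothetical names (`HeartbeatMonitor`, `beat`, `overdue`) and timestamps passed in explicitly so the logic is easy to follow. Each worker reports in periodically, and any worker silent for longer than the timeout becomes a candidate for termination.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and reports overdue ones."""
    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def beat(self, worker_id, now):
        # Called whenever a worker communicates that it is still alive.
        self.last_seen[worker_id] = now

    def overdue(self, now):
        # Workers whose last heartbeat is older than the timeout.
        return sorted(
            worker_id
            for worker_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.beat("worker-1", now=0)
monitor.beat("worker-2", now=0)
monitor.beat("worker-1", now=8)    # worker-1 keeps reporting in
stale = monitor.overdue(now=12)    # worker-2 has been silent for 12 seconds
```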

    You’re using some cool technology, good luck with that bleeding edge.

  6. sipsorcery

    The dialplan only executes up until a call is answered, so executions are generally quite short, in the order of 1 to 10 seconds. At the moment all the dialplans are processed in a single app server process with each one on a different thread. With the new “dispatcher” design I’m working on I’ll have multiple app server “worker” processes that can each handle multiple dialplan executions.

    For the health monitoring I have added a pre-configured Ruby dialplan to the C# code and I can get the health monitor process to initiate a periodic dummy call to that dialplan. If it gets through the dialplan execution then a specific type of SIP response is generated letting the caller know everything is ok. I also add some custom SIP headers to report things like memory usage to the health monitor.
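
The monitor's side of that probe could be sketched as below, in Python rather than the C# of the real code. The specifics are assumptions: the comment does not say which SIP response or header names are used, so this sketch treats a 200 response as the "everything is ok" signal and uses a made-up `X-Memory-MB` header for the memory report.

```python
def parse_probe_response(raw):
    """Split a raw SIP response into its status code and a header dict."""
    lines = raw.strip().splitlines()
    status_code = int(lines[0].split()[1])  # e.g. "SIP/2.0 200 OK" -> 200
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers

def worker_is_healthy(raw, max_memory_mb):
    """Healthy means the probe dialplan completed and memory is under the limit."""
    status_code, headers = parse_probe_response(raw)
    # X-Memory-MB is a hypothetical custom header; a missing value is
    # treated as unhealthy by defaulting to infinity.
    memory_mb = float(headers.get("x-memory-mb", "inf"))
    return status_code == 200 and memory_mb <= max_memory_mb

probe_reply = "SIP/2.0 200 OK\r\nX-Memory-MB: 180\r\n\r\n"
within_limit = worker_is_healthy(probe_reply, max_memory_mb=500)
over_limit = worker_is_healthy(probe_reply, max_memory_mb=100)
```

A worker that fails the check, or that never answers the probe at all, would be the one to kill and replace.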

    There is indeed a lot of bleeding edge code in use. Apart from the stuff I have cobbled together myself, the DLR, IronRuby and DbLinq libraries would all fit into that category. There have been quite a few complaints about instability in the last 6 months but a lot of the time I’m actually surprised it is as stable as it is. Apart from the bleeding edge libraries you’ve also got users writing executable code on the fly, which is always a hairy situation.
