The SIPSorcery server experienced a 90-minute outage on the 15th of August at approximately 1700 PST. The first purpose of this blog post is to apologise to everyone affected by the outage. The second is to summarise the technical information that has been gathered about its cause.
The server event logs covering the outage contained numerous entries like the one below.
Event ID: 333
An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in, or write out, or flush, one of the files that contain the system’s image of the Registry.
After lengthy research it appears that this error message can be caused by a variety of error conditions on a Windows server, but the predominant cause is exhaustion of a resource in the underlying operating system. The resource could be pooled memory, handles, the registry size limit or something else. So far the only way found to recover from the issue is to reboot the server, which is what was done for this outage.
One question is why this issue would crop up now, when the SIPSorcery server has been running in the same configuration for over a year. The answer may lie in the fact that the server hardware was recently upgraded and, at the same time, the MySQL database was updated from version 5.1 to the latest 5.5 version. It could be that the different hardware configuration, the new MySQL software, a combination of the two or something else entirely has caused the issue to appear.
The SIPSorcery server is monitored daily for performance characteristics such as CPU utilisation, memory utilisation, thread counts, SIP traffic, disk IO and more. However, in this case no early warning signs were recognised. At this point the prime suspect is the MySQL service, and more detailed monitoring has now been put in place to track the resource usage of the MySQL process.
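As a rough illustration of the kind of tracking described above, the sketch below flags a process whose resource counter (for example its handle count) keeps climbing across a window of periodic readings. The function name, the window size and the growth threshold are assumptions made for illustration only; this is not the monitoring actually running on the SIPSorcery server.

```python
from typing import Sequence


def shows_sustained_growth(samples: Sequence[int], window: int = 6,
                           min_growth: float = 0.2) -> bool:
    """Return True if the last `window` samples never decrease and the
    final sample is at least `min_growth` (as a fraction) above the first.

    `samples` is a hypothetical series of periodic readings for one
    process, e.g. handle counts collected once an hour.
    """
    if window < 2 or len(samples) < window:
        return False  # not enough data to judge a trend
    recent = samples[-window:]
    if recent[0] <= 0:
        return False  # no usable baseline to measure growth against
    never_decreases = all(b >= a for a, b in zip(recent, recent[1:]))
    growth = (recent[-1] - recent[0]) / recent[0]
    return never_decreases and growth >= min_growth
```

A check like this would only be a first filter. A counter that rises steadily without ever falling back is the typical signature of a leak, but a busy period can look the same over a short window, so the window and threshold would need tuning against real readings.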
The short-term goal is to identify the cause of the issue, whether it is related to MySQL or not, and fix it. The medium-term goal is to look into adding hardware redundancy to the SIPSorcery service. There will always be issues with server hardware, operating systems etc., and a single-server system will always be vulnerable. Up until this point, with the SIPSorcery service operating largely as a free offering, it was not viable to add additional hardware to the platform. Now that the service is generating some revenue from Premium accounts there is scope to look at enhancing the platform. I will keep the blog updated with developments as they arise.
I apologise again to any users affected by today’s outage, and by a similar but shorter one on the 8th of August, and would like to assure users that reliability is the top priority of the SIPSorcery service and the focus of the majority of my efforts. There are also real-time status updates on the availability of the SIPSorcery service on this blog site at Status Graphs. A red line on the monitoring graphs indicates that a simulated call request to a SIPSorcery application server timed out after 15s or did not get an appropriate response. There are now 3 remote monitoring servers sending probes to the SIPSorcery server, and the fourth graph is for a monitoring service that runs on the server itself in order to distinguish between network and service problems. In the event that an outage does occur, I always endeavour to issue updates as frequently as possible on the SIPSorcery Twitter account.
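To make the reasoning behind the four status graphs concrete, here is a minimal sketch of how the results of the remote probes and the on-server probe could be combined to tell a network problem from a service problem. Only the 15s timeout and the three-remote-plus-one-local arrangement come from the description above; the function names, the result labels and the decision rules are assumptions for illustration, not the actual monitoring code.

```python
from typing import Optional, Sequence

PROBE_TIMEOUT_SECONDS = 15.0  # from the post: a probe fails after 15s


def probe_ok(response_seconds: Optional[float],
             response_valid: bool = True) -> bool:
    """A probe succeeds only if it was answered within the timeout with an
    appropriate response. None means no response was received at all."""
    return (response_seconds is not None
            and response_seconds <= PROBE_TIMEOUT_SECONDS
            and response_valid)


def classify(remote_ok: Sequence[bool], local_ok: bool) -> str:
    """Combine remote and on-server probe results (hypothetical rules)."""
    if not local_ok:
        # The service does not even answer a probe from its own machine.
        return "service problem"
    if all(remote_ok):
        return "healthy"
    if not any(remote_ok):
        # The service answers locally but nobody outside can reach it.
        return "network problem"
    # Only some remote vantage points fail: likely a routing issue.
    return "partial network problem"
```

For example, `classify([False, False, False], True)` returns `"network problem"`: all three remote probes failed while the on-server probe still succeeded, which points at connectivity rather than at the SIPSorcery service itself.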