Occassional posts about VoIP, SIP, WebRTC and Bitcoin.
After a brief interlude (nearly 3 years) I recently got motivated to look at using the SIPSorcery SIP stack to build a video softphone. Of course there are already a number of fully featured video softphones available so the project was for fun rather than to solve any particular problem.
Unlike my previous attempts this time I have been successful and the image below shows the video softphone prototype on a call with CounterPath’s Bria Softphone (that’s me chatting to Max).
I did end up using Microsoft’s Media Foundation (MF) for the getting samples from my web cam but I gave up on trying to use the MF H264 codec and instead used the VP8 codec from the webproject. A motivation to use the VP8 codec is that it was the initial codec proposed for WebRTC and at some point I’d like to experiment with placing calls from the softphone to a browser.
The video softphone is available in at the sipsorcery codeplex repository under the sipsorcery-softphonev2 folder. All the MF and libvpx integration is contained in the sipsorcery-media folder.
The new softphone is purely experimental and video calls do not even work with the only other softphone I tested with, jitsi, due to a VP8 framing problem. But it is a working implementation of a Windows video softphone so may prove useful for anyone who wants to do some work in those areas.
I haven’t made much progress since the last post except to determine that I was barking up the wrong tree by attempting to combine an the audio and video streams with the media foundation. I decided to check the RTP RFCs related to H.263 and H.264 to determine how the audio and video combination should be transmitted and it turns out they are pretty much independent. That means to start with I can use the existing softphone code for the audio side of things and use the Media Foundation to do the video encoding and decoding. I’m thinking of switching to H.263 for the first video encoding mechanism as it’s simpler than H.264 and will be easier to package up into RTP.
For the moment I will keep going with my attempt to get the Media Foundation to save a video and audio stream into a single .mp4 file as I think that will be a very useful piece of code to have around. The problem I’m having at the moment is getting the audio encoding working. The audio device I’m using returns a WAVE_FORMAT_IEEE_FLOAT stream but from what I can determine I need to convert it to something like
MFAudioFormat_Dolby_AC3_SPDIF MFAudioFormat_AAC for MPEG4. I need to investigate that some more.
One added benefit of looking into how RTP transmits the audio and video streams is that I finally got around to getting a video call working between my desktop Bria softphone and the iPhone Bria version. I’ve never had much luck with video calls between Bria softphones in the past, the video would get through sporadically but not very reliably. So I was pleasantly surprised when with a few tweaks of the NAT settings on the iPhone version I was able to get a call working reliably. Although it’s still not quite perfect as the video streams will only work if the desktop softphone calls the iPhone softphone and not the other way around. Still it’s nice to see video calls supported through the SIPSorcery server with absolutely no server configuration required. That’s the advantage of a SIP server deployment with no media proxying or transcoding.
For some reason after being completely disinterested in doing anything with the RTP and audio side of VoIP calls for the last 5 or so years suddenly in the last month I decided to explore how well a .Net based softphone would work. Consequently I started tinkering around with a .Net library called NAudio that I’d seen mentioned around the traps. For my purposes NAudio provided a convenient way to get at the underlying Windows API calls for interacting with audio input and output devices. It took a little bit of time and effort to get things working but eventually I was able to successfully read audio samples from my microphone and write samples to my speakers through a test .Net application.
The softphone is open source and available in a binary form here and the source is availabe here in the sipsorcery-softphone project. Before going any further it should be noted that the softphone is extremely rudimentary and geared towards developers or VoIP hobbyists wanting to tinker rather than end users looking for trouble free calling. The user interface is extremely lacking and there are also crucial components missing such as echo cancellation, a jitter buffer, codec support (G.711 u-law is the only codec supported) etc.
My original verdict on using .Net as a softphone platform was that it was not particularly good. This was due to the fact that the microphone samples coming from NAudio were only capable of being delivered with a sample period of 200ms which is useless since the in practice the jitter buffer at the remote end will drop any packet over 50 or 100ms. However it turned out that a combination of some inefficient code in my RTP packet parsing and the fact that I was testing by running the softphone in Visual Studio debug mode was responsible for the high sampling latency. Once those issues were removed the microphone samples have been delivered reliably with a sample period of 20ms exactly as required. I was thinking i I ever wanted to have a usable softphone I’d have to move the RTP and audio processing to a C++ library but now I’m starting to believe that’s not necessary and .Net is capable of handling the 20ms sample period.
The other thing worth mentioning about the softphone is that it’s capable of placing calls directly to Google Voice’s XMPP gateway. I’m still surprised that none of the mainstream softphone developers have bothered to add the STUN bindings to their RTP stacks so that they could work with Google Voice. In the end I decided I’d just prototype it myself just for kicks. For a softphone that already has RTP and STUN protocol support adding the ability to work with Google Voice in conjunction with a SIP-to-XMPP gateway (which SIPSorcery coudl do) would literally be less than 20 lines of code.
Hopefully the softphone will be useful to someone. Judging on the number of queries I get about the SIPSorcery softphone project and the questions about .Net softphones on stackoverflow I imagine it will be.
I’ve created a short guide on how SIP manages audio streams and the sorts of things that go wrong when those streams traverse NATs. The full guide can be read at SIP and Audio Guide.
To complement the guide I’ve whipped together a diagnostics tool.
SIPSorcery RTP Diagnostics Tool
In an attempt to help people diagnose RTP audio issues I have created a new tool that provides some simple diagnostic messages about receiving and transmitting RTP packets from a SIP device. The purpose of the tool is twofold:
To use the tool take the following steps:
The tool is very rudimentary at this point but if it proves useful I will be likely to expend more effort to polish and enhance it. If you do have any feedback or feature requests please do send me an email at firstname.lastname@example.org.
SIP uses a cryptographic algorithm called MD5 for authentication however MD5 was invented in 1991 and since that time a number of flaws have been exposed in it. The US Computer Emergency Readiness Team (US-CERT) issued a vulnerability notice in 2008 that included the quote below.
Do not use the MD5 algorithm
Software developers, Certification Authorities, website owners, and users should avoid using the MD5 algorithm in any capacity. As previous research has demonstrated, it should be considered cryptographically broken and unsuitable for further use.
Does that mean SIP’s authentication mechanism is vulnerable? While not necessarily so, at least in relation to the MD5 flaws, the real answer is it depends on how much your password is worth to an attacker? For example if your SIP password only uses alphabetic characters and is 7 characters or less in length it can be brute forced for less than $1!
Read the full article here.
There was an announcement a while ago on the xmpp.org site about Google’s adoption of the official Jingle standard. The same announcement was posted by Peter Thatcher a Google engineer to the Jingle mailing list. The announcement got a bit of attention and I originally came across it on Hacker News. After spending a fair bit of time looking into what the upgrade means as far as integrating with Google’s XMPP service the conclusion I’ve come to is not much at least not yet.
The original reason I, a SIP person, started messing around with XMPP and Jingle was to see if there was a better way to integrate with Google Voice. I made a few blog posts at the time and ultimately the investigation didn’t end up yielding any fruit because of a peculiarity in the way the Google XMPP server deals with the RTP media streams which would preclude it working with standard SIP devices (very briefly it requires that STUN packets are exchanged on the RTP sockets before the RTP can flow).
So fast forward 8 months and I finally found some time to delve into the Google XMPP service again to see if there’s any chance of getting SIP interop, or more correctly RTP interop for SIP devices, working. The other thing that spurred me on a bit was Google + and specifically the integration of the Google Talk Gadget into the Google + application. The Google Talk Gadget is not new it’s been in Gmail for quite a while but the integration looks a lot more interesting with Google +, specifically being able to call from a SIP phone into a Google + Hangout (a web based voice/video conference call) would be cool.
As alluded to above the results of my latest little adventure into Google XMPP land haven’t been very fruitful. Yes Google now appear to be sort of using Jingle for their call set up but the media side of things doesn’t appear to have changed at all and in fact the original email from Peter Thatcher did state the work on XEP-0176: Jingle ICE-UDP Transport Method, which is more down to the nuts and bolts of the RTP side of things, is ongoing. In the meantime the XMPP call set up mechanisms used by Google are a combination of Jingle and Gingle (Google’s original implementation of a Jingle like protocol) and crucially the media layer is still Gingle.
My hope is that when Google get the Jingle ICE-UDP transport implemented they will also implement the XEP-0177: Jingle Raw UDP Transport Method which should allow RTP level interop with SIP devices which will in turn allow SIP Proxy services like SIP Sorcery to make and receive calls with Google’s XMPP network. Some SIP servers that proxy media such as Asterisk and FreeSWITCH are already integrating with Google’s XMPP network by implementing the Gingle/XEP-0176 STUN connectivity check mechanisms on the RTP sockets. In SIP Sorcery’s case and also for other non-media proxying servers such as Kamailio the integration is not possible because it then needs the end SIP devices to do the STUN connectivity checks and until ICE support becomes widespread they wont be able to.
Unfortunately this is another example of the really painful issues caused by NAT. If NAT had never been invented and the World had been forced to upgrade to IPv6 the internet would be a much better place :). In the meantime communications protocols like SIP and XMPP end up with as many addendums and extensions to deal with NAT as they do to deal with their core functions!