sipsorcery's blog

Occasional posts about VoIP, SIP, WebRTC and Bitcoin.


Building a video softphone part V

I’ve been able to sort of accomplish my goal of recording audio and video streams to the same MP4 file. The attached code sample does the job, however there is something wrong with the way I’m doing the sampling as the audio and video are slightly out of sync and the audio is also truncated at the end. I thought it was worth posting the sample though as it’s taken me a few days to finally get it working. The key was getting the audio’s media attributes correctly set for the writer.

// Configure the audio stream.
// See http://msdn.microsoft.com/en-us/library/windows/desktop/dd742785(v=vs.85).aspx for AAC encoder settings.
// http://msdn.microsoft.com/en-us/library/ff819476%28VS.85%29.aspx
IMFMediaType *pAudioOutType = NULL;
DWORD audioStreamIndex = 0;
CHECK_HR( MFCreateMediaType( &pAudioOutType ), L"Configure encoder failed to create media type for audio output sink." );
CHECK_HR( pAudioOutType->SetGUID( MF_MT_MAJOR_TYPE, MFMediaType_Audio ), L"Failed to set audio writer attribute, media type." );
CHECK_HR( pAudioOutType->SetGUID( MF_MT_SUBTYPE, MFAudioFormat_AAC ), L"Failed to set audio writer attribute, audio format (AAC).");
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_NUM_CHANNELS, 2 ), L"Failed to set audio writer attribute, number of channels." );
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_BITS_PER_SAMPLE, 16 ), L"Failed to set audio writer attribute, bits per sample." );
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_SAMPLES_PER_SECOND, 44100 ), L"Failed to set audio writer attribute, samples per second.");
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_AVG_BYTES_PER_SECOND, 16000 ), L"Failed to set audio writer attribute, average bytes per second.");
//CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AAC_AUDIO_PROFILE_LEVEL_INDICATION, 0x29 ), L"Failed to set audio writer attribute, level indication.");
CHECK_HR( pWriter->AddStream( pAudioOutType, &audioStreamIndex ), L"Failed to add the audio stream to the sink writer.");
pAudioOutType->Release();
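
The output type above only describes what ends up in the file; the sink writer also needs to be told what the raw captured audio looks like via SetInputMediaType. Below is a minimal sketch of that step, assuming stereo 32-bit IEEE float samples at 44.1 kHz (the WAVE_FORMAT_IEEE_FLOAT capture format mentioned in part IV below); the attached sample has the authoritative version.

// Describe the raw audio input so the sink writer can insert the AAC
// encoder between the capture format and the output type set above.
IMFMediaType *pAudioInType = NULL;
CHECK_HR( MFCreateMediaType( &pAudioInType ), L"Failed to create media type for audio input." );
CHECK_HR( pAudioInType->SetGUID( MF_MT_MAJOR_TYPE, MFMediaType_Audio ), L"Failed to set audio input attribute, media type." );
CHECK_HR( pAudioInType->SetGUID( MF_MT_SUBTYPE, MFAudioFormat_Float ), L"Failed to set audio input attribute, audio format (IEEE float)." );
CHECK_HR( pAudioInType->SetUINT32( MF_MT_AUDIO_NUM_CHANNELS, 2 ), L"Failed to set audio input attribute, number of channels." );
CHECK_HR( pAudioInType->SetUINT32( MF_MT_AUDIO_BITS_PER_SAMPLE, 32 ), L"Failed to set audio input attribute, bits per sample." );
CHECK_HR( pAudioInType->SetUINT32( MF_MT_AUDIO_SAMPLES_PER_SECOND, 44100 ), L"Failed to set audio input attribute, samples per second." );
CHECK_HR( pAudioInType->SetUINT32( MF_MT_AUDIO_BLOCK_ALIGNMENT, 8 ), L"Failed to set audio input attribute, block alignment (2 channels x 4 bytes)." );
CHECK_HR( pAudioInType->SetUINT32( MF_MT_AUDIO_AVG_BYTES_PER_SECOND, 44100 * 8 ), L"Failed to set audio input attribute, average bytes per second." );
CHECK_HR( pWriter->SetInputMediaType( audioStreamIndex, pAudioInType, NULL ), L"Failed to set the audio input media type on the sink writer." );
pAudioInType->Release();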

RecordMP4 code sample


Building a video softphone part IV

I haven’t made much progress since the last post except to determine that I was barking up the wrong tree by attempting to combine the audio and video streams with the Media Foundation. I decided to check the RTP RFCs related to H.263 and H.264 to determine how the audio and video combination should be transmitted and it turns out the two streams are pretty much independent. That means to start with I can use the existing softphone code for the audio side of things and use the Media Foundation to do the video encoding and decoding. I’m thinking of switching to H.263 for the first video encoding mechanism as it’s simpler than H.264 and will be easier to package up into RTP.
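
To make that independence concrete, the SDP in the SIP call simply advertises two separate m= lines, one per RTP session, and each stream then flows on its own pair of ports. A minimal illustrative offer (the address and ports are placeholders; 0 and 34 are the static RTP payload types for PCMU and H.263):

v=0
o=- 0 0 IN IP4 192.168.0.10
s=video call
c=IN IP4 192.168.0.10
t=0 0
m=audio 10000 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 10002 RTP/AVP 34
a=rtpmap:34 H263/90000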

For the moment I will keep going with my attempt to get the Media Foundation to save a video and audio stream into a single .mp4 file as I think that will be a very useful piece of code to have around. The problem I’m having at the moment is getting the audio encoding working. The audio device I’m using returns a WAVE_FORMAT_IEEE_FLOAT stream but from what I can determine I need to convert it to something like MFAudioFormat_Dolby_AC3_SPDIF or MFAudioFormat_AAC for MPEG4. I need to investigate that some more.

One added benefit of looking into how RTP transmits the audio and video streams is that I finally got around to getting a video call working between my desktop Bria softphone and the iPhone Bria version. I’ve never had much luck with video calls between Bria softphones in the past; the video would get through sporadically but not very reliably. So I was pleasantly surprised when, with a few tweaks of the NAT settings on the iPhone version, I was able to get a call working reliably. It’s still not quite perfect as the video streams will only work if the desktop softphone calls the iPhone softphone and not the other way around. Still, it’s nice to see video calls supported through the SIPSorcery server with absolutely no server configuration required. That’s the advantage of a SIP server deployment with no media proxying or transcoding.


Building a video softphone part III.2

Capturing and converting the video stream from my webcam to an H.264 file didn’t prove to be as bad as I thought. It did help a lot that the Media Foundation SDK has a sample called MFCaptureToFile that does exactly the same thing, and the vast majority of the code I’ve used below has been copied and pasted directly from that sample.

Here’s the 5 second .mp4 video that represents the fruits of my labour.

The important code bits are shown below (just to reiterate, I barely know what I’m doing with this stuff and generally remove the error checking and memory management code to help me get the gist of the bits I’m interested in).

// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
  WCHAR *pwszFileName = L"sample.mp4";
  IMFSinkWriter *pWriter = NULL;

  hr = MFCreateSinkWriterFromURL(
    pwszFileName,
    NULL,
    NULL,
    &pWriter);

  // Create the source reader.
  IMFSourceReader *pReader = NULL;

  hr = MFCreateSourceReaderFromMediaSource(*ppSource, pConfig, &pReader);

  //GetCurrentMediaType(pReader);
  //ListModes(pReader);

  IMFMediaType *pType = NULL;
  pReader->GetCurrentMediaType((DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM, &pType);

  printf("Configuring H.264 sink.n");

  // Set up the H.264 sink.
  hr = ConfigureEncoder(pType, pWriter);
  if (FAILED(hr))
  {
    printf("Configuring the H.264 sink failed.n");
  }

  // Register the color converter DSP for this process, in the video
  // processor category. This will enable the sink writer to enumerate
  // the color converter when the sink writer attempts to match the
  // media types.

  hr = MFTRegisterLocalByCLSID(
    __uuidof(CColorConvertDMO),
    MFT_CATEGORY_VIDEO_PROCESSOR,
    L"",
    MFT_ENUM_FLAG_SYNCMFT,
    0,
    NULL,
    0,
    NULL);

  hr = pWriter->SetInputMediaType(0, pType, NULL);
  if (FAILED(hr))
  {
    printf("Failure setting the input media type on the H.264 sink.n");
  }

  hr = pWriter->BeginWriting();
  if (FAILED(hr))
  {
    printf("Failed to begin writing on the H.264 sink.n");
  }

  DWORD streamIndex, flags;
  LONGLONG llTimeStamp = 0;
  IMFSample *pSample = NULL;
  CRITICAL_SECTION critsec;
  BOOL bFirstSample = TRUE;
  LONGLONG llBaseTime = 0;
  int sampleCount = 0;

  InitializeCriticalSection(&critsec);

  printf("Recording...\n");

  while (sampleCount < 100) {
    hr = pReader->ReadSample(
      MF_SOURCE_READER_ANY_STREAM, // Stream index.
      0, // Flags.
      &streamIndex, // Receives the actual stream index.
      &flags, // Receives status flags.
      &llTimeStamp, // Receives the time stamp.
      &pSample // Receives the sample or NULL.
    );

    wprintf(L"Stream %d (%I64d)\n", streamIndex, llTimeStamp);

    if (pSample)
    {
      if (bFirstSample)
      {
        llBaseTime = llTimeStamp;
        bFirstSample = FALSE;
      }

      // rebase the time stamp
      llTimeStamp -= llBaseTime;

      hr = pSample->SetSampleTime(llTimeStamp);

      if (FAILED(hr))
      {
        printf("Set psample time failed.n");
      }

      hr = pWriter->WriteSample(0, pSample);

      if (FAILED(hr))
      {
        printf("Write sample failed.n");
      }
    }

    sampleCount++;
  }

  printf("Finalising the capture.");

  if (pWriter)
  {
    hr = pWriter->Finalize();
  }

  //WriteSampleToBitmap(pSample);

  // Shut down Media Foundation.
  MFShutdown();
}

HRESULT ConfigureEncoder(IMFMediaType *pType, IMFSinkWriter *pWriter)
{
  HRESULT hr = S_OK;

  IMFMediaType *pType2 = NULL;

  hr = MFCreateMediaType(&pType2);

  if (SUCCEEDED(hr))
  {
    hr = pType2->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
  }

  if (SUCCEEDED(hr))
  {
    hr = pType2->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_H264);
  }

  if (SUCCEEDED(hr))
  {
    hr = pType2->SetUINT32(MF_MT_AVG_BITRATE, 240 * 1000);
  }

  if (SUCCEEDED(hr))
  {
    hr = CopyAttribute(pType, pType2, MF_MT_FRAME_SIZE);
  }

  if (SUCCEEDED(hr))
  {
    hr = CopyAttribute(pType, pType2, MF_MT_FRAME_RATE);
  }

  if (SUCCEEDED(hr))
  {
    hr = CopyAttribute(pType, pType2, MF_MT_PIXEL_ASPECT_RATIO);
  }

  if (SUCCEEDED(hr))
  {
    hr = CopyAttribute(pType, pType2, MF_MT_INTERLACE_MODE);
  }

  if (SUCCEEDED(hr))
  {
    DWORD pdwStreamIndex = 0;
    hr = pWriter->AddStream(pType2, &pdwStreamIndex);
  }

  pType2->Release();

  return hr;
}

HRESULT CopyAttribute(IMFAttributes *pSrc, IMFAttributes *pDest, const GUID key)
{
  PROPVARIANT var;
  PropVariantInit(&var);

  HRESULT hr = S_OK;

  hr = pSrc->GetItem(key, &var);
  if (SUCCEEDED(hr))
  {
    hr = pDest->SetItem(key, var);
  }

  PropVariantClear(&var);
  return hr;
}

One thing that’s missing is audio. I’ve got the video into the .mp4 file but I need an audio stream in there as well.

The next step is to get audio in and then try and check that the media file will be understood by a different video softphone, probably Counterpath’s Bria since I already have that installed.


Building a video softphone part III

It feels like I’ve made a lot of progress in the last few days, although reflecting on what I’ve achieved I actually haven’t got much closer to the softphone goal. My two accomplishments, which seemed exciting at the time, were:

  • Successfully get a list of all the video modes that my webcam supports,
  • Get a video stream from my webcam and save a single frame as a bitmap.

The first step was to get an IMFSourceReader from the IMFMediaSource (my webcam) I created in part II. My understanding of the way these two interfaces work is that IMFMediaSource is implemented by a class that wraps a device, file, network stream etc. that is capable of providing some audio or video, and IMFSourceReader by a class that knows how to read samples from the media source.

The code I used to list my webcam’s video modes is shown below.

// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
  // Create the source reader.
  IMFSourceReader *pReader;

  hr = MFCreateSourceReaderFromMediaSource(*ppSource, pConfig, &amp; pReader);

  if (SUCCEEDED(hr))
  {
    DWORD dwMediaTypeIndex = 0;

    while (SUCCEEDED(hr))
    {
      IMFMediaType *pType = NULL;
      hr = pReader->GetNativeMediaType(0, dwMediaTypeIndex, &pType);
      if (hr == MF_E_NO_MORE_TYPES)
      {
        hr = S_OK;
        break;
      }
      else if (SUCCEEDED(hr))
      {
        // Examine the media type.
        CMediaTypeTrace *nativeTypeMediaTrace = new CMediaTypeTrace(pType);
        printf("Native media type: %s.n", nativeTypeMediaTrace->GetString());
        pType->Release();
      }

      ++dwMediaTypeIndex;
    }
  }
}

The code in the snippet is all standard Media Foundation calls except for the CMediaTypeTrace class. That’s actually the useful class since it takes the IMFMediaType, which is mostly a bunch of GUIDs that map to constants to describe one of the webcam’s modes, and spits out some plain English to represent the resolution, format etc. of the webcam’s mode. The CMediaTypeTrace class is not actually in the Media Foundation library and instead is provided in mediatypetrace.h, which is in one of the samples in the MediaFoundation directory that comes with the Windows SDK (on my system it’s in Windows\v7.1\Samples\multimedia\mediafoundation\topoedit\tedutil). As it happens the two video modes that my camera supports, RGB24 and I420, were not included in the list of GUIDs in mediatypetrace.h so I had to search around the place to find what they were and then add them in.

LPCSTR STRING_FROM_GUID(GUID Attr)
{
  ...
  INTERNAL_GUID_TO_STRING(MFVideoFormat_RGB24, 14); // RGB24
  INTERNAL_GUID_TO_STRING(WMMEDIASUBTYPE_I420, 15); // I420
}

The full list of modes my webcam supports is shown below.

Device Name : Logitech QuickCam Pro 9000.
Current media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 640, H : 480.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 640, H : 480.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 160, H : 90.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 160, H : 100.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 160, H : 120.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 176, H : 144.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 320, H : 180.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 320, H : 200.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 320, H : 240.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 352, H : 288.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 640, H : 360.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 640, H : 400.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 864, H : 480.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 768, H : 480.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 800, H : 450.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 800, H : 500.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 800, H : 600.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 960, H : 720.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 1280, H : 720.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 1280, H : 800.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 1280, H : 1024.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 1600, H : 900.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 1600, H : 1000.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = RGB24, FRAME_SIZE = W 1600, H : 1200.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 640, H : 480.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 160, H : 90.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 160, H : 100.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 160, H : 120.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 176, H : 144.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 320, H : 180.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 320, H : 200.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 320, H : 240.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 352, H : 288.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 640, H : 360.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 640, H : 400.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 864, H : 480.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 768, H : 480.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 800, H : 450.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 800, H : 500.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 800, H : 600.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 960, H : 720.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 1280, H : 720.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 1280, H : 800.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 1280, H : 1024.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 1600, H : 900.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 1600, H : 1000.
Native media type : Video: MAJOR_TYPE = Video, SUBTYPE = I420, FRAME_SIZE = W 1600, H : 1200.

The second thing I was able to do was to take a sample from my webcam and save it as a bitmap. To do this I took some short-cuts, namely hard coding the size of the sample, which I know from my webcam’s default mode (640 x 480), and relying on the fact that that mode does not result in any padding (I’m not 100% sure on that and have taken an educated guess; there’s a note after the code on how to check it). I found someone else’s sample that created a bitmap file and blatantly copied it. Below is the code I used to extract the sample and save the bitmap.

// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
  // Create the source reader.
  IMFSourceReader *pReader;

  hr = MFCreateSourceReaderFromMediaSource(
    *ppSource,
    pConfig,
    &pReader);

  //GetCurrentMediaType(pReader);
  //ListModes(pReader);

  DWORD streamIndex, flags;
  LONGLONG llTimeStamp;
  IMFSample *pSample = NULL;

  while (!pSample)
  {
    // Initial read results in a null pSample??
    hr = pReader - &gt; ReadSample(
      MF_SOURCE_READER_ANY_STREAM, // Stream index.
      0, // Flags.
      &streamIndex, // Receives the actual stream index.
      &flags, // Receives status flags.
      &llTimeStamp, // Receives the time stamp.
      &pSample // Receives the sample or NULL.
    );

    wprintf(L&quot; Stream %d(%I64d)n&quot; , streamIndex, llTimeStamp);
  }

  // Use non-2D version of sample.
  IMFMediaBuffer *mediaBuffer = NULL;
  BYTE *pData = NULL;
  DWORD writePosn = 0;

  pSample - &gt; ConvertToContiguousBuffer(&amp; mediaBuffer);

  hr = mediaBuffer - &gt; Lock(&amp; pData, NULL, NULL);

  HANDLE file = CreateBitmapFile(&amp; writePosn);

  WriteFile(file, pData, 640 * 480 * (24 / 8), &amp; writePosn, NULL);

  CloseHandle(file);

  mediaBuffer - &gt; Unlock();

  // Shut down Media Foundation.
  MFShutdown();
}

HANDLE CreateBitmapFile(DWORD *writePosn)
{
  HANDLE file;
  BITMAPFILEHEADER fileHeader;
  BITMAPINFOHEADER fileInfo;
  //DWORD write = 0;

  file = CreateFile(L&quot; sample.bmp&quot; , GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL); //Sets up the new bmp to be written to

  fileHeader.bfType = 19778; //Sets our type to BM or bmp
  fileHeader.bfSize = sizeof(BITMAPFILEHEADER) + sizeof(BITMAPINFOHEADER) + (640 * 480 * (24 / 8)); //Sets the size to the total file size: both headers plus the pixel data
  fileHeader.bfReserved1 = 0; //sets the reserves to 0
  fileHeader.bfReserved2 = 0;
  fileHeader.bfOffBits = sizeof(BITMAPFILEHEADER) + sizeof(BITMAPINFOHEADER); //Sets offbits equal to the size of the file and info headers

  fileInfo.biSize = sizeof(BITMAPINFOHEADER);
  fileInfo.biWidth = 640;
  fileInfo.biHeight = 480;
  fileInfo.biPlanes = 1;
  fileInfo.biBitCount = 24;
  fileInfo.biCompression = BI_RGB;
  fileInfo.biSizeImage = 640 * 480 * (24 / 8);
  fileInfo.biXPelsPerMeter = 2400;
  fileInfo.biYPelsPerMeter = 2400;
  fileInfo.biClrImportant = 0;
  fileInfo.biClrUsed = 0;

  WriteFile(file, &amp; fileHeader, sizeof(fileHeader), writePosn, NULL);
  WriteFile(file, &amp; fileInfo, sizeof(fileInfo), writePosn, NULL);

  return file;
}
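
As a footnote to the padding guess mentioned above, the stride (bytes per row, including any padding) can be computed rather than assumed. A small sketch using MFGetStrideForBitmapInfoHeader, which takes the format’s FOURCC/D3D code (stored in Data1 of the video subtype GUID):

// Check whether a 640 pixel wide RGB24 frame is tightly packed.
LONG stride = 0;
HRESULT hr = MFGetStrideForBitmapInfoHeader(MFVideoFormat_RGB24.Data1, 640, &stride);
if (SUCCEEDED(hr))
{
  printf("Minimum stride: %ld bytes (tightly packed would be %d).\n", stride, 640 * 3);
}

As it happens 640 x 3 = 1920 bytes is already a multiple of 4, so the educated guess that there’s no padding holds for this mode.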

So that was all fun but it hasn’t gotten me much closer to having an H.264 stream ready for bundling into my RTP packets. Getting the H.264 stream will be my next focus. I think I’ll try capturing it to an .mp4 file as a first step. Actually I wonder if there’s a way I can test an .mp4 file with a softphone and VLC? That would be a handy way to test whether the H.264 stream I get is actually going to work when I use it in a VoIP call.

I also ordered Developing Microsoft Media Foundation Applications from Amazon thinking it might help me on this journey, only to find it available for free online a couple of days later :(.


Building a video softphone part II

Got the video device enumeration code working, at least well enough for it to tell me my webcam is a Logitech QuickCam Pro 9000. The working code is below.

#include "stdafx.h"
#include <mfapi.h>
#include <mfplay.h>
#include "common.h"

HRESULT CreateVideoDeviceSource(IMFMediaSource **ppSource);

int _tmain(int argc, _TCHAR* argv[])
{
  printf("Get webcam properties test console.n");

  CoInitializeEx(NULL, COINIT_APARTMENTTHREADED | COINIT_DISABLE_OLE1DDE);

  IMFMediaSource *ppSource = NULL;

  CreateVideoDeviceSource(&ppSource);

  getchar();

  return 0;
}

HRESULT CreateVideoDeviceSource(IMFMediaSource **ppSource)
{
  *ppSource = NULL;

  UINT32 count = 0;

  IMFAttributes *pConfig = NULL;
  IMFActivate **ppDevices = NULL;

  // Create an attribute store to hold the search criteria.
  HRESULT hr = MFCreateAttributes(&pConfig, 1);

  // Request video capture devices.
  if (SUCCEEDED(hr))
  {
    hr = pConfig->SetGUID(
      MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE,
      MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE_VIDCAP_GUID
    );
  }

  // Enumerate the devices,
  if (SUCCEEDED(hr))
  {
    hr = MFEnumDeviceSources(pConfig, &ppDevices, &count);
  }

  printf("Device Count: %i.n", count);

  // Create a media source for the first device in the list.
  if (SUCCEEDED(hr))
  {
    if (count > 0)
    {
      hr = ppDevices[0]->ActivateObject(IID_PPV_ARGS(ppSource));

      if (SUCCEEDED(hr))
      {
        WCHAR *szFriendlyName = NULL;

        // Try to get the display name.
        UINT32 cchName;
        hr = ppDevices[0]->GetAllocatedString(
          MF_DEVSOURCE_ATTRIBUTE_FRIENDLY_NAME,
          &szFriendlyName, &cchName);

        if (SUCCEEDED(hr))
        {
          wprintf(L"Device Name: %s.n", szFriendlyName);
        }
        else
        {
          printf("Error getting device attribute.");
        }

        CoTaskMemFree(szFriendlyName);
      }
    }
    else
    {
      hr = MF_E_NOT_FOUND;
    }
  }

  for (DWORD i = 0; i < count; i++)
  {
    ppDevices[i]->Release();
  }
  CoTaskMemFree(ppDevices);
  return hr;
}

The problem I had previously was a missing call to CoInitializeEx. It seems it’s needed to initialise things to allow the use of the COM libraries.

The next step is now to work out how to get a sample from the webcam.


Building a video capable softphone with Windows Media Foundation

This is the first post in what will hopefully be a successful series of posts detailing how I manage to build a video capable softphone using Windows Media Foundation.

I’d never heard of Media Foundation until last week so not only do I not know how to use it, I also don’t know whether it will be suitable for the task. I do know it is the successor to the Windows DirectShow API but does not yet provide the same coverage, so I may have to delve into the DirectShow API as well. On top of that neither API has a comprehensive managed .Net interface, so that means the job needs to be done in C++. My C++ skills are severely undernourished so I’m expecting it to take a while to get up to speed before I can start really diving into the APIs.

What I do have is a basic working softphone that I can build on, which means I can focus on the video side of things. My goal is to be able to place a SIP H.264 video call with my webcam to another video softphone, such as Counterpath’s Bria. Given other things going on at the moment, such as a 7 week old baby and a 3 year old, I’m estimating the project could take 2 to 3 months. As to why I’m interested in this, it’s because it’s something different from both .Net and SIP. I’ve been working with both of those for a long time so taking a break and playing with something different but still related is appealing.

Enough chit chat, getting started…

1. The first thing I’ve done is to install the Windows SDK for Windows 7 and take a look at the Media Foundation sample projects. The first sample I tried was the SimpleCapture project and it ran fine out of the box.

2. After looking through a few more of the samples I feel the need to get coding. Being able to get a video stream from my webcam is the obvious place to start. I’ve created a C++ Win32 console application and found an article which discusses enumerating the system’s video capture devices. I haven’t gotten very far as yet but I’m now wondering if my Logitech Webcam Pro 9000 driver supports H.264, which would mean I wouldn’t need to use any of the Media Foundation H.264 codec capabilities. A quick look at the camera’s specification page and I’m pretty sure the answer is no.

3. I’ve now got the sample compiling and running but the count of my video devices is coming back as 0 🙁 so I’ve probably got some flags wrong somewhere.

The first couple of hours hasn’t got me very far yet. More tomorrow.


.Net Softphone

For some reason, after being completely uninterested in doing anything with the RTP and audio side of VoIP calls for the last 5 or so years, in the last month I suddenly decided to explore how well a .Net based softphone would work. Consequently I started tinkering around with a .Net library called NAudio that I’d seen mentioned around the traps. For my purposes NAudio provided a convenient way to get at the underlying Windows API calls for interacting with audio input and output devices. It took a little bit of time and effort to get things working but eventually I was able to successfully read audio samples from my microphone and write samples to my speakers through a test .Net application.

The softphone is open source and available in binary form here and the source is available here in the sipsorcery-softphone project. Before going any further it should be noted that the softphone is extremely rudimentary and geared towards developers or VoIP hobbyists wanting to tinker rather than end users looking for trouble free calling. The user interface is extremely lacking and there are also crucial components missing such as echo cancellation, a jitter buffer, codec support (G.711 u-law is the only codec supported) etc.
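
On the codec point, G.711 u-law companding maps each 16-bit PCM sample to a single byte. A sketch of the textbook linear-to-u-law conversion is below (this is the classic public-domain algorithm, not necessarily the exact code the softphone uses):

#include <stdint.h>

// Classic linear PCM to G.711 u-law companding.
uint8_t LinearToMuLaw(int16_t sample)
{
  const int BIAS = 0x84;   // Shifts the magnitude into the companding range.
  const int CLIP = 32635;  // Largest magnitude that fits after biasing.

  int pcm = sample;
  int sign = (pcm >> 8) & 0x80; // Keep the sign bit.
  if (sign) pcm = -pcm;
  if (pcm > CLIP) pcm = CLIP;
  pcm += BIAS;

  // The segment (exponent) is the position of the highest set bit from 7 to 14.
  int exponent = 7;
  for (int mask = 0x4000; (pcm & mask) == 0 && exponent > 0; mask >>= 1)
    exponent--;

  int mantissa = (pcm >> (exponent + 3)) & 0x0F;
  return (uint8_t)~(sign | (exponent << 4) | mantissa); // u-law bytes are stored inverted.
}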

My original verdict on using .Net as a softphone platform was that it was not particularly good. This was due to the fact that the microphone samples coming from NAudio were only capable of being delivered with a sample period of 200ms, which is useless since in practice the jitter buffer at the remote end will drop any packet over 50 or 100ms. However it turned out that a combination of some inefficient code in my RTP packet parsing and the fact that I was testing by running the softphone in Visual Studio debug mode was responsible for the high sampling latency. Once those issues were removed the microphone samples were delivered reliably with a sample period of 20ms, exactly as required. I was thinking if I ever wanted to have a usable softphone I’d have to move the RTP and audio processing to a C++ library but now I’m starting to believe that’s not necessary and .Net is capable of handling the 20ms sample period.
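
For context, the arithmetic behind that 20ms figure: G.711 samples audio at 8kHz, so a 20ms packet carries 8000 x 0.02 = 160 samples, and with u-law’s one byte per sample that’s a 160 byte RTP payload being sent 50 times a second in each direction.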

The other thing worth mentioning about the softphone is that it’s capable of placing calls directly to Google Voice’s XMPP gateway. I’m still surprised that none of the mainstream softphone developers have bothered to add the STUN bindings to their RTP stacks so that they could work with Google Voice. In the end I decided I’d prototype it myself just for kicks. For a softphone that already has RTP and STUN protocol support, adding the ability to work with Google Voice in conjunction with a SIP-to-XMPP gateway (which SIPSorcery could do) would literally be less than 20 lines of code.

Hopefully the softphone will be useful to someone. Judging by the number of queries I get about the SIPSorcery softphone project and the questions about .Net softphones on stackoverflow I imagine it will be.


SIP and Audio

I’ve created a short guide on how SIP manages audio streams and the sorts of things that go wrong when those streams traverse NATs. The full guide can be read at SIP and Audio Guide.

To complement the guide I’ve whipped together a diagnostics tool.

SIPSorcery RTP Diagnostics Tool

In an attempt to help people diagnose RTP audio issues I have created a new tool that provides some simple diagnostic messages about receiving and transmitting RTP packets from a SIP device. The purpose of the tool is twofold:

  1. On a SIP call, indicate the socket the RTP packets were expected from and the actual socket they came from,
  2. On a SIP call indicate whether it was possible to transmit RTP packets to the same socket the SIP caller was sending from.

To use the tool take the following steps:

  1. Open http://diags.sipsorcery.com in a browser and click the Go button. Note that the web page uses web sockets which are only supported in the latest web browsers; I’ve tested it in Chrome 16, Firefox 9.0.1 and Internet Explorer 9,
  2. A message will be displayed that contains a SIP address to call. Type that into your softphone or set up a SIPSorcery dialplan rule to call it,
  3. If the tool receives a call on the SIP address it will display information about how it received and sent RTP packets.

The tool is very rudimentary at this point but if it proves useful I will be likely to expend more effort to polish and enhance it. If you do have any feedback or feature requests please do send me an email at aaron@sipsorcery.com.


SIP Password Security – How much is yours worth?

SIP uses a cryptographic algorithm called MD5 for authentication. However, MD5 was invented in 1991 and since that time a number of flaws have been exposed in it. The US Computer Emergency Readiness Team (US-CERT) issued a vulnerability notice in 2008 that included the quote below.

Do not use the MD5 algorithm
Software developers, Certification Authorities, website owners, and users should avoid using the MD5 algorithm in any capacity. As previous research has demonstrated, it should be considered cryptographically broken and unsuitable for further use.

Does that mean SIP’s authentication mechanism is vulnerable? Not necessarily, at least in relation to the MD5 flaws; the real answer is that it depends on how much your password is worth to an attacker. For example if your SIP password only uses alphabetic characters and is 7 characters or less in length it can be brute forced for less than $1!
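
To put rough numbers on that claim (my back-of-the-envelope arithmetic, not figures from the article): a lowercase alphabetic password of at most 7 characters gives 26^1 + 26^2 + … + 26^7 ≈ 8.35 × 10^9 candidates, and with commodity GPU hardware computing on the order of 10^9 MD5 hashes per second that whole space can be searched in well under a minute of rented compute time.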

Read the full article here.


New Web Callback Feature

By popular request, mainly from Voxalot refugees, a new web callback feature is now available for SIPSorcery Premium and Professional users. The feature is available on the AJAX portal. Unlike the original call manager approach (outlined at the bottom of this page), which initiated a Ruby dial plan execution and did not require authentication, the new mechanism DOES require authentication and sets up a call between two pre-configured dial strings rather than executing an existing dial plan.

The new mechanism is simpler to use but is not as powerful or flexible as the original approach. Hopefully the new mechanism is closer to what Voxalot refugees are used to and will allow any saved Voxalot callbacks to be used.

There is help available but the mechanism should be fairly intuitive to use. The way it works is that you enter two dial strings (dial strings are the same format as those that can be used in sys.Dial in Ruby dial plans and can include multiple call legs and other options) and a description. After that it’s just a matter of clicking on “place call” and the SIPSorcery server will attempt to call the first leg and, if it gets an answer, will then call the second leg and finally bridge the calls together with a SIP re-INVITE.
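
As an illustration (my example, not from the help page), a dial string for one of the legs might look like the following, where the provider names are placeholders, & forks the call to multiple legs simultaneously and | tries the next leg if the previous one fails:

123456789@provider1&123456789@provider2|123456789@provider3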

Enjoy!