April 2012


I’ve been able to sort of accomplish my goal of recording audio and video streams to the same MP4 file. The attached code sample does the job, however there is something wrong with the way I’m doing the sampling as the audio and video are slightly out of sync and the audio is also truncated at the end. I thought it was worth posting the sample though as it’s taken me a few days to finally get it working. The key was getting the audio’s media attributes correctly set for the writer.

// Configure the audio stream.
// See http://msdn.microsoft.com/en-us/library/windows/desktop/dd742785(v=vs.85).aspx for AAC encoder settings.
// http://msdn.microsoft.com/en-us/library/ff819476%28VS.85%29.aspx
CHECK_HR(MFCreateMediaType( &pAudioOutType ), L"Configure encoder failed to create media type for audio output sink." );
CHECK_HR( pAudioOutType->SetGUID( MF_MT_MAJOR_TYPE, MFMediaType_Audio ), L"Failed to set audio writer attribute, media type." );  
CHECK_HR( pAudioOutType->SetGUID( MF_MT_SUBTYPE, MFAudioFormat_AAC ), L"Failed to set audio writer attribute, audio format (AAC).");
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_NUM_CHANNELS, 2 ), L"Failed to set audio writer attribute, number of channels." );
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_BITS_PER_SAMPLE, 16 ), L"Failed to set audio writer attribute, bits per sample." );
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_SAMPLES_PER_SECOND, 44100 ), L"Failed to set audio writer attribute, samples per second.");
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_AVG_BYTES_PER_SECOND, 16000 ), L"Failed to set audio writer attribute, average bytes per second.");
//CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AAC_AUDIO_PROFILE_LEVEL_INDICATION, 0x29 ), L"Failed to set audio writer attribute, level indication.");
CHECK_HR( pWriter->AddStream( pAudioOutType, &audioStreamIndex ), L"Failed to add the audio stream to the sink writer.");
pAudioOutType->Release();

RecordMP4 code sample

I haven’t made much progress since the last post except to determine that I was barking up the wrong tree by attempting to combine the audio and video streams with Media Foundation. I decided to check the RTP RFCs related to H.263 and H.264 to determine how the audio and video combination should be transmitted and it turns out they are pretty much independent. That means to start with I can use the existing softphone code for the audio side of things and use the Media Foundation to do the video encoding and decoding. I’m thinking of switching to H.263 for the first video encoding mechanism as it’s simpler than H.264 and will be easier to package up into RTP.

For the moment I will keep going with my attempt to get the Media Foundation to save a video and audio stream into a single .mp4 file as I think that will be a very useful piece of code to have around. The problem I’m having at the moment is getting the audio encoding working. The audio device I’m using returns a WAVE_FORMAT_IEEE_FLOAT stream but from what I can determine I need to convert it to something like MFAudioFormat_Dolby_AC3_SPDIF or MFAudioFormat_AAC for MPEG4. I need to investigate that some more.
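While I work that out, one piece I can reason about already is the raw sample conversion itself: WAVE_FORMAT_IEEE_FLOAT delivers each sample as a 32 bit float in the nominal range -1.0 to 1.0, and 16 bit PCM (the usual input to an AAC encoder) is just those values scaled to the signed 16 bit range. Here’s a minimal sketch of that step; the function name is my own invention, not a Media Foundation API, and the resampler/encoder plumbing is the part I still need to figure out.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Convert WAVE_FORMAT_IEEE_FLOAT samples (nominal range -1.0 to 1.0)
// to 16 bit signed PCM, clamping anything outside the nominal range.
std::vector<int16_t> FloatToPcm16(const std::vector<float>& in)
{
    std::vector<int16_t> out;
    out.reserve(in.size());
    for (float s : in)
    {
        float clamped = std::min(1.0f, std::max(-1.0f, s));
        out.push_back(static_cast<int16_t>(clamped * 32767.0f));
    }
    return out;
}
```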

One added benefit of looking into how RTP transmits the audio and video streams is that I finally got around to getting a video call working between my desktop Bria softphone and the iPhone Bria version. I’ve never had much luck with video calls between Bria softphones in the past, the video would get through sporadically but not very reliably. So I was pleasantly surprised when with a few tweaks of the NAT settings on the iPhone version I was able to get a call working reliably. Although it’s still not quite perfect as the video streams will only work if the desktop softphone calls the iPhone softphone and not the other way around. Still it’s nice to see video calls supported through the SIPSorcery server with absolutely no server configuration required. That’s the advantage of a SIP server deployment with no media proxying or transcoding.

Capturing and converting the video stream from my webcam to an H.264 file didn’t prove to be as bad as I thought. It did help a lot that the Media Foundation SDK has a sample called MFCaptureToFile that’s doing exactly the same thing and the vast majority of the code I’ve used below has been copied and pasted directly from that sample.

Here’s the 5 second .mp4 video that represents the fruits of my labour.

The important code bits are shown below (just to reiterate I barely know what I’m doing with this stuff and generally remove the error checking and memory management code to help me get the gist of the bits I’m interested in).


// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
WCHAR *pwszFileName = L"sample.mp4";
IMFSinkWriter *pWriter;

hr = MFCreateSinkWriterFromURL(
pwszFileName,
NULL,
NULL,
&pWriter);

// Create the source reader.
IMFSourceReader *pReader;

hr = MFCreateSourceReaderFromMediaSource(*ppSource, pConfig, &pReader);

//GetCurrentMediaType(pReader);
//ListModes(pReader);

IMFMediaType *pType = NULL;
pReader->GetCurrentMediaType((DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM, &pType);

printf("Configuring H.264 sink.\n");

// Set up the H.264 sink.
hr = ConfigureEncoder(pType, pWriter);
if (FAILED(hr))
{
printf("Configuring the H.264 sink failed.\n");
}

// Register the color converter DSP for this process, in the video
// processor category. This will enable the sink writer to enumerate
// the color converter when the sink writer attempts to match the
// media types.

hr = MFTRegisterLocalByCLSID(
__uuidof(CColorConvertDMO),
MFT_CATEGORY_VIDEO_PROCESSOR,
L"",
MFT_ENUM_FLAG_SYNCMFT,
0,
NULL,
0,
NULL);

hr = pWriter->SetInputMediaType(0, pType, NULL);
if (FAILED(hr))
{
printf("Failure setting the input media type on the H.264 sink.\n");
}

hr = pWriter->BeginWriting();
if (FAILED(hr))
{
printf("Failed to begin writing on the H.264 sink.\n");
}

DWORD streamIndex, flags;
LONGLONG llTimeStamp;
IMFSample *pSample = NULL;
CRITICAL_SECTION critsec;
BOOL bFirstSample = TRUE;
LONGLONG llBaseTime = 0;
int sampleCount = 0;

InitializeCriticalSection(&critsec);

printf("Recording...\n");

while(sampleCount < 100)
{
hr = pReader->ReadSample(
MF_SOURCE_READER_ANY_STREAM, // Stream index.
0, // Flags.
&streamIndex, // Receives the actual stream index.
&flags, // Receives status flags.
&llTimeStamp, // Receives the time stamp.
&pSample // Receives the sample or NULL.
);

wprintf(L"Stream %d (%I64d)\n", streamIndex, llTimeStamp);

if (pSample)
{
if (bFirstSample)
{
llBaseTime = llTimeStamp;
bFirstSample = FALSE;
}

// rebase the time stamp
llTimeStamp -= llBaseTime;

hr = pSample->SetSampleTime(llTimeStamp);

if (FAILED(hr))
{
printf("Set sample time failed.\n");
}

hr = pWriter->WriteSample(0, pSample);

if (FAILED(hr))
{
printf("Write sample failed.\n");
}
}

sampleCount++;
}

printf("Finalising the capture.\n");

if (pWriter)
{
hr = pWriter->Finalize();
}

//WriteSampleToBitmap(pSample);

// Shut down Media Foundation.
MFShutdown();
}

HRESULT ConfigureEncoder(IMFMediaType *pType, IMFSinkWriter *pWriter)
{
HRESULT hr = S_OK;

IMFMediaType *pType2 = NULL;

hr = MFCreateMediaType(&pType2);

if (SUCCEEDED(hr))
{
hr = pType2->SetGUID( MF_MT_MAJOR_TYPE, MFMediaType_Video );
}

if (SUCCEEDED(hr))
{
hr = pType2->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_H264);
}

if (SUCCEEDED(hr))
{
hr = pType2->SetUINT32(MF_MT_AVG_BITRATE, 240 * 1000);
}

if (SUCCEEDED(hr))
{
hr = CopyAttribute(pType, pType2, MF_MT_FRAME_SIZE);
}

if (SUCCEEDED(hr))
{
hr = CopyAttribute(pType, pType2, MF_MT_FRAME_RATE);
}

if (SUCCEEDED(hr))
{
hr = CopyAttribute(pType, pType2, MF_MT_PIXEL_ASPECT_RATIO);
}

if (SUCCEEDED(hr))
{
hr = CopyAttribute(pType, pType2, MF_MT_INTERLACE_MODE);
}

if (SUCCEEDED(hr))
{
DWORD pdwStreamIndex = 0;
hr = pWriter->AddStream(pType2, &pdwStreamIndex);
}

pType2->Release();

return hr;
}

HRESULT CopyAttribute(IMFAttributes *pSrc, IMFAttributes *pDest, const GUID key)
{
PROPVARIANT var;
PropVariantInit(&var);

HRESULT hr = S_OK;

hr = pSrc->GetItem(key, &var);
if (SUCCEEDED(hr))
{
hr = pDest->SetItem(key, var);
}

PropVariantClear(&var);
return hr;
}
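One detail worth calling out from the capture loop above is the timestamp rebasing. Media Foundation sample times are expressed in 100 nanosecond units and the source reader’s timestamps don’t start at zero, so the first sample’s time is recorded as a base and subtracted from every subsequent sample before it’s handed to the sink writer. The helpers below are just my own illustration of that arithmetic, not Media Foundation calls.

```cpp
#include <cstdint>

// Media Foundation sample times are in 100 nanosecond (hns) units.
// The sink writer expects the first sample at (or near) time zero, so
// every timestamp is rebased against the first one seen.
int64_t RebaseTimestamp(int64_t timestampHns, int64_t baseTimeHns)
{
    return timestampHns - baseTimeHns;
}

// Convert a rebased timestamp to milliseconds for logging.
int64_t HnsToMilliseconds(int64_t hns)
{
    return hns / 10000; // 10,000 hns per millisecond.
}
```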

One thing that’s missing is audio. I’ve got the video into the .mp4 file but I need an audio stream in there as well.

The next step is to get audio in and then try and check that the media file will be understood by a different video softphone, probably Counterpath’s Bria since I already have that installed.

It feels like I’ve made a lot of progress in the last few days although reflecting on what I’ve achieved I actually haven’t got much closer to the softphone goal. My two accomplishments, which seemed exciting at the time, were:

  • Successfully get a list of all the video modes that my webcam supports,
  • Get a video stream from my webcam and save a single frame as a bitmap.

The first step was to get an IMFSourceReader from the IMFMediaSource (my webcam) I created in part II. My understanding of the way these two interfaces work is that IMFMediaSource is implemented by a class that wraps a device, file, network stream etc. that is capable of providing some audio or video and IMFSourceReader by the class that knows how to read samples from the media source.

The code I used to list my webcam’s video modes is shown below.

// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
    // Create the source reader.
    IMFSourceReader *pReader;
 
    hr = MFCreateSourceReaderFromMediaSource(*ppSource, pConfig, &pReader);

    if (SUCCEEDED(hr))
    {
        DWORD dwMediaTypeIndex = 0;
        while (SUCCEEDED(hr))
        {
            IMFMediaType *pType = NULL;
            hr = pReader->GetNativeMediaType(0, dwMediaTypeIndex, &pType);
            if (hr == MF_E_NO_MORE_TYPES)
            {
                hr = S_OK;
                break;
            }
            else if (SUCCEEDED(hr))
            {
                // Examine the media type. 
                CMediaTypeTrace *nativeTypeMediaTrace = new CMediaTypeTrace(pType);
                printf("Native media type: %s.\n", nativeTypeMediaTrace->GetString());
                pType->Release();
            }

            ++dwMediaTypeIndex;
        }
    }
}

The code in the snippet is all standard except for the CMediaTypeTrace class. That’s actually the useful class since it takes the IMFMediaType, which is mostly a bunch of GUIDs that map to constants describing one of the webcam’s modes, and spits out some plain English to represent the resolution, format etc. of the mode. The CMediaTypeTrace class is not actually in the Media Foundation library and instead is provided in mediatypetrace.h, which is in one of the samples in the MediaFoundation directory that comes with the Windows SDK (on my system it’s in Windows\v7.1\Samples\multimedia\mediafoundation\topoedit\tedutil). As it happens the two video formats that my camera supports, RGB24 and I420, were not included in the list of GUIDs in mediatypetrace.h so I had to search around the place to find what they were and then add them in.

LPCSTR STRING_FROM_GUID( GUID Attr )
{
    ...
    INTERNAL_GUID_TO_STRING( MFVideoFormat_RGB24, 14 );	  // RGB24
    INTERNAL_GUID_TO_STRING( WMMEDIASUBTYPE_I420, 15 );   // I420
}

The full list of modes my webcam supports is shown below.

Device Name: Logitech QuickCam Pro 9000.
Current media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 640, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 640, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 160, H: 90.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 160, H: 100.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 160, H: 120.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 176, H: 144.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 320, H: 180.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 320, H: 200.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 320, H: 240.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 352, H: 288.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 640, H: 360.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 640, H: 400.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 864, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 768, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 800, H: 450.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 800, H: 500.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 800, H: 600.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 960, H: 720.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1280, H: 720.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1280, H: 800.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1280, H: 1024.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1600, H: 900.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1600, H: 1000.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1600, H: 1200.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 640, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 160, H: 90.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 160, H: 100.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 160, H: 120.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 176, H: 144.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 320, H: 180.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 320, H: 200.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 320, H: 240.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 352, H: 288.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 640, H: 360.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 640, H: 400.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 864, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 768, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 800, H: 450.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 800, H: 500.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 800, H: 600.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 960, H: 720.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1280, H: 720.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1280, H: 800.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1280, H: 1024.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1600, H: 900.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1600, H: 1000.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1600, H: 1200.

The second thing I was able to do was to take a sample from my webcam and save it as a bitmap. To do this I took some short-cuts, namely hard coding the size of the sample, which I know from my webcam’s default mode (640 x 480), and relying on the fact that that mode does not result in any padding (I’m not 100% on that and have taken an educated guess). I found someone else’s sample that created a bitmap file and blatantly copied it. Below is the code I used to extract the sample and save the bitmap.

// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
	// Create the source reader.
	IMFSourceReader *pReader;

	hr = MFCreateSourceReaderFromMediaSource(
		*ppSource,
		pConfig,
		&pReader);

	//GetCurrentMediaType(pReader);
	//ListModes(pReader);
				
	DWORD streamIndex, flags;
	LONGLONG llTimeStamp;
	IMFSample *pSample = NULL;

	while(!pSample)
	{
		// Initial read results in a null pSample??
		hr = pReader->ReadSample(
			MF_SOURCE_READER_ANY_STREAM,    // Stream index.
			0,                              // Flags.
			&streamIndex,                   // Receives the actual stream index. 
			&flags,                         // Receives status flags.
			&llTimeStamp,                   // Receives the time stamp.
			&pSample                        // Receives the sample or NULL.
			);

		wprintf(L"Stream %d (%I64d)\n", streamIndex, llTimeStamp);
	}

	// Use non-2D version of sample.
	IMFMediaBuffer *mediaBuffer = NULL;
	BYTE *pData = NULL;
	DWORD writePosn = 0;

	pSample->ConvertToContiguousBuffer(&mediaBuffer);

	hr = mediaBuffer->Lock(&pData, NULL, NULL);

	HANDLE file = CreateBitmapFile(&writePosn);

	WriteFile(file, pData, 640 * 480 * (24/8), &writePosn, NULL);

	CloseHandle(file);

	mediaBuffer->Unlock();

	// Shut down Media Foundation.
	MFShutdown();
}

HANDLE CreateBitmapFile(DWORD *writePosn)
{
	HANDLE file;
	BITMAPFILEHEADER fileHeader;
	BITMAPINFOHEADER fileInfo;
	//DWORD write = 0;
 
	file = CreateFile(L"sample.bmp",GENERIC_WRITE,0,NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL);  //Sets up the new bmp to be written to
 
	fileHeader.bfType = 19778;                                                                    //Sets our type to BM for bmp
	fileHeader.bfSize = sizeof(BITMAPFILEHEADER) + sizeof(BITMAPINFOHEADER) + 640 * 480 * (24/8); //Total file size: both headers plus the pixel data
	fileHeader.bfReserved1 = 0;                                                                    //sets the reserves to 0
	fileHeader.bfReserved2 = 0;
	fileHeader.bfOffBits = sizeof(BITMAPFILEHEADER)+sizeof(BITMAPINFOHEADER);                    //Sets offbits equal to the size of the file and info headers
 
	fileInfo.biSize = sizeof(BITMAPINFOHEADER);
	fileInfo.biWidth = 640;
	fileInfo.biHeight = 480;
	fileInfo.biPlanes = 1;
	fileInfo.biBitCount = 24;
	fileInfo.biCompression = BI_RGB;
	fileInfo.biSizeImage = 640 * 480 * (24/8);
	fileInfo.biXPelsPerMeter = 2400;
	fileInfo.biYPelsPerMeter = 2400;
	fileInfo.biClrImportant = 0;
	fileInfo.biClrUsed = 0;
 
	WriteFile(file, &fileHeader, sizeof(fileHeader), writePosn, NULL);
	WriteFile(file, &fileInfo, sizeof(fileInfo), writePosn, NULL);

	return file;
}
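On the padding question above: RGB24 rows in a BMP are padded up to a 4 byte boundary, so one way to check my educated guess is to compute the padded row stride. At 640 pixels a row is 1920 bytes, already a multiple of 4, so there happens to be no padding for this mode. The helper below is just my own arithmetic for that check, not an API call.

```cpp
#include <cstdint>

// Row stride in bytes for a 24 bits-per-pixel image, padded up to
// the next 4 byte (DWORD) boundary as the BMP format requires.
uint32_t Rgb24Stride(uint32_t widthPixels)
{
    return (widthPixels * 3 + 3) & ~3u;
}
```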

So that was all fun but it hasn’t gotten me much closer to having an H.264 stream ready for bundling into my RTP packets. Getting the H.264 stream will be my next focus. I think I’ll try capturing it to an .mp4 file as a first step. Actually I wonder if there’s a way I can test an .mp4 file with a softphone and VLC? That would be a handy way to test if the H.264 stream I get is actually going to work when I use it in a VoIP call.

I also ordered Developing Microsoft Media Foundation Applications from Amazon thinking it might help me on this journey, only to find it available for free online a couple of days later :(.

Got the video device enumeration code working, at least well enough for it to tell me my webcam is a Logitech QuickCam Pro 9000. The working code is below.

#include "stdafx.h"
#include <mfapi.h>
#include <mfplay.h>
#include "common.h"

HRESULT CreateVideoDeviceSource(IMFMediaSource **ppSource);

int _tmain(int argc, _TCHAR* argv[])
{
	printf("Get webcam properties test console.\n");

	CoInitializeEx(NULL, COINIT_APARTMENTTHREADED | COINIT_DISABLE_OLE1DDE);
	
	IMFMediaSource *ppSource = NULL;

    CreateVideoDeviceSource(&ppSource);

	getchar();

	return 0;
}

HRESULT CreateVideoDeviceSource(IMFMediaSource **ppSource)
{
     *ppSource = NULL;

    UINT32 count = 0;

    IMFAttributes *pConfig = NULL;
    IMFActivate **ppDevices = NULL;

    // Create an attribute store to hold the search criteria.
    HRESULT hr = MFCreateAttributes(&pConfig, 1);

    // Request video capture devices.
    if (SUCCEEDED(hr))
    {
        hr = pConfig->SetGUID(
            MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE, 
            MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE_VIDCAP_GUID
            );
    }

    // Enumerate the devices.
    if (SUCCEEDED(hr))
    {
        hr = MFEnumDeviceSources(pConfig, &ppDevices, &count);
    }

	printf("Device Count: %i.\n", count);

    // Create a media source for the first device in the list.
    if (SUCCEEDED(hr))
    {
        if (count > 0)
        {
            hr = ppDevices[0]->ActivateObject(IID_PPV_ARGS(ppSource));

			if (SUCCEEDED(hr))
			{
				WCHAR *szFriendlyName = NULL;
    
				// Try to get the display name.
				UINT32 cchName;
				hr = ppDevices[0]->GetAllocatedString(
					MF_DEVSOURCE_ATTRIBUTE_FRIENDLY_NAME,
					&szFriendlyName, &cchName);

				if (SUCCEEDED(hr))
				{
					wprintf(L"Device Name: %s.\n", szFriendlyName);
				}
				else
				{
					printf("Error getting device attribute.");
				}

				CoTaskMemFree(szFriendlyName);
			}
		}
        else
        {
            hr = MF_E_NOT_FOUND;
        }
    }

    for (DWORD i = 0; i < count; i++)
    {
        ppDevices[i]->Release();
    }
    CoTaskMemFree(ppDevices);
    return hr;
}

The problem I had previously was a missing call to CoInitializeEx. It seems it’s needed to initialise things to allow the use of COM libraries.

The next step is now to work out how to get a sample from the webcam.

This is the first post in what will hopefully be a successful series of posts detailing how I manage to build a video capable softphone using Windows Media Foundation.

I’d never heard of Media Foundation until last week so not only do I not know how to use it, I also don’t know whether it will be suitable for the task. I do know it is the successor to the Windows DirectShow API but does not yet provide the same coverage so I may have to delve into the DirectShow API as well. On top of that neither API has a comprehensive managed .Net interface so that means the job needs to be done in C++. My C++ skills are severely undernourished so I’m expecting it to take a while to get up to speed before I can start really diving into the APIs.

What I do have is a basic working softphone that I can build on which means I can focus on the video side of things. My goal is to be able to place a SIP H.264 video call with my webcam to another video softphone, such as Counterpath’s Bria. Given other things going on at the moment, such as a 7 week old baby and a 3 year old, I’m estimating the project could take 2 to 3 months. As to why I’m interested in this it’s because it’s something different from both .Net and SIP. I’ve been working with both those for a long time so taking a break and playing with something different but still related is appealing.

Enough chit chat, getting started…

1. The first thing I’ve done is to install the Windows SDK for Windows 7 and take a look at the Media Foundation sample projects. The first sample I tried was the SimpleCapture project and it ran fine out of the box.

2. After looking through a few more of the samples I feel the need to get coding. Being able to get a video stream from my webcam is the obvious place to start. I’ve created a C++ Win32 console application and found an article which discusses enumerating the system’s video capture devices. I haven’t gotten very far as yet but I’m now wondering if my Logitech Webcam Pro 9000 driver supports H.264 meaning I wouldn’t need to use any of the Media Foundation H.264 codec capabilities? A quick look at the camera’s specification page and I’m pretty sure the answer is no.

3. I’ve now got the sample compiling and running but the count of my video devices is coming back as 0 🙁 so I’ve probably got some flags wrong somewhere.

The first couple of hours hasn’t got me very far yet. More tomorrow.

For some reason after being completely disinterested in doing anything with the RTP and audio side of VoIP calls for the last 5 or so years suddenly in the last month I decided to explore how well a .Net based softphone would work. Consequently I started tinkering around with a .Net library called NAudio that I’d seen mentioned around the traps. For my purposes NAudio provided a convenient way to get at the underlying Windows API calls for interacting with audio input and output devices. It took a little bit of time and effort to get things working but eventually I was able to successfully read audio samples from my microphone and write samples to my speakers through a test .Net application.

The softphone is open source and available in binary form here and the source is available here in the sipsorcery-softphone project. Before going any further it should be noted that the softphone is extremely rudimentary and geared towards developers or VoIP hobbyists wanting to tinker rather than end users looking for trouble free calling. The user interface is extremely lacking and there are also crucial components missing such as echo cancellation, a jitter buffer, codec support (G.711 u-law is the only codec supported) etc.

My original verdict on using .Net as a softphone platform was that it was not particularly good. This was due to the fact that the microphone samples coming from NAudio were only capable of being delivered with a sample period of 200ms, which is useless since in practice the jitter buffer at the remote end will drop any packet over 50 or 100ms. However it turned out that a combination of some inefficient code in my RTP packet parsing and the fact that I was testing by running the softphone in Visual Studio debug mode was responsible for the high sampling latency. Once those issues were removed the microphone samples have been delivered reliably with a sample period of 20ms exactly as required. I was thinking if I ever wanted to have a usable softphone I’d have to move the RTP and audio processing to a C++ library but now I’m starting to believe that’s not necessary and .Net is capable of handling the 20ms sample period.
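For concreteness, the arithmetic behind those packet sizes: G.711 u-law runs at 8000 samples per second with one byte per sample, so each 20ms packet carries 160 bytes of payload, while a 200ms sample period would have bundled ten packets’ worth (1600 bytes) into a single delivery. A quick sketch of that calculation (my own helper, not part of any RTP library):

```cpp
#include <cstdint>

// Payload bytes per RTP packet for G.711 u-law: 8000 samples/sec,
// 1 byte per sample, multiplied by the packetisation period.
uint32_t G711PayloadBytes(uint32_t periodMs)
{
    const uint32_t sampleRate = 8000;  // G.711 sample rate in Hz.
    const uint32_t bytesPerSample = 1; // u-law encodes each sample in one byte.
    return sampleRate * bytesPerSample * periodMs / 1000;
}
```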

The other thing worth mentioning about the softphone is that it’s capable of placing calls directly to Google Voice’s XMPP gateway. I’m still surprised that none of the mainstream softphone developers have bothered to add the STUN bindings to their RTP stacks so that they could work with Google Voice. In the end I decided I’d just prototype it myself for kicks. For a softphone that already has RTP and STUN protocol support, adding the ability to work with Google Voice in conjunction with a SIP-to-XMPP gateway (which SIPSorcery could do) would literally be less than 20 lines of code.

Hopefully the softphone will be useful to someone. Judging by the number of queries I get about the SIPSorcery softphone project and the questions about .Net softphones on Stack Overflow I imagine it will be.