Video


With WebRTC starting to gain momentum I have been doing a bit of work recently on integrating a custom .Net/C++ application with WebRTC capable browsers (at the time of writing Chrome and Firefox). It’s been challenging work and there are a lot of moving parts involved with WebRTC. The good thing is that once the integration challenges have been overcome WebRTC streaming works very well, and my experience of WebRTC video streams so far has been a lot better than my experience with SIP video soft phones.

The SIPSorcery code base now has all the components needed to develop prototype applications that can integrate with WebRTC browsers. An example of a WebRTC video test pattern can be found here. While it is a very rudimentary application that simply adds a timestamp to a static JPEG image, it does demonstrate the network and encoding pipeline necessary to get media streams working between a .Net application and a WebRTC browser.

The good thing is that if you have a project that needs .Net and WebRTC you might be able to use the SIPSorcery code as your starting point and perhaps even contribute to it. To create your own WebRTC project you need to add the SIPSorcery and SIPSorceryMedia nuget packages. The SIPSorceryMedia package uses some native C++ dlls and does require the Visual C++ Redistributable Packages for Visual Studio 2013 to be installed on any machine that executes it. It’s also built for x64 only so won’t run on 32-bit systems (I can build it for other platforms if anyone ever has a need).

An example of how to use the WebRTC components is in the source code of the WebRTCVideoSample project referenced above. The main classes are WebRTCDaemon and WebRtcSession. The WebRtcSession class is where the ICE connection establishment is coordinated and the connection to the WebRTC browser is set up. The WebRTCDaemon does the web socket signalling, which is not part of WebRTC but does provide a handy way to get the session established, and also the VP8 encoding of the timestamped JPEG image. I’ve included the WebRtcSession code below to show the pretty small amount of code required.

using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using SIPSorceryMedia;
using SIPSorcery.Net;
using SIPSorcery.Sys;
using log4net;

namespace WebRTCVideoServer
{
    public class WebRtcSession
    {
        private const int RTP_MAX_PAYLOAD = 1400;
        private const int TIMESTAMP_SPACING = 3000;
        private const int PAYLOAD_TYPE_ID = 100;
        private const int SRTP_AUTH_KEY_LENGTH = 10;

        private static ILog logger = AppState.logger;

        public WebRtcPeer Peer;
        public DtlsManaged DtlsContext;
        public SRTPManaged SrtpContext;
        public SRTPManaged SrtpReceiveContext;  // Used to decrypt packets received from the remote peer.

        public string CallID
        {
            get { return Peer.CallID; }
        }

        public WebRtcSession(string callID)
        {
            Peer = new WebRtcPeer() { CallID = callID };
        }

        public void DtlsPacketReceived(IceCandidate iceCandidate, byte[] buffer, IPEndPoint remoteEndPoint)
        {
            logger.Debug("DTLS packet received " + buffer.Length + " bytes from " + remoteEndPoint.ToString() + ".");

            if (DtlsContext == null)
            {
                DtlsContext = new DtlsManaged();
                int res = DtlsContext.Init();
                logger.Debug("DtlsContext initialisation result=" + res);
            }

            int bytesWritten = DtlsContext.Write(buffer, buffer.Length);

            if (bytesWritten != buffer.Length)
            {
                logger.Warn("The required number of bytes were not successfully written to the DTLS context.");
            }
            else
            {
                byte[] dtlsOutBytes = new byte[2048];

                int bytesRead = DtlsContext.Read(dtlsOutBytes, dtlsOutBytes.Length);

                if (bytesRead == 0)
                {
                    logger.Debug("No bytes read from DTLS context :(.");
                }
                else
                {
                    logger.Debug(bytesRead + " bytes read from DTLS context sending to " + remoteEndPoint.ToString() + ".");
                    iceCandidate.LocalRtpSocket.SendTo(dtlsOutBytes, 0, bytesRead, SocketFlags.None, remoteEndPoint);

                    //if (client.DtlsContext.IsHandshakeComplete())
                    if (DtlsContext.GetState() == 3)
                    {
                        logger.Debug("DTLS negotiation complete for " + remoteEndPoint.ToString() + ".");
                        SrtpContext = new SRTPManaged(DtlsContext, false);
                        SrtpReceiveContext = new SRTPManaged(DtlsContext, true);
                        Peer.IsDtlsNegotiationComplete = true;
                        iceCandidate.RemoteRtpEndPoint = remoteEndPoint;
                    }
                }
            }
        }

        public void MediaPacketReceived(IceCandidate iceCandidate, byte[] buffer, IPEndPoint remoteEndPoint)
        {
            if ((buffer[0] >= 128) && (buffer[0] <= 191))
            {
                //logger.Debug("A non-STUN packet was received Receiver Client.");

                if (buffer[1] == 0xC8 /* RTCP SR */ || buffer[1] == 0xC9 /* RTCP RR */)
                {
                    // RTCP packet.
                    //webRtcClient.LastSTUNReceiveAt = DateTime.Now;
                }
                else
                {
                    // RTP packet.
                    //int res = peer.SrtpReceiveContext.UnprotectRTP(buffer, buffer.Length);

                    //if (res != 0)
                    //{
                    //    logger.Warn("SRTP unprotect failed, result " + res + ".");
                    //}
                }
            }
            else
            {
                logger.Debug("An unrecognised packet was received on the WebRTC media socket.");
            }
        }

        public void Send(byte[] buffer)
        {
            try
            {
                Peer.LastTimestamp = (Peer.LastTimestamp == 0) ? RTSPSession.DateTimeToNptTimestamp32(DateTime.Now) : Peer.LastTimestamp + TIMESTAMP_SPACING;

                for (int index = 0; index * RTP_MAX_PAYLOAD < buffer.Length; index++)
                {
                    int offset = (index == 0) ? 0 : (index * RTP_MAX_PAYLOAD);
                    int payloadLength = (offset + RTP_MAX_PAYLOAD < buffer.Length) ? RTP_MAX_PAYLOAD : buffer.Length - offset;

                    byte[] vp8HeaderBytes = (index == 0) ? new byte[] { 0x10 } : new byte[] { 0x00 };

                    RTPPacket rtpPacket = new RTPPacket(payloadLength + SRTP_AUTH_KEY_LENGTH + vp8HeaderBytes.Length);
                    rtpPacket.Header.SyncSource = Peer.SSRC;
                    rtpPacket.Header.SequenceNumber = Peer.SequenceNumber++;
                    rtpPacket.Header.Timestamp = Peer.LastTimestamp;
                    rtpPacket.Header.MarkerBit = ((offset + payloadLength) >= buffer.Length) ? 1 : 0; // Set marker bit for the last packet in the frame.
                    rtpPacket.Header.PayloadType = PAYLOAD_TYPE_ID;

                    Buffer.BlockCopy(vp8HeaderBytes, 0, rtpPacket.Payload, 0, vp8HeaderBytes.Length);
                    Buffer.BlockCopy(buffer, offset, rtpPacket.Payload, vp8HeaderBytes.Length, payloadLength);

                    var rtpBuffer = rtpPacket.GetBytes();

                    int rtperr = SrtpContext.ProtectRTP(rtpBuffer, rtpBuffer.Length - SRTP_AUTH_KEY_LENGTH);
                    if (rtperr != 0)
                    {
                        logger.Warn("SRTP packet protection failed, result " + rtperr + ".");
                    }
                    else
                    {
                        var connectedIceCandidate = Peer.LocalIceCandidates.Where(y => y.RemoteRtpEndPoint != null).First();
                        connectedIceCandidate.LocalRtpSocket.SendTo(rtpBuffer, connectedIceCandidate.RemoteRtpEndPoint);
                    }
                }
            }
            catch (Exception sendExcp)
            {
                logger.Error("Exception WebRtcSession.Send. " + sendExcp.Message);
            }
        }
    }
}

The SIPSorceryMedia assembly utilises the Cisco libsrtp (for secure RTP), openssl (for DTLS) and libvpx (for VP8 encoding). It also makes use of Microsoft’s Media Foundation and the ffmpeg libraries for various bits and pieces.


After a brief interlude (nearly 3 years) I recently got motivated to look at using the SIPSorcery SIP stack to build a video softphone. Of course there are already a number of fully featured video softphones available so the project was for fun rather than to solve any particular problem.

Unlike my previous attempts this time I have been successful and the image below shows the video softphone prototype on a call with CounterPath’s Bria Softphone (that’s me chatting to Max).


I did end up using Microsoft’s Media Foundation (MF) for getting samples from my web cam but I gave up on trying to use the MF H264 codec and instead used the VP8 codec from the WebM project. A motivation to use the VP8 codec is that it was the initial codec proposed for WebRTC and at some point I’d like to experiment with placing calls from the softphone to a browser.

The video softphone is available at the sipsorcery CodePlex repository under the sipsorcery-softphonev2 folder. All the MF and libvpx integration is contained in the sipsorcery-media folder.

The new softphone is purely experimental and video calls do not even work with the only other softphone I tested with, Jitsi, due to a VP8 framing problem. But it is a working implementation of a Windows video softphone so may prove useful for anyone who wants to do some work in those areas.

Enjoy.

I’ve been able to sort of accomplish my goal of recording audio and video streams to the same MP4 file. The attached code sample does the job, however there is something wrong with the way I’m doing the sampling: the audio and video are slightly out of sync and the audio is also truncated at the end. I thought it was worth posting the sample anyway as it’s taken me a few days to finally get it working. The key was getting the audio’s media attributes correctly set for the writer.

// Configure the audio stream.
// See http://msdn.microsoft.com/en-us/library/windows/desktop/dd742785(v=vs.85).aspx for AAC encoder settings.
// http://msdn.microsoft.com/en-us/library/ff819476%28VS.85%29.aspx
CHECK_HR(MFCreateMediaType( &pAudioOutType ), L"Configure encoder failed to create media type for audio output sink." );
CHECK_HR( pAudioOutType->SetGUID( MF_MT_MAJOR_TYPE, MFMediaType_Audio ), L"Failed to set audio writer attribute, media type." );  
CHECK_HR( pAudioOutType->SetGUID( MF_MT_SUBTYPE, MFAudioFormat_AAC ), L"Failed to set audio writer attribute, audio format (AAC).");
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_NUM_CHANNELS, 2 ), L"Failed to set audio writer attribute, number of channels." );
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_BITS_PER_SAMPLE, 16 ), L"Failed to set audio writer attribute, bits per sample." );
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_SAMPLES_PER_SECOND, 44100 ), L"Failed to set audio writer attribute, samples per second.");
CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AUDIO_AVG_BYTES_PER_SECOND, 16000 ), L"Failed to set audio writer attribute, average bytes per second.");
//CHECK_HR( pAudioOutType->SetUINT32( MF_MT_AAC_AUDIO_PROFILE_LEVEL_INDICATION, 0x29 ), L"Failed to set audio writer attribute, level indication.");
CHECK_HR( pWriter->AddStream( pAudioOutType, &audioStreamIndex ), L"Failed to add the audio stream to the sink writer.");
pAudioOutType->Release();

RecordMP4 code sample

I haven’t made much progress since the last post except to determine that I was barking up the wrong tree by attempting to combine the audio and video streams with the Media Foundation. I decided to check the RTP RFCs related to H.263 and H.264 to determine how the audio and video combination should be transmitted and it turns out they are pretty much independent. That means to start with I can use the existing softphone code for the audio side of things and use the Media Foundation to do the video encoding and decoding. I’m thinking of switching to H.263 for the first video encoding mechanism as it’s simpler than H.264 and will be easier to package up into RTP.
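That independence is visible in the SDP of a video call: the audio and video are offered as separate m= lines, each with its own RTP port and payload type, and each runs as its own RTP session. A made-up example offer (the address, ports and dynamic payload number here are illustrative only):

```
v=0
o=- 0 0 IN IP4 192.168.0.10
s=video call
c=IN IP4 192.168.0.10
t=0 0
m=audio 10000 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 10002 RTP/AVP 96
a=rtpmap:96 H263-1998/90000
```

The existing audio code doesn’t need to know anything about the video stream; the two only meet in the SDP negotiation.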

For the moment I will keep going with my attempt to get the Media Foundation to save a video and audio stream into a single .mp4 file as I think that will be a very useful piece of code to have around. The problem I’m having at the moment is getting the audio encoding working. The audio device I’m using returns a WAVE_FORMAT_IEEE_FLOAT stream but from what I can determine I need to convert it to something like MFAudioFormat_Dolby_AC3_SPDIF or MFAudioFormat_AAC for MPEG4. I need to investigate that some more.

One added benefit of looking into how RTP transmits the audio and video streams is that I finally got around to getting a video call working between my desktop Bria softphone and the iPhone Bria version. I’ve never had much luck with video calls between Bria softphones in the past, the video would get through sporadically but not very reliably. So I was pleasantly surprised when with a few tweaks of the NAT settings on the iPhone version I was able to get a call working reliably. Although it’s still not quite perfect as the video streams will only work if the desktop softphone calls the iPhone softphone and not the other way around. Still it’s nice to see video calls supported through the SIPSorcery server with absolutely no server configuration required. That’s the advantage of a SIP server deployment with no media proxying or transcoding.

Capturing and converting the video stream from my webcam to an H.264 file didn’t prove to be as bad as I thought. It did help a lot that the Media Foundation SDK has a sample called MFCaptureToFile that’s doing exactly the same thing and the vast majority of the code I’ve used below has been copied and pasted directly from that sample.

Here’s the 5 second .mp4 video that represents the fruits of my labour.

The important code bits are shown below (just to reiterate I barely know what I’m doing with this stuff and generally remove the error checking and memory management code to help me get the gist of the bits I’m interested in).


// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
    WCHAR *pwszFileName = L"sample.mp4";
    IMFSinkWriter *pWriter;

    hr = MFCreateSinkWriterFromURL(
        pwszFileName,
        NULL,
        NULL,
        &pWriter);

    // Create the source reader.
    IMFSourceReader *pReader;

    hr = MFCreateSourceReaderFromMediaSource(*ppSource, pConfig, &pReader);

    //GetCurrentMediaType(pReader);
    //ListModes(pReader);

    IMFMediaType *pType = NULL;
    pReader->GetCurrentMediaType((DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM, &pType);

    printf("Configuring H.264 sink.\n");

    // Set up the H.264 sink.
    hr = ConfigureEncoder(pType, pWriter);
    if (FAILED(hr))
    {
        printf("Configuring the H.264 sink failed.\n");
    }

    // Register the color converter DSP for this process, in the video
    // processor category. This will enable the sink writer to enumerate
    // the color converter when the sink writer attempts to match the
    // media types.
    hr = MFTRegisterLocalByCLSID(
        __uuidof(CColorConvertDMO),
        MFT_CATEGORY_VIDEO_PROCESSOR,
        L"",
        MFT_ENUM_FLAG_SYNCMFT,
        0,
        NULL,
        0,
        NULL);

    hr = pWriter->SetInputMediaType(0, pType, NULL);
    if (FAILED(hr))
    {
        printf("Failure setting the input media type on the H.264 sink.\n");
    }

    hr = pWriter->BeginWriting();
    if (FAILED(hr))
    {
        printf("Failed to begin writing on the H.264 sink.\n");
    }

    DWORD streamIndex, flags;
    LONGLONG llTimeStamp;
    IMFSample *pSample = NULL;
    CRITICAL_SECTION critsec;
    BOOL bFirstSample = TRUE;
    LONGLONG llBaseTime = 0;
    int sampleCount = 0;

    InitializeCriticalSection(&critsec);

    printf("Recording...\n");

    while (sampleCount < 100)
    {
        hr = pReader->ReadSample(
            MF_SOURCE_READER_ANY_STREAM, // Stream index.
            0,                           // Flags.
            &streamIndex,                // Receives the actual stream index.
            &flags,                      // Receives status flags.
            &llTimeStamp,                // Receives the time stamp.
            &pSample                     // Receives the sample or NULL.
            );

        wprintf(L"Stream %d (%I64d)\n", streamIndex, llTimeStamp);

        if (pSample)
        {
            if (bFirstSample)
            {
                llBaseTime = llTimeStamp;
                bFirstSample = FALSE;
            }

            // Rebase the time stamp.
            llTimeStamp -= llBaseTime;

            hr = pSample->SetSampleTime(llTimeStamp);

            if (FAILED(hr))
            {
                printf("Set sample time failed.\n");
            }

            hr = pWriter->WriteSample(0, pSample);

            if (FAILED(hr))
            {
                printf("Write sample failed.\n");
            }
        }

        sampleCount++;
    }

    printf("Finalising the capture.\n");

    if (pWriter)
    {
        hr = pWriter->Finalize();
    }

    //WriteSampleToBitmap(pSample);

    // Shut down Media Foundation.
    MFShutdown();
}

HRESULT ConfigureEncoder(IMFMediaType *pType, IMFSinkWriter *pWriter)
{
    HRESULT hr = S_OK;

    IMFMediaType *pType2 = NULL;

    hr = MFCreateMediaType(&pType2);

    if (SUCCEEDED(hr))
    {
        hr = pType2->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    }

    if (SUCCEEDED(hr))
    {
        hr = pType2->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_H264);
    }

    if (SUCCEEDED(hr))
    {
        hr = pType2->SetUINT32(MF_MT_AVG_BITRATE, 240 * 1000);
    }

    if (SUCCEEDED(hr))
    {
        hr = CopyAttribute(pType, pType2, MF_MT_FRAME_SIZE);
    }

    if (SUCCEEDED(hr))
    {
        hr = CopyAttribute(pType, pType2, MF_MT_FRAME_RATE);
    }

    if (SUCCEEDED(hr))
    {
        hr = CopyAttribute(pType, pType2, MF_MT_PIXEL_ASPECT_RATIO);
    }

    if (SUCCEEDED(hr))
    {
        hr = CopyAttribute(pType, pType2, MF_MT_INTERLACE_MODE);
    }

    if (SUCCEEDED(hr))
    {
        DWORD dwStreamIndex = 0;
        hr = pWriter->AddStream(pType2, &dwStreamIndex);
    }

    pType2->Release();

    return hr;
}

HRESULT CopyAttribute(IMFAttributes *pSrc, IMFAttributes *pDest, const GUID key)
{
    PROPVARIANT var;
    PropVariantInit(&var);

    HRESULT hr = pSrc->GetItem(key, &var);
    if (SUCCEEDED(hr))
    {
        hr = pDest->SetItem(key, var);
    }

    PropVariantClear(&var);
    return hr;
}

One thing that’s missing is audio. I’ve got the video into the .mp4 file but I need an audio stream in there as well.

The next step is to get audio in and then try and check that the media file will be understood by a different video softphone, probably Counterpath’s Bria since I already have that installed.

It feels like I’ve made a lot of progress in the last few days although reflecting on what I’ve achieved I actually haven’t got much closer to the softphone goal. My two accomplishments, which seemed exciting at the time, were:

  • Successfully get a list of all the video modes that my webcam supports,
  • Get a video stream from my webcam and save a single frame as a bitmap.

The first step was to get an IMFSourceReader from the IMFMediaSource (my webcam) that I created in part II. My understanding of the way these two interfaces work is that IMFMediaSource is implemented by a class that wraps a device, file, network stream etc. that is capable of providing some audio or video and IMFSourceReader by the class that knows how to read samples from the media source.

The code I used to list my webcam’s video modes is shown below.

// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
    // Create the source reader.
    IMFSourceReader *pReader;

    hr = MFCreateSourceReaderFromMediaSource(*ppSource, pConfig, &pReader);

    if (SUCCEEDED(hr))
    {
        DWORD dwMediaTypeIndex = 0;

        while (SUCCEEDED(hr))
        {
            IMFMediaType *pType = NULL;
            hr = pReader->GetNativeMediaType(0, dwMediaTypeIndex, &pType);
            if (hr == MF_E_NO_MORE_TYPES)
            {
                hr = S_OK;
                break;
            }
            else if (SUCCEEDED(hr))
            {
                // Examine the media type.
                CMediaTypeTrace *nativeTypeMediaTrace = new CMediaTypeTrace(pType);
                printf("Native media type: %s.\n", nativeTypeMediaTrace->GetString());
                pType->Release();
            }

            ++dwMediaTypeIndex;
        }
    }
}

The code in the snippet is just standard Media Foundation apart from the CMediaTypeTrace class. That’s actually the useful class since it takes the IMFMediaType, which is mostly a bunch of GUIDs that map to constants to describe one of the webcam’s modes, and spits out some plain English to represent the resolution, format etc. of the webcam’s mode. The CMediaTypeTrace class is not actually in the Media Foundation library and instead is provided in mediatypetrace.h which is in one of the samples in the MediaFoundation directory that comes with the Windows SDK (on my system it’s in Windows\v7.1\Samples\multimedia\mediafoundation\topoedit\tedutil). As it happens the two video modes that my camera supports, RGB24 and I420, were not included in the list of GUIDs in mediatypetrace.h so I had to search around the place to find what they were and then add them in.

LPCSTR STRING_FROM_GUID( GUID Attr )
{
    ...
    INTERNAL_GUID_TO_STRING( MFVideoFormat_RGB24, 14 );	  // RGB24
    INTERNAL_GUID_TO_STRING( WMMEDIASUBTYPE_I420, 15 );   // I420
}

A full list of the modes my web cam supports are listed below.

Device Name: Logitech QuickCam Pro 9000.
Current media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 640, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 640, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 160, H: 90.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 160, H: 100.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 160, H: 120.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 176, H: 144.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 320, H: 180.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 320, H: 200.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 320, H: 240.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 352, H: 288.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 640, H: 360.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 640, H: 400.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 864, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 768, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 800, H: 450.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 800, H: 500.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 800, H: 600.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 960, H: 720.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1280, H: 720.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1280, H: 800.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1280, H: 1024.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1600, H: 900.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1600, H: 1000.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=RGB24, FRAME_SIZE=W 1600, H: 1200.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 640, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 160, H: 90.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 160, H: 100.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 160, H: 120.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 176, H: 144.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 320, H: 180.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 320, H: 200.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 320, H: 240.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 352, H: 288.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 640, H: 360.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 640, H: 400.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 864, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 768, H: 480.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 800, H: 450.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 800, H: 500.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 800, H: 600.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 960, H: 720.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1280, H: 720.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1280, H: 800.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1280, H: 1024.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1600, H: 900.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1600, H: 1000.
Native media type: Video: MAJOR_TYPE=Video, SUBTYPE=I420, FRAME_SIZE=W 1600, H: 1200.

The second thing I was able to do was to take a sample from my webcam and save it as a bitmap. To do this I took some short-cuts, namely hard coding the size of the sample, which I know from my webcam’s default mode (640 x 480), and relying on the fact that that mode does not result in any padding (I’m not 100% sure on that and have taken an educated guess). I found someone else’s sample that created a bitmap file and blatantly copied it. Below is the code I used to extract the sample and save the bitmap.

// Initialize the Media Foundation platform.
hr = MFStartup(MF_VERSION);
if (SUCCEEDED(hr))
{
    // Create the source reader.
    IMFSourceReader *pReader;

    hr = MFCreateSourceReaderFromMediaSource(
        *ppSource,
        pConfig,
        &pReader);

    //GetCurrentMediaType(pReader);
    //ListModes(pReader);

    DWORD streamIndex, flags;
    LONGLONG llTimeStamp;
    IMFSample *pSample = NULL;

    while (!pSample)
    {
        // Initial read results in a null pSample??
        hr = pReader->ReadSample(
            MF_SOURCE_READER_ANY_STREAM,    // Stream index.
            0,                              // Flags.
            &streamIndex,                   // Receives the actual stream index.
            &flags,                         // Receives status flags.
            &llTimeStamp,                   // Receives the time stamp.
            &pSample                        // Receives the sample or NULL.
            );

        wprintf(L"Stream %d (%I64d)\n", streamIndex, llTimeStamp);
    }

    // Use non-2D version of sample.
    IMFMediaBuffer *mediaBuffer = NULL;
    BYTE *pData = NULL;
    DWORD writePosn = 0;

    pSample->ConvertToContiguousBuffer(&mediaBuffer);

    hr = mediaBuffer->Lock(&pData, NULL, NULL);

    HANDLE file = CreateBitmapFile(&writePosn);

    WriteFile(file, pData, 640 * 480 * (24 / 8), &writePosn, NULL);

    CloseHandle(file);

    mediaBuffer->Unlock();

    // Shut down Media Foundation.
    MFShutdown();
}

HANDLE CreateBitmapFile(DWORD *writePosn)
{
    HANDLE file;
    BITMAPFILEHEADER fileHeader;
    BITMAPINFOHEADER fileInfo;

    // Sets up the new bmp to be written to.
    file = CreateFile(L"sample.bmp", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

    fileHeader.bfType = 19778;    // Sets our type to "BM" for bmp.
    fileHeader.bfSize = sizeof(BITMAPFILEHEADER) + sizeof(BITMAPINFOHEADER) + 640 * 480 * (24 / 8); // Total file size: the two headers plus the pixel data.
    fileHeader.bfReserved1 = 0;   // Sets the reserved fields to 0.
    fileHeader.bfReserved2 = 0;
    fileHeader.bfOffBits = sizeof(BITMAPFILEHEADER) + sizeof(BITMAPINFOHEADER); // Offset to the pixel data, straight after the two headers.

    fileInfo.biSize = sizeof(BITMAPINFOHEADER);
    fileInfo.biWidth = 640;
    fileInfo.biHeight = 480;
    fileInfo.biPlanes = 1;
    fileInfo.biBitCount = 24;
    fileInfo.biCompression = BI_RGB;
    fileInfo.biSizeImage = 640 * 480 * (24 / 8);
    fileInfo.biXPelsPerMeter = 2400;
    fileInfo.biYPelsPerMeter = 2400;
    fileInfo.biClrImportant = 0;
    fileInfo.biClrUsed = 0;

    WriteFile(file, &fileHeader, sizeof(fileHeader), writePosn, NULL);
    WriteFile(file, &fileInfo, sizeof(fileInfo), writePosn, NULL);

    return file;
}

So that was all fun but it hasn’t gotten me much closer to having an H.264 stream ready for bundling into my RTP packets. Getting the H.264 stream will be my next focus. I think I’ll try capturing it to an .mp4 file as a first step. Actually I wonder if there’s a way I can test an .mp4 file with a softphone and VLC? That would be a handy way to test if the H.264 stream I get is actually going to work when I use it in a VoIP call.

I also ordered Developing Microsoft Media Foundation Applications from Amazon thinking it might help me on this journey, only to find it available for free online a couple of days later :(.

Got the video device enumeration code working, at least well enough for it to tell me my webcam is a Logitech QuickCam Pro 9000. The working code is below.

#include "stdafx.h"
#include <mfapi.h>
#include <mfplay.h>
#include "common.h"

HRESULT CreateVideoDeviceSource(IMFMediaSource **ppSource);

int _tmain(int argc, _TCHAR* argv[])
{
    printf("Get webcam properties test console.\n");

    CoInitializeEx(NULL, COINIT_APARTMENTTHREADED | COINIT_DISABLE_OLE1DDE);

    IMFMediaSource *ppSource = NULL;

    CreateVideoDeviceSource(&ppSource);

    getchar();

    return 0;
}

HRESULT CreateVideoDeviceSource(IMFMediaSource **ppSource)
{
    *ppSource = NULL;

    UINT32 count = 0;

    IMFAttributes *pConfig = NULL;
    IMFActivate **ppDevices = NULL;

    // Create an attribute store to hold the search criteria.
    HRESULT hr = MFCreateAttributes(&pConfig, 1);

    // Request video capture devices.
    if (SUCCEEDED(hr))
    {
        hr = pConfig->SetGUID(
            MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE,
            MF_DEVSOURCE_ATTRIBUTE_SOURCE_TYPE_VIDCAP_GUID
            );
    }

    // Enumerate the devices.
    if (SUCCEEDED(hr))
    {
        hr = MFEnumDeviceSources(pConfig, &ppDevices, &count);
    }

    printf("Device Count: %i.\n", count);

    // Create a media source for the first device in the list.
    if (SUCCEEDED(hr))
    {
        if (count > 0)
        {
            hr = ppDevices[0]->ActivateObject(IID_PPV_ARGS(ppSource));

            if (SUCCEEDED(hr))
            {
                WCHAR *szFriendlyName = NULL;

                // Try to get the display name.
                UINT32 cchName;
                hr = ppDevices[0]->GetAllocatedString(
                    MF_DEVSOURCE_ATTRIBUTE_FRIENDLY_NAME,
                    &szFriendlyName, &cchName);

                if (SUCCEEDED(hr))
                {
                    wprintf(L"Device Name: %s.\n", szFriendlyName);
                }
                else
                {
                    printf("Error getting device attribute.");
                }

                CoTaskMemFree(szFriendlyName);
            }
        }
        else
        {
            hr = MF_E_NOT_FOUND;
        }
    }

    for (DWORD i = 0; i < count; i++)
    {
        ppDevices[i]->Release();
    }
    CoTaskMemFree(ppDevices);
    return hr;
}

The problem I had previously was a missing call to CoInitializeEx. It seems it’s needed to initialise things to allow the use of COM libraries.

The next step is now to work out how to get a sample from the webcam.

This is the first post in what will hopefully be a successful series of posts detailing how I manage to build a video capable softphone using Windows Media Foundation.

I’d never heard of Media Foundation until last week so not only do I not know how to use it, I also don’t know if it will be suitable for the task. I do know it is the successor to the Windows DirectShow API but does not yet provide the same coverage so I may have to delve into the DirectShow API as well. On top of that neither API has a comprehensive managed .Net interface so that means the job needs to be done in C++. My C++ skills are severely undernourished so I’m expecting it to take a while to get up to speed before I can start really diving into the APIs.

What I do have is a basic working softphone that I can build on which means I can focus on the video side of things. My goal is to be able to place a SIP H.264 video call with my webcam to another video softphone, such as Counterpath’s Bria. Given other things going on at the moment, such as a 7 week old baby and a 3 year old, I’m estimating the project could take 2 to 3 months. As to why I’m interested in this it’s because it’s something different from both .Net and SIP. I’ve been working with both those for a long time so taking a break and playing with something different but still related is appealing.

Enough chit chat, getting started…

1. The first thing I’ve done is to install the Windows SDK for Windows 7 and take a look at the Media Foundation sample projects. The first sample I tried was the SimpleCapture project and it ran fine out of the box.

2. After looking through a few more of the samples I feel the need to get coding. Being able to get a video stream from my webcam is the obvious place to start. I’ve created a C++ Win32 console application and found an article which discusses enumerating the system’s video capture devices. I haven’t gotten very far as yet but I’m now wondering if my Logitech Webcam Pro 9000 driver supports H.264 meaning I wouldn’t need to use any of the Media Foundation H.264 codec capabilities? A quick look at the camera’s specification page and I’m pretty sure the answer is no.

3. I’ve now got the sample compiling and running but the count of my video devices is coming back as 0 🙁 so I’ve probably got some flags wrong somewhere.

The first couple of hours hasn’t got me very far yet. More tomorrow.