Kinect in Motion – An Overview

Motion control computing has been establishing itself as one of the most relevant techniques for designing and implementing a Natural User Interface (NUI).

NUIs are human-machine interfaces that enable the user to interact in a natural way with software systems. The goals of NUIs are to be natural and intuitive. NUIs are built on the following two main principles:

  • The NUI has to be imperceptible, thanks to its intuitive characteristics. A sensor that captures our gestures, a microphone that captures our voice, and a touch screen that captures our hands' movements are all imperceptible to us because their use is intuitive. The interface does not distract us from the core functionality of our software system.
  • The NUI is based on nature or natural elements. The slide gesture, the touch, body movements, and voice commands are all natural actions that do not divert us from our normal behavior.

NUIs are becoming crucial for making software solutions more accessible to users. Programming a NUI is therefore an important skill today, and the discipline will continue to evolve in the future.

Kinect embraces the NUI principles and provides a powerful multimodal interface to the user. We can interact with complex software applications and video games simply by using our voice and our natural gestures. Kinect can detect our body position, the velocity of our movements, and our voice commands. It can also detect the position of objects.

Microsoft started to develop Kinect in 2006 as a secret project within the Xbox division, intended as a competitive Wii killer. In 2008, Microsoft started Project Natal, named after the Brazilian hometown of Alex Kipman, Microsoft's General Manager of Incubation. The project's goal was to develop a device including depth recognition, motion tracking, facial recognition, and speech recognition based on the video recognition technology developed by PrimeSense.

Kinect for Xbox was launched in November 2010, and its launch was a success: it was, and still is, a breakthrough in the gaming world, and it holds the Guinness World Record for being the "fastest selling consumer electronics device", ahead of the iPhone and the iPad.

In December 2010, PrimeSense released a set of open source drivers and APIs for Kinect that enabled software developers to develop Windows applications using the Kinect sensor.

Finally, on June 16, 2011, Microsoft launched the Kinect SDK beta, a set of libraries and APIs that enable us to design and develop software applications on Microsoft platforms using the Kinect sensor as a multimodal interface.

With the launch of the Kinect for Windows device and the Kinect SDK, motion control computing is now a discipline that we can shape in our garages, writing simple and powerful software applications ourselves.

This article is written for all of us who want to develop market-ready software applications with Kinect for Windows: applications that track audio and video and provide motion control based on NUI. Kinect established itself in a very short span of time, so there is a need to consolidate all the technical resources and develop them in an appropriate way. This is our zero-to-hero Kinect in motion journey, and this is what this book is about.

The aim of this article is to understand the steps for capturing data from the color stream, the depth stream, and the IR stream. The key learning tools and steps for mastering these streams are:

  • Color camera: data stream, event-driven and polling techniques to manage color frames, image editing, color image tuning, and color image formats
  • Depth image: data stream, depth image ranges, and mapping between color image and depth image
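To make the difference between the event-driven and the polling technique more concrete, here is a minimal sketch for the color stream. It assumes a project that references the Microsoft.Kinect assembly and a sensor object obtained elsewhere; the class and method names (ColorStreamSketch, UseEvents, PollOnce) are hypothetical, and the 640x480 RGB format is only one possible choice.

```csharp
using Microsoft.Kinect;

// Hypothetical helper class, for illustration only.
static class ColorStreamSketch
{
    // Event-driven technique: the SDK notifies us whenever a color frame is ready.
    public static void UseEvents(KinectSensor sensor)
    {
        sensor.ColorStream.Enable(ColorImageFormat.RgbResolution640x480Fps30);
        sensor.ColorFrameReady += OnColorFrameReady;
        sensor.Start();
    }

    static void OnColorFrameReady(object sender, ColorImageFrameReadyEventArgs e)
    {
        using (ColorImageFrame frame = e.OpenColorImageFrame())
        {
            if (frame == null) return; // the frame is no longer available

            byte[] pixels = new byte[frame.PixelDataLength];
            frame.CopyPixelDataTo(pixels); // raw pixel bytes, ready for display or editing
        }
    }

    // Polling technique: we ask for the next frame ourselves and wait up to a timeout.
    // Assumes the color stream is already enabled and the sensor already started.
    public static byte[] PollOnce(KinectSensor sensor, int timeoutMilliseconds)
    {
        using (ColorImageFrame frame = sensor.ColorStream.OpenNextFrame(timeoutMilliseconds))
        {
            if (frame == null) return null; // no frame arrived within the timeout

            byte[] pixels = new byte[frame.PixelDataLength];
            frame.CopyPixelDataTo(pixels);
            return pixels;
        }
    }
}
```

The two techniques are alternatives: an application would normally pick one of them per stream rather than combine both. The event-driven form suits UI applications that simply display frames, while polling gives us control over when a frame is requested.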

All the examples we will develop in this book are built on Visual Studio 2010 or 2012. In this introduction, we want to include the key steps for getting started.

From Visual Studio, select File | New | Project. In the New Project window, do the following:

  1. Select the WPF Application Visual C# template.
  2. Select the .NET Framework 4.0 as the framework for the project (.NET Framework 4.5 works too).
  3. Assign a name to the project.
  4. Choose a location for the project.
  5. Leave all the other settings with the default value.
  6. Click on the OK button.

In the Solution Explorer window, locate the References node of the project. Right-click on References and select Add Reference to invoke the Reference Manager window. Select the Microsoft.Kinect assembly (the version installed with the SDK) and click on the OK button.
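With the reference in place, a first step in code is usually to find a connected sensor and start it. The following is a minimal sketch under that assumption; the class and method names (SensorBootstrap, StartFirstConnectedSensor) are hypothetical and not taken from the book.

```csharp
using System.Linq;
using Microsoft.Kinect;

// Hypothetical helper class, for illustration only.
static class SensorBootstrap
{
    // Returns the first connected sensor after starting it,
    // or null when no sensor is plugged in and ready.
    public static KinectSensor StartFirstConnectedSensor()
    {
        KinectSensor sensor = KinectSensor.KinectSensors
            .FirstOrDefault(s => s.Status == KinectStatus.Connected);

        if (sensor == null)
        {
            return null;
        }

        // Start can throw an IOException when another process already uses the sensor.
        sensor.Start();
        return sensor;
    }
}
```

In a WPF application we could call this method from the window's Loaded event handler and call Stop on the returned sensor when the window closes.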

Skeletal tracking allows applications to recognize people and follow their actions. Skeletal tracking combined with gesture-based programming enables applications to provide a natural interface and increases the usability and ease of use of the application itself.

In this article we will learn how to enable and handle the skeleton data stream. For instance, we will address the following:

  • Tracking users by analyzing the skeleton data streamed by Kinect and mapping them to the color stream
  • Understanding what joints are and which joints are tracked in the near and seated modes
  • Observing the movements of the tracked users to detect simple actions

Mastering the skeleton data stream enables us to implement applications that track the user's actions and recognize the user's gestures.
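As a rough illustration of enabling and handling the skeleton data stream, the sketch below enables the stream, optionally switches to seated mode, and maps the head joint of each tracked user to a pixel in the color image. It assumes Kinect SDK v1.6 or later (for CoordinateMapper) and a color stream running at 640x480; the names SkeletonStreamSketch, Enable, and OnSkeletonFrameReady are hypothetical.

```csharp
using System.Diagnostics;
using Microsoft.Kinect;

// Hypothetical helper class, for illustration only.
static class SkeletonStreamSketch
{
    static KinectSensor sensor;
    static Skeleton[] skeletons = new Skeleton[0];

    public static void Enable(KinectSensor kinect, bool seated)
    {
        sensor = kinect;
        sensor.SkeletonStream.Enable();
        // Seated mode tracks only the upper-body joints.
        sensor.SkeletonStream.TrackingMode =
            seated ? SkeletonTrackingMode.Seated : SkeletonTrackingMode.Default;
        sensor.SkeletonFrameReady += OnSkeletonFrameReady;
    }

    static void OnSkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
    {
        using (SkeletonFrame frame = e.OpenSkeletonFrame())
        {
            if (frame == null) return; // the frame is no longer available

            if (skeletons.Length != frame.SkeletonArrayLength)
            {
                skeletons = new Skeleton[frame.SkeletonArrayLength];
            }
            frame.CopySkeletonDataTo(skeletons);
        }

        foreach (Skeleton skeleton in skeletons)
        {
            if (skeleton.TrackingState != SkeletonTrackingState.Tracked) continue;

            Joint head = skeleton.Joints[JointType.Head];
            if (head.TrackingState == JointTrackingState.NotTracked) continue;

            // Map the 3D joint position to a pixel of the (assumed) 640x480 color image.
            ColorImagePoint point = sensor.CoordinateMapper.MapSkeletonPointToColorPoint(
                head.Position, ColorImageFormat.RgbResolution640x480Fps30);

            Debug.WriteLine(string.Format("Head of user {0} at pixel ({1}, {2})",
                skeleton.TrackingId, point.X, point.Y));
        }
    }
}
```

Comparing joint positions across consecutive frames is then the basis for detecting simple actions.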

The Kinect sensor, thanks to the IR camera, can recognize up to six users in its field of view. Of these, only up to two users can be fully tracked, while the others are tracked from a single point only.
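In code, the difference between fully tracked users and users tracked from a single point shows up in the tracking state of each skeleton. The sketch below counts both groups in the skeleton array copied from one frame; the class and method names (TrackingSummary, Count) are hypothetical.

```csharp
using System.Linq;
using Microsoft.Kinect;

// Hypothetical helper class, for illustration only.
static class TrackingSummary
{
    // Counts the users in one frame's skeleton array by tracking state.
    public static void Count(Skeleton[] skeletons, out int fullyTracked, out int positionOnly)
    {
        // Fully tracked users expose the position of their joints.
        fullyTracked = skeletons.Count(
            s => s.TrackingState == SkeletonTrackingState.Tracked);

        // Position-only users expose a single point (Skeleton.Position).
        positionOnly = skeletons.Count(
            s => s.TrackingState == SkeletonTrackingState.PositionOnly);
    }
}
```

Given the limits described above, we would expect fullyTracked to be at most two and the sum of both counters to be at most six.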

In this article we will explain how to use the Kinect sensor's speech recognition capability as an additional natural interface modality in our applications. Speech recognition is a powerful interface that increases the adoption of software solutions by users with disabilities. Speech recognition can also be used in working environments where users perform their job or task away from a traditional workstation.

The Microsoft Kinect SDK setup process includes the installation of the speech recognition components.

The Kinect sensor is equipped with an array of four microphones.

The array of microphones can be handled using the code libraries released by Microsoft since Windows Vista. These libraries include Voice Capture DirectX Media Object (DMO) and the Speech Recognition API (SAPI).

In managed code, Kinect SDK v1.6 provides a wrapper extending the Voice Capture DMO. Thanks to the Voice Capture DMO, Kinect provides capabilities such as:

  • Acoustic echo cancellation (AEC)
  • Automatic gain control (AGC)
  • Noise suppression
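As a minimal sketch of how these capabilities surface in managed code, the KinectAudioSource class exposes a property for each of them. The particular values below are assumptions chosen for a speech recognition scenario, and the names AudioSketch and StartCleanAudio are hypothetical.

```csharp
using System.IO;
using Microsoft.Kinect;

// Hypothetical helper class, for illustration only.
static class AudioSketch
{
    // Configures the microphone array and returns the captured audio stream.
    // Assumes the sensor has already been started.
    public static Stream StartCleanAudio(KinectSensor sensor)
    {
        KinectAudioSource audio = sensor.AudioSource;

        audio.EchoCancellationMode = EchoCancellationMode.CancellationOnly; // AEC
        audio.AutomaticGainControlEnabled = false; // AGC, switched off here (an assumption for speech input)
        audio.NoiseSuppression = true;             // noise suppression

        return audio.Start();
    }
}
```

The returned stream delivers the audio captured by the microphone array and can be read like any other .NET stream.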

The Speech Recognition API is the development library that allows us to use the built-in speech recognition capabilities of the operating system while developing our custom application. These APIs can be used with or without the Kinect sensor and its SDK.
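To give an idea of what this looks like in managed code, here is a minimal sketch that listens for a small set of voice commands on an arbitrary audio stream, which may or may not come from the Kinect sensor. It assumes a reference to the Microsoft.Speech assembly (System.Speech exposes classes with the same names), an installed en-US recognizer, and 16 kHz, 16-bit, mono PCM input; the names SpeechSketch and Listen are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.Speech.AudioFormat;
using Microsoft.Speech.Recognition;

// Hypothetical helper class, for illustration only.
static class SpeechSketch
{
    // Starts continuous recognition of the given commands on the given audio stream.
    public static SpeechRecognitionEngine Listen(Stream audio, params string[] commands)
    {
        // Assumption: an en-US recognizer is installed on the machine.
        RecognizerInfo recognizer = SpeechRecognitionEngine.InstalledRecognizers()
            .FirstOrDefault(r => r.Culture.Name == "en-US");
        if (recognizer == null)
        {
            throw new InvalidOperationException("No en-US speech recognizer is installed.");
        }

        var engine = new SpeechRecognitionEngine(recognizer.Id);

        // Build a grammar that accepts exactly one of the given commands.
        var builder = new GrammarBuilder { Culture = recognizer.Culture };
        builder.Append(new Choices(commands));
        engine.LoadGrammar(new Grammar(builder));

        engine.SpeechRecognized += (sender, e) =>
            Console.WriteLine("{0} (confidence {1:F2})", e.Result.Text, e.Result.Confidence);

        // Assumed input format: PCM, 16 kHz, 16 bits per sample, one channel.
        engine.SetInputToAudioStream(audio,
            new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
        engine.RecognizeAsync(RecognizeMode.Multiple);
        return engine;
    }
}
```

A call such as SpeechSketch.Listen(stream, "start", "stop") would then report every recognized command until RecognizeAsyncStop is called on the returned engine.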

We walked through the journey of how to manage and master all the data streamed out from the Kinect sensor, starting from managing the depth and color streams to implementing NUI-enabled applications based on gestures and speech recognition.

While implementing the proposed examples, we have been standing up, walking in our room or office, and letting our colleagues or friends wonder what we were doing!

Of course, we would never want to discourage physical exercise or talking to our Kinect sensor, but there are scenarios where we need to stay close to the keyboard. For instance, when things go wrong and we long for a passionate look through the source code flow (does that sound like a romantic way to explain debugging?). Moving back and forth from our keyboard limits our ability to spot issues. What about when we have to process the same stream of data over and over again, or, in a development team scenario, when we have to unit test the application in a repetitive manner?

In this article we will learn how we can save time coding and testing Kinect-enabled applications by:

  • Recording all the video data coming into an application from a Kinect sensor with Kinect Studio
  • Injecting the recorded video into an application, allowing us to test our code without getting out of our chair over and over again
  • Saving and playing back voice commands with a simple custom tool to verify the quality of our application's speech recognition capabilities
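As a rough idea of what saving voice commands could look like, the sketch below records a few seconds of raw audio from the sensor into a file. It is not the tool from the book: the names VoiceRecorderSketch and Record are hypothetical, the file contains headerless PCM data rather than a WAV file, and the audio is assumed to be 16 kHz, 16-bit, mono.

```csharp
using System;
using System.IO;
using Microsoft.Kinect;

// Hypothetical helper class, for illustration only.
static class VoiceRecorderSketch
{
    // Assumed audio format: 16,000 samples per second, 2 bytes per sample, one channel.
    const int BytesPerSecond = 16000 * 2;

    // Records the given number of seconds of raw PCM audio into a file.
    // Assumes the sensor has already been started.
    public static void Record(KinectSensor sensor, string path, int seconds)
    {
        byte[] buffer = new byte[4096];
        int remaining = seconds * BytesPerSecond;

        Stream audio = sensor.AudioSource.Start();
        try
        {
            using (FileStream file = File.Create(path))
            {
                while (remaining > 0)
                {
                    int read = audio.Read(buffer, 0, Math.Min(buffer.Length, remaining));
                    if (read <= 0) break; // the stream ended early

                    file.Write(buffer, 0, read);
                    remaining -= read;
                }
            }
        }
        finally
        {
            sensor.AudioSource.Stop();
        }
    }
}
```

A file recorded this way can later be opened with File.OpenRead and fed to any component that accepts a stream of the same PCM format, which lets us replay the same voice command as often as a test requires.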


In this article, we touched upon the essential concepts of Kinect and its varied operations.

You've been reading an excerpt of:

Kinect in Motion – Audio and Visual Tracking by Example
