Marker-based Augmented Reality on iPhone or iPad

This is the definitive advanced tutorial for OpenCV, designed for those with basic C++ skills. The computer vision projects are divided into easily assimilated chapters with an emphasis on practical involvement for an easier learning curve.

(For more resources related to this topic, see here.)

Creating an iOS project that uses OpenCV

In this section we will create a demo application for iPhone/iPad devices that will use the OpenCV ( Open Source Computer Vision ) library to detect markers in the camera frame and render 3D objects on it. This example will show you how to get access to the raw video data stream from the device camera, perform image processing using the OpenCV library, find a marker in an image, and render an AR overlay.

We will start by first creating a new XCode project by choosing the iOS Single View Application template, as shown in the following screenshot:

Now we have to add OpenCV to our project. This step is necessary because in this application we will use a lot of functions from this library to detect markers and estimate position position.

OpenCV is a library of programming functions for real-time computer vision. It was originally developed by Intel and is now supported by Willow Garage and Itseez. This library is written in C and C++ languages. It also has an official Python binding and unofficial bindings to Java and .NET languages.

Adding OpenCV framework

Fortunately the library is cross-platform, so it can be used on iOS devices. Starting from version 2.4.2, OpenCV library is officially supported on the iOS platform and you can download the distribution package from the library website at The OpenCV for iOS link points to the compressed OpenCV framework. Don't worry if you are new to iOS development; a framework is like a bundle of files. Usually each framework package contains a list of header files and list of statically linked libraries. Application frameworks provide an easy way to distribute precompiled libraries to developers.

Of course, you can build your own libraries from scratch. OpenCV documentation explains this process in detail. For simplicity, we follow the recommended way and use the framework for this article.

After downloading the file we extract its content to the project folder, as shown in the following screenshot:

To inform the XCode IDE to use any framework during the build stage, click on Project options and locate the Build phases tab. From there we can add or remove the list of frameworks involved in the build process. Click on the plus sign to add a new framework, as shown in the following screenshot:

From here we can choose from a list of standard frameworks. But to add a custom framework we should click on the Add other button. The open file dialog box will appear. Point it to opencv2.framework in the project folder as shown in the following screenshot:

Including OpenCV headers

Now that we have added the OpenCV framework to the project, everything is almost done. One last thing—let's add OpenCV headers to the project's precompiled headers. The precompiled headers are a great feature to speed up compilation time. By adding OpenCV headers to them, all your sources automatically include OpenCV headers as well. Find a .pch file in the project source tree and modify it in the following way.

The following code shows how to modify the .pch file in the project source tree:

// // Prefix header for all source files of the 'Example_MarkerBasedAR' // #import <Availability.h> #ifndef __IPHONE_5_0 #warning "This project uses features only available in iOS SDK 5.0 and later." #endif #ifdef __cplusplus #include <opencv2/opencv.hpp> #endif #ifdef __OBJC__ #import <UIKit/UIKit.h> #import <Foundation/Foundation.h> #endif

Now you can call any OpenCV function from any place in your project.

That's all. Our project template is configured and we are ready to move further. Free advice: make a copy of this project; this will save you time when you are creating your next one!

Application architecture

Each iOS application contains at least one instance of the UIViewController interface that handles all view events and manages the application's business logic. This class provides the fundamental view-management model for all iOS apps. A view controller manages a set of views that make up a portion of your app's user interface. As part of the controller layer of your app, a view controller coordinates its efforts with model objects and other controller objects—including other view controllers—so your app presents a single coherent user interface.

The application that we are going to write will have only one view; that's why we choose a Single-View Application template to create one. This view will be used to present the rendered picture. Our ViewController class will contain three major components that each AR application should have (see the next diagram):

  • Video source

  • Processing pipeline

  • Visualization engine

The video source is responsible for providing new frames taken from the built-in camera to the user code. This means that the video source should be capable of choosing a camera device (front- or back-facing camera), adjusting its parameters (such as resolution of the captured video, white balance, and shutter speed), and grabbing frames without freezing the main UI.

The image processing routine will be encapsulated in the MarkerDetector class. This class provides a very thin interface to user code. Usually it's a set of functions like processFrame and getResult. Actually that's all that ViewController should know about. We must not expose low-level data structures and algorithms to the view layer without strong necessity. VisualizationController contains all logic concerned with visualization of the Augmented Reality on our view. VisualizationController is also a facade that hides a particular implementation of the rendering engine. Low code coherence gives us freedom to change these components without the need to rewrite the rest of your code.

Such an approach gives you the freedom to use independent modules on other platforms and compilers as well. For example, you can use the MarkerDetector class easily to develop desktop applications on Mac, Windows, and Linux systems without any changes to the code. Likewise, you can decide to port VisualizationController on the Windows platform and use Direct3D for rendering. In this case you should write only new VisualizationController implementation; other code parts will remain the same.

The main processing routine starts from receiving a new frame from the video source. This triggers video source to inform the user code about this event with a callback. ViewController handles this callback and performs the following operations:

  1. Sends a new frame to the visualization controller.

  2. Performs processing of the new frame using our pipeline.

  3. Sends the detected markers to the visualization stage.

  4. Renders a scene.

Let's examine this routine in detail. The rendering of an AR scene includes the drawing of a background image that has a content of the last received frame; artificial 3D objects are drawn later on. When we send a new frame for visualization, we are copying image data to internal buffers of the rendering engine. This is not actual rendering yet; we are just updating the text with a new bitmap.

The second step is the processing of new frame and marker detection. We pass our image as input and as a result receive a list of the markers detected. on it. These markers are passed to the visualization controller, which knows how to deal with them. Let's take a look at the following sequence diagram where this routine is shown:

We start development by writing a video capture component. This class will be responsible for all frame grabbing and for sending notifications of captured frames via user callback. Later on we will write a marker detection algorithm. This detection routine is the core of your application. In this part of our program we will use a lot of OpenCV functions to process images, detect contours on them, find marker rectangles, and estimate their position. After that we will concentrate on visualization of our results using Augmented Reality. After bringing all these things together we will complete our first AR application. So let's move on!

Accessing the camera

The Augmented Reality application is impossible to create without two major things: video capturing and AR visualization. The video capture stage consists of receiving frames from the device camera, performing necessary color conversion, and sending it to the processing pipeline. As the single frame processing time is so critical to AR applications, the capture process should be as efficient as possible. The best way to achieve maximum performance is to have direct access to the frames received from the camera. This became possible starting from iOS Version 4. Existing APIs from the AVFoundation framework provide the necessary functionality to read directly from image buffers in memory.

You can find a lot of examples that use the AVCaptureVideoPreviewLayer class and the UIGetScreenImage function to capture videos from the camera. This technique was used for iOS Version 3 and earlier. It has now become outdated and has two major disadvantages:

  • Lack of direct access to frame data. To get a bitmap, you have to create an intermediate instance of UIImage, copy an image to it, and get it back. For AR applications this price is too high, because each millisecond matters. Losing a few frames per second (FPS) significantly decreases overall user experience.

  • To draw an AR, you have to add a transparent overlay view that will present the AR. Referring to Apple guidelines, you should avoid non-opaque layers because their blending is hard for mobile processors.

Classes AVCaptureDevice and AVCaptureVideoDataOutput allow you to configure, capture, and specify unprocessed video frames in 32 bpp BGRA format. Also you can set up the desired resolution of output frames. However, it does affect overall performance since the larger the frame the more processing time and memory is required.

There is a good alternative for high-performance video capture. The AVFoundation API offers a much faster and more elegant way to grab frames directly from the camera. But first, let's take a look at the following figure where the capturing process for iOS is shown:

AVCaptureSession is a root capture object that we should create. Capture session requires two components—an input and an output. The input device can either be a physical device (camera) or a video file (not shown in diagram). In our case it's a built-in camera (front or back). The output device can be presented by one of the following interfaces:

  • AVCaptureMovieFileOutput

  • AVCaptureStillImageOutput

  • AVCaptureVideoPreviewLayer

  • AVCaptureVideoDataOutput

The AVCaptureMovieFileOutput interface is used to record video to the file, the AVCaptureStillImageOutput interface is used to to make still images, and the AVCaptureVideoPreviewLayer interface is used to play a video preview on the screen. We are interested in the AVCaptureVideoDataOutput interface because it gives you direct access to video data.

The iOS platform is built on top of the Objective-C programming language. So to work with AVFoundation framework, our class also has to be written in Objective-C. In this section all code listings are in the Objective-C++ language.

To encapsulate the video capturing process, we create the VideoSource interface as shown by the following code:

@protocol VideoSourceDelegate<NSObject> -(void)frameReady:(BGRAVideoFrame) frame; @end @interface VideoSource : NSObject<AVCaptureVideoDataOutputSampleBuffe rDelegate> { } @property (nonatomic, retain) AVCaptureSession *captureSession; @property (nonatomic, retain) AVCaptureDeviceInput *deviceInput; @property (nonatomic, retain) id<VideoSourceDelegate> delegate; - (bool) startWithDevicePosition:(AVCaptureDevicePosition) devicePosition; - (CameraCalibration) getCalibration; - (CGSize) getFrameSize; @end

In this callback we lock the image buffer to prevent modifications by any new frames, obtain a pointer to the image data and frame dimensions. Then we construct temporary BGRAVideoFrame object that is passed to outside via special delegate. This delegate has following prototype:

@protocol VideoSourceDelegate<NSObject> -(void)frameReady:(BGRAVideoFrame) frame; @end

Within VideoSourceDelegate, the VideoSource interface informs the user code that a new frame is available.

The step-by-step guide for the initialization of video capture is listed as follows:

  1. Create an instance of AVCaptureSession and set the capture session quality preset.

  2. Choose and create AVCaptureDevice. You can choose the front- or backfacing camera or use the default one.

  3. Initialize AVCaptureDeviceInput using the created capture device and add it to the capture session.

  4. Create an instance of AVCaptureVideoDataOutput and initialize it with format of video frame, callback delegate, and dispatch the queue.

  5. Add the capture output to the capture session object.

  6. Start the capture session.

Let's explain some of these steps in more detail. After creating the capture session, we can specify the desired quality preset to ensure that we will obtain optimal performance. We don't need to process HD-quality video, so 640 x 480 or an even lesser frame resolution is a good choice:

- (id)init { if ((self = [super init])) { AVCaptureSession * capSession = [[AVCaptureSession alloc] init]; if ([capSession canSetSessionPreset:AVCaptureSessionPreset64 0x480]) { [capSession setSessionPreset:AVCaptureSessionPreset640x480]; NSLog(@"Set capture session preset AVCaptureSessionPreset640x480"); } else if ([capSession canSetSessionPreset:AVCaptureSessionPresetL ow]) { [capSession setSessionPreset:AVCaptureSessionPresetLow]; NSLog(@"Set capture session preset AVCaptureSessionPresetLow"); } self.captureSession = capSession; } return self; }

Always check hardware capabilities using the appropriate API; there is no guarantee that every camera will be capable of setting a particular session preset.

After creating the capture session, we should add the capture input—the instance of AVCaptureDeviceInput will represent a physical camera device. The cameraWithPosition function is a helper function that returns the camera device for the requested position (front, back, or default):

- (bool) startWithDevicePosition:(AVCaptureDevicePosition) devicePosition { AVCaptureDevice *videoDevice = [self cameraWithPosition:devicePosit ion]; if (!videoDevice) return FALSE; { NSError *error; AVCaptureDeviceInput *videoIn = [AVCaptureDeviceInput deviceInputWithDevice:videoDevice error:&error]; self.deviceInput = videoIn; if (!error) { if ([[self captureSession] canAddInput:videoIn]) { [[self captureSession] addInput:videoIn]; } else { NSLog(@"Couldn't add video input"); return FALSE; } } else { NSLog(@"Couldn't create video input"); return FALSE; } } [self addRawViewOutput]; [captureSession startRunning]; return TRUE; }

Please notice the error handling code. Take care of return values for such an important thing as working with hardware setup is a good practice. Without this, your code can crash in unexpected cases without informing the user what has happened.

We created a capture session and added a source of the video frames. Now it's time to add a receiver—an object that will receive actual frame data. The AVCaptureVideoDataOutput class is used to process uncompressed frames from the video stream. The camera can provide frames in BGRA, CMYK, or simple grayscale color models. For our purposes the BGRA color model fits best of all, as we will use this frame for visualization and image processing. The following code shows the addRawViewOutput function:

- (void) addRawViewOutput { /*We setupt the output*/ AVCaptureVideoDataOutput *captureOutput = [[AVCaptureVideoDataOutput alloc] init]; /*While a frame is processes in -captureOutput:didOutputSampleBuff er:fromConnection: delegate methods no other frames are added in the queue. If you don't want this behaviour set the property to NO */ captureOutput.alwaysDiscardsLateVideoFrames = YES; /*We create a serial queue to handle the processing of our frames*/ dispatch_queue_t queue; queue = dispatch_queue_create("com.Example_MarkerBasedAR. cameraQueue", NULL); [captureOutput setSampleBufferDelegate:self queue:queue]; dispatch_release(queue); // Set the video output to store frame in BGRA (It is supposed to be faster) NSString* key = (NSString*)kCVPixelBufferPixelFormatTypeKey; NSNumber* value = [NSNumber numberWithUnsignedInt:kCVPixelFormatType_32BGRA]; NSDictionary* videoSettings = [NSDictionary dictionaryWithObject:value forKey:key]; [captureOutput setVideoSettings:videoSettings]; // Register an output [self.captureSession addOutput:captureOutput]; }

Now the capture session is finally configured. When started, it will capture frames from the camera and send it to user code. When the new frame is available, an AVCaptureSession object performs a captureOutput: didOutputSampleBuffer:fromConnection callback. In this function, we will perform a minor data conversion operation to get the image data in a more usable format and pass it to user code:

- (void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection { // Get a image buffer holding video frame CVImageBufferRef imageBuffer = CMSampleBufferGetImageBuffer (sampleB uffer); // Lock the image buffer CVPixelBufferLockBaseAddress(imageBuffer,0); // Get information about the image uint8_t *baseAddress = (uint8_t *)CVPixelBufferGetBaseAddress(image Buffer); size_t width = CVPixelBufferGetWidth(imageBuffer); size_t height = CVPixelBufferGetHeight(imageBuffer); size_t stride = CVPixelBufferGetBytesPerRow(imageBuffer); BGRAVideoFrame frame = {width, height, stride, baseAddress}; [delegate frameReady:frame]; /*We unlock the image buffer*/ CVPixelBufferUnlockBaseAddress(imageBuffer,0); }

We obtain a reference to the image buffer that stores our frame data. Then we lock it to prevent modifications by new frames. Now we have exclusive access to the frame data. With help of the CoreVideo API, we get the image dimensions, stride (number of pixels per row), and the pointer to the beginning of the image data.

I draw your attention to the CVPixelBufferLockBaseAddress/ CVPixelBufferUnlockBaseAddress function call in the callback code. Until we hold a lock on the pixel buffer, it guarantees consistency and correctness of its data. Reading of pixels is available only after you have obtained a lock. When you're done, don't forget to unlock it to allow the OS to fill it with new data.

Marker detection

A marker is usually designed as a rectangle image holding black and white areas inside it. Due to known limitations, the marker detection procedure is a simple one. First of all we need to find closed contours on the input image and unwarp the image inside it to a rectangle and then check this against our marker model.

In this sample the 5 x 5 marker will be used. Here is what it looks like:

In the sample project that you will find in this book, the marker detection routine is encapsulated in the MarkerDetector class:

/** * A top-level class that encapsulate marker detector algorithm */ class MarkerDetector { public: /** * Initialize a new instance of marker detector object * @calibration[in] - Camera calibration necessary for pose estimation. */ MarkerDetector(CameraCalibration calibration); void processFrame(const BGRAVideoFrame& frame); const std::vector<Transformation>& getTransformations() const; protected: bool findMarkers(const BGRAVideoFrame& frame, std::vector<Marker>& detectedMarkers); void prepareImage(const cv::Mat& bgraMat, cv::Mat& grayscale); void performThreshold(const cv::Mat& grayscale, cv::Mat& thresholdImg); void findContours(const cv::Mat& thresholdImg, std::vector<std::vector<cv::Point> >& contours, int minContourPointsAllowed); void findMarkerCandidates(const std::vector<std::vector<cv::Point> >& contours, std::vector<Marker>& detectedMarkers); void detectMarkers(const cv::Mat& grayscale, std::vector<Marker>& detectedMarkers); void estimatePosition(std::vector<Marker>& detectedMarkers); private: };

To help you better understand the marker detection routine, a step-by-step processing on one frame from a video will be shown. A source image taken from an iPad camera will be used as an example:

Marker identification

Here is the workflow of the marker detection routine:

  1. Convert the input image to grayscale.

  2. Perform binary threshold operation.

  3. Detect contours.

  4. Search for possible markers.

  5. Detect and decode markers.

  6. Estimate marker 3D pose.

Grayscale conversion

The conversion to grayscale is necessary because markers usually contain only black and white blocks and it's much easier to operate with them on grayscale images. Fortunately, OpenCV color conversion is simple enough.

Please take a look at the following code listing in C++:

void MarkerDetector::prepareImage(const cv::Mat& bgraMat, cv::Mat& grayscale) { // Convert to grayscale cv::cvtColor(bgraMat, grayscale, CV_BGRA2GRAY); }

This function will convert the input BGRA image to grayscale (it will allocate image buffers if necessary) and place the result into the second argument. All further steps will be performed with the grayscale image.

Image binarization

The binarization operation will transform each pixel of our image to black (zero intensity) or white (full intensity). This step is required to find contours. There are several threshold methods; each has strong and weak sides. The easiest and fastest method is absolute threshold. In this method the resulting value depends on current pixel intensity and some threshold value. If pixel intensity is greater than the threshold value, the result will be white (255); otherwise it will be black (0).

This method has a huge disadvantage—it depends on lighting conditions and soft intensity changes. The more preferable method is the adaptive threshold. The major difference of this method is the use of all pixels in given radius around the examined pixel. Using average intensity gives good results and secures more robust corner detection.

The following code snippet shows the MarkerDetector function:

void MarkerDetector::performThreshold(const cv::Mat& grayscale, cv::Mat& thresholdImg) { cv::adaptiveThreshold(grayscale, // Input image thresholdImg,// Result binary image 255, // cv::ADAPTIVE_THRESH_GAUSSIAN_C, // cv::THRESH_BINARY_INV, // 7, // 7 // ); }

After applying adaptive threshold to the input image, the resulting image looks similar to the following one:

Each marker usually looks like a square figure with black and white areas inside it. So the best way to locate a marker is to find closed contours and approximate them with polygons of 4 vertices.

Contours detection

The cv::findCountours function will detect contours on the input binary image:

void MarkerDetector::findContours(const cv::Mat& thresholdImg, std::vector<std::vector<cv::Point> >& contours, int minContourPointsAllowed) { std::vector< std::vector<cv::Point> > allContours; cv::findContours(thresholdImg, allContours, CV_RETR_LIST, CV_ CHAIN_APPROX_NONE); contours.clear(); for (size_t i=0; i<allContours.size(); i++) { int contourSize = allContours[i].size(); if (contourSize > minContourPointsAllowed) { contours.push_back(allContours[i]); } } }

The return value of this function is a list of polygons where each polygon represents a single contour. The function skips contours that have their perimeter in pixels value set to be less than the value of the minContourPointsAllowed variable. This is because we are not interested in small contours. (They will probably contain no marker, or the contour won't be able to be detected due to a small marker size.)

The following figure shows the visualization of detected contours:

Candidates search

After finding contours, the polygon approximation stage is performed. This is done to decrease the number of points that describe the contour shape. It's a good quality check to filter out areas without markers because they can always be represented with a polygon that contains four vertices. If the approximated polygon has more than or fewer than 4 vertices, it's definitely not what we are looking for. The following code implements this idea:

void MarkerDetector::findCandidates ( const ContoursVector& contours, std::vector<Marker>& detectedMarkers ) { std::vector<cv::Point> approxCurve; std::vector<Marker> possibleMarkers; // For each contour, analyze if it is a parallelepiped likely to be the marker for (size_t i=0; i<contours.size(); i++) { // Approximate to a polygon double eps = contours[i].size() * 0.05; cv::approxPolyDP(contours[i], approxCurve, eps, true); // We interested only in polygons that contains only four points if (approxCurve.size() != 4) continue; // And they have to be convex if (!cv::isContourConvex(approxCurve)) continue; // Ensure that the distance between consecutive points is large enough float minDist = std::numeric_limits<float>::max(); for (int i = 0; i < 4; i++) { cv::Point side = approxCurve[i] - approxCurve[(i+1)%4]; float squaredSideLength =; minDist = std::min(minDist, squaredSideLength); } // Check that distance is not very small if (minDist < m_minContourLengthAllowed) continue; // All tests are passed. Save marker candidate: Marker m; for (int i = 0; i<4; i++) m.points.push_back( cv::Point2f(approxCurve[i].x,approxCu rve[i].y) ); // Sort the points in anti-clockwise order // Trace a line between the first and second point. // If the third point is at the right side, then the points are anticlockwise cv::Point v1 = m.points[1] - m.points[0]; cv::Point v2 = m.points[2] - m.points[0]; double o = (v1.x * v2.y) - (v1.y * v2.x); if (o < 0.0) //if the third point is in the left side, then sort in anti-clockwise order std::swap(m.points[1], m.points[3]); possibleMarkers.push_back(m); } // Remove these elements which corners are too close to each other. // First detect candidates for removal: std::vector< std::pair<int,int> > tooNearCandidates; for (size_t i=0;i<possibleMarkers.size();i++) { const Marker& m1 = possibleMarkers[i]; //calculate the average distance of each corner to the nearest corner of the other marker candidate for (size_t j=i+1;j<possibleMarkers.size();j++) { const Marker& m2 = possibleMarkers[j]; float distSquared = 0; for (int c = 0; c < 4; c++) { cv::Point v = m1.points[c] - m2.points[c]; distSquared +=; } distSquared /= 4; if (distSquared < 100) { tooNearCandidates.push_back(std::pair<int,int>(i,j)); } } } // Mark for removal the element of the pair with smaller perimeter std::vector<bool> removalMask (possibleMarkers.size(), false); for (size_t i=0; i<tooNearCandidates.size(); i++) { float p1 = perimeter(possibleMarkers[tooNearCandidates[i]. first ].points); float p2 = perimeter(possibleMarkers[tooNearCandidates[i].second]. points); size_t removalIndex; if (p1 > p2) removalIndex = tooNearCandidates[i].second; else removalIndex = tooNearCandidates[i].first; removalMask[removalIndex] = true; } // Return candidates detectedMarkers.clear(); for (size_t i=0;i<possibleMarkers.size();i++) { if (!removalMask[i]) detectedMarkers.push_back(possibleMarkers[i]); } }

Now we have obtained a list of parallelepipeds that are likely to be the markers. To verify whether they are markers or not, we need to perform three steps:

  1. First, we should remove the perspective projection so as to obtain a frontal view of the rectangle area.

  2. Then we perform thresholding of the image using the Otsu algorithm. This algorithm assumes a bimodal distribution and finds the threshold value that maximizes the extra-class variance while keeping a low intra-class variance.

  3. Finally we perform identification of the marker code. If it is a marker, it has an internal code. The marker is divided into a 7 x 7 grid, of which the internal 5 x 5 cells contain ID information. The rest correspond to the external black border. Here, we first check whether the external black border is present. Then we read the internal 5 x 5 cells and check if they provide a valid code. (It might be required to rotate the code to get the valid one.)

To get the rectangle marker image, we have to unwarp the input image using perspective transformation. This matrix can be calculated with the help of the cv::getPerspectiveTransform function. It finds the perspective transformation from four pairs of corresponding points. The first argument is the marker coordinates in image space and the second point corresponds to the coordinates of the square marker image. Estimated transformation will transform the marker to square form and let us analyze it:

cv::Mat canonicalMarker; Marker& marker = detectedMarkers[i]; // Find the perspective transfomation that brings current marker to rectangular form cv::Mat M = cv::getPerspectiveTransform(marker.points, m_ markerCorners2d); // Transform image to get a canonical marker image cv::warpPerspective(grayscale, canonicalMarker, M, markerSize);

Image warping transforms our image to a rectangle form using perspective transformation:

Now we can test the image to verify if it is a valid marker image. Then we try to extract the bit mask with the marker code. As we expect our marker to contain only black and white colors, we can perform Otsu thresholding to remove gray pixels and leave only black and white pixels:

//threshold image cv::threshold(markerImage, markerImage, 125, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

Marker code recognition

Each marker has an internal code given by 5 words of 5 bits each. The codification employed is a slight modification of the hamming code. In total, each word has only 2 bits of information out of the 5 bits employed. The other 3 are employed for error detection. As a consequence, we can have up to 1024 different IDs.

The main difference with the hamming code is that the first bit (parity of bits 3 and 5) is inverted. So, ID 0 (which in hamming code is 00000) becomes 10000 in our code. The idea is to prevent a completely black rectangle from being a valid marker ID, with the goal of reducing the likelihood of false positives with objects of the environment.

Counting the number of black and white pixels for each cell gives us a 5 x 5-bit mask with marker code. To count the number of non-zero pixels on a certain image, the cv::countNonZero function is used. This function counts non-zero array elements from a given 1D or 2D array. The cv::Mat type can return a subimage view—a new instance of cv::Mat that contains a portion of the original image. For example, if you have a cv::Mat of size 400 x 400, the following piece of code will create a submatrix for the 50 x 50 image block starting from (10, 10):

cv::Mat src(400,400,CV_8UC1); cv::Rect r(10,10,50,50); cv::Mat subView = src(r);

Reading marker code

Using this technique, we can easily find black and white cells on the marker board:

cv::Mat bitMatrix = cv::Mat::zeros(5,5,CV_8UC1); //get information(for each inner square, determine if it is black or white) for (int y=0;y<5;y++) { for (int x=0;x<5;x++) { int cellX = (x+1)*cellSize; int cellY = (y+1)*cellSize; cv::Mat cell = grey(cv::Rect(cellX,cellY,cellSize,cellSize)); int nZ = cv::countNonZero(cell); if (nZ> (cellSize*cellSize) /2)<uchar>(y,x) = 1; } }

Take a look at the following figure. The same marker can have four possible representations depending on the camera's point of view:

As there are four possible orientations of the marker picture, we have to find the correct marker position. Remember, we introduced three parity bits for each two bits of information. With their help we can find the hamming distance for each possible marker orientation. The correct marker position will have zero hamming distance error, while the other rotations won't.

Here is a code snippet that rotates the bit matrix four times and finds the correct marker orientation:

//check all possible rotations cv::Mat rotations[4]; int distances[4]; rotations[0] = bitMatrix; distances[0] = hammDistMarker(rotations[0]); std::pair<int,int> minDist(distances[0],0); for (int i=1; i<4; i++) { //get the hamming distance to the nearest possible word rotations[i] = rotate(rotations[i-1]); distances[i] = hammDistMarker(rotations[i]); if (distances[i] < minDist.first) { minDist.first = distances[i]; minDist.second = i; } }

This code finds the orientation of the bit matrix in such a way that it gives minimal error for the hamming distance metric. This error should be zero for correct marker ID; if it's not, it means that we encountered a wrong marker pattern (corrupted image or false-positive marker detection).

Marker location refinement

After finding the right marker orientation, we rotate the marker's corners respectively to conform to their order:

//sort the points so that they are always in the same order // no matter the camera orientation std::rotate(marker.points.begin(), marker.points.begin() + 4 - nRotations, marker.points.end());

After detecting a marker and decoding its ID, we will refine its corners. This operation will help us in the next step when we will estimate the marker position in 3D. To find the corner location with subpixel accuracy, the cv::cornerSubPix function is used:

std::vector<cv::Point2f> preciseCorners(4 * goodMarkers.size()); for (size_t i=0; i<goodMarkers.size(); i++) { Marker& marker = goodMarkers[i]; for (int c=0;c<4;c++) { preciseCorners[i*4+c] = marker.points[c]; } } cv::cornerSubPix(grayscale, preciseCorners, cvSize(5,5), cvSize(-1,-1), cvTermCriteria(CV_TERMCRIT_ITER,30,0.1)); //copy back for (size_t i=0;i<goodMarkers.size();i++) { Marker&marker = goodMarkers[i]; for (int c=0;c<4;c++) { marker.points[c] = preciseCorners[i*4+c]; } }

The first step is to prepare the input data for this function. We copy the list of vertices to the input array. Then we call cv::cornerSubPix, passing the actual image, list of points, and set of parameters that affect quality and performance of location refinement. When done, we copy the refined locations back to marker corners as shown in the following image.

We do not use cornerSubPix in the earlier stages of marker detection due to its complexity. It's very expensive to call this function for large numbers of points (in terms of computation time). Therefore we do this only for valid markers.

Placing a marker in 3D

Augmented Reality tries to fuse the real-world object with virtual content. To place a 3D model in a scene, we need to know its pose with regard to a camera that we use to obtain the video frames. We will use a Euclidian transformation in the Cartesian coordinate system to represent such a pose

The position of the marker in 3D and its corresponding projection in 2D is restricted by the following equation:

P = A * [R|T] * M;


  • M denotes a point in a 3D space

  • [R|T] denotes a [3|4] matrix representing a Euclidian transformation

  • A denotes a camera matrix or a matrix of intrinsic parameters

  • P denotes projection of M in screen space

After performing the marker detection step we now know the position of the four marker corners in 2D (projections in screen space). In the next section you will learn how to obtain the A matrix and M vector parameters and calculate the [R|T] transformation.

Camera calibration

Each camera lens has unique parameters, such as focal length, principal point, and lens distortion model. The process of finding intrinsic camera parameters is called camera calibration. The camera calibration process is important for Augmented Reality applications because it describes the perspective transformation and lens distortion on an output image. To achieve the best user experience with Augmented Reality, visualization of an augmented object should be done using the same perspective projection.

To calibrate the camera, we need a special pattern image (chessboard plate or black circles on white background). The camera that is being calibrated takes 10-15 shots of this pattern from different points of view. A calibration algorithm then finds the optimal camera intrinsic parameters and the distortion vector:

To represent camera calibration in our program, we use the CameraCalibration class:

/** * A camera calibration class that stores intrinsic matrix and distorsion coefficients. */ class CameraCalibration { public: CameraCalibration(); CameraCalibration(float fx, float fy, float cx, float cy); CameraCalibration(float fx, float fy, float cx, float cy, float distorsionCoeff[4]); void getMatrix34(float cparam[3][4]) const; const Matrix33& getIntrinsic() const; const Vector4& getDistorsion() const; private: Matrix33 m_intrinsic; Vector4 m_distorsion; };

Detailed explanation of the calibration procedure is beyond the scope of this article. Please refer to OpenCV camera_calibration sample or OpenCV: Estimating Projective Relations in Images at for additional information and source code.

For this sample we provide internal parameters for all modern iOS devices (iPad 2, iPad 3, and iPhone 4).

Marker pose estimation

With the precise location of marker corners, we can estimate a transformation between our camera and a marker in 3D space. This operation is known as pose estimation from 2D-3D correspondences. The pose estimation process finds a Euclidean transformation (that consists only of rotation and translation components) between the camera and the object.

Let's take a look at the following figure:

The C is used to denote the camera center. The P1-P4 points are 3D points in the world coordinate system and the p1-p4 points are their projections on the camera's image plane. Our goal is to find relative transformation between a known marker position in the 3D world (p1-p4) and the camera C using an intrinsic matrix and known point projections on image plane (P1-P4). But where do we get the coordinates of marker position in 3D space? We imagine them. As our marker always has a square form and all vertices lie in one plane, we can define their corners as follows:

We put our marker in the XY plane (Z component is zero) and the marker center corresponds to the (0.0, 0.0, 0.0) point. It's a great hint, because in this case the beginning of our coordinate system will be in the center of the marker (Z axis is perpendicular to the marker plane).

To find the camera location with the known 2D-3D correspondences, the cv::solvePnP function can be used:

void solvePnP(const Mat& objectPoints, const Mat& imagePoints, const Mat& cameraMatrix, const Mat& distCoeffs, Mat& rvec, Mat& tvec, bool useExtrinsicGuess=false);

The objectPoints array is an input array of object points in the object coordinate space. std::vector<cv::Point3f> can be passed here. The OpenCV matrix 3 x N or N x 3, where N is the number of points, can also be passed as an input argument. Here we pass the list of marker coordinates in 3D space (a vector of four points).

The imagePoints array is an array of corresponding image points (or projections). This argument can also be std::vector<cv::Point2f> or cv::Mat of 2 x N or N x 2, where N is the number of points. Here we pass the list of found marker corners.

  • cameraMatrix: This is the 3 x 3 camera intrinsic matrix.

  • distCoeffs: This is the input 4 x 1, 1 x 4, 5 x 1, or 1 x 5 vector of distortion coefficients (k1, k2, p1, p2, [k3]). If it is NULL, all of the distortion coefficients are set to 0.

  • rvec: This is the output rotation vector that (together with tvec) brings points from the model coordinate system to the camera coordinate system.

  • tvec: This is the output translation vector.

  • useExtrinsicGuess: If true, the function will use the provided rvec and tvec vectors as the initial approximations of the rotation and translation vectors, respectively, and will further optimize them.

The function calculates the camera transformation in such a way that it minimizes reprojection error, that is, the sum of squared distances between the observed projection's imagePoints and the projected objectPoints.

The estimated transformation is defined by rotation (rvec) and translation components (tvec). This is also known as Euclidean transformation or rigid transformation.

A rigid transformation is formally defined as a transformation that, when acting on any vector v, produces a transformed vector T(v) of the form:

T(v) = R v + t

where RT = R-1 (that is, R is an orthogonal transformation), and t is a vector giving the translation of the origin. A proper rigid transformation has, in addition,

det(R) = 1

This means that R does not produce a reflection, and hence it represents a rotation (an orientation-preserving orthogonal transformation).

To obtain a 3 x 3 rotation matrix from the rotation vector, the function cv::Rodrigues is used. This function converts a rotation represented by a rotation vector and returns its equivalent rotation matrix.

Because cv::solvePnP finds the camera position with regards to marker pose in 3D space, we have to invert the found transformation. The resulting transformation will describe a marker transformation in the camera coordinate system, which is much friendlier for the rendering process.

Here is a listing of the estimatePosition function, which finds the position of the detected markers:

void MarkerDetector::estimatePosition(std::vector<Marker>& detectedMarkers) { for (size_t i=0; i<detectedMarkers.size(); i++) { Marker& m = detectedMarkers[i]; cv::Mat Rvec; cv::Mat_<float> Tvec; cv::Mat raux,taux; cv::solvePnP(m_markerCorners3d, m.points, camMatrix, distCoeff,raux,taux); raux.convertTo(Rvec,CV_32F); taux.convertTo(Tvec ,CV_32F); cv::Mat_<float> rotMat(3,3); cv::Rodrigues(Rvec, rotMat); // Copy to transformation matrix m.transformation = Transformation(); for (int col=0; col<3; col++) { for (int row=0; row<3; row++) { m.transformation.r().mat[row][col] = rotMat(row,col); // Copy rotation component } m.transformation.t().data[col] = Tvec(col); // Copy translation component } // Since solvePnP finds camera location, w.r.t to marker pose, to get marker pose w.r.t to the camera we invert it. m.transformation = m.transformation.getInverted(); }


In this article we learned how to create a mobile Augmented Reality application for iPhone/iPad devices. You gained knowledge on how to use the OpenCV library within the XCode projects to create stunning state-of-the-art applications.

Usage of OpenCV enables your application to perform complex image processing computations on mobile devices with real-time performance.

From this article you also learned how to perform the initial image processing (translation in shades of gray and binarization), how to find closed contours in the image and approximate them with polygons, how to find markers in the image and decode them, and how to compute the marker position in space.

Resources for Article :

Further resources on this subject:

Books to Consider

comments powered by Disqus

An Introduction to 3D Printing

Explore the future of manufacturing and design  - read our guide to 3d printing for free