We know that all of the objects in the picture below are chairs, right? But show this picture to a computer and see what happens. Getting computers to recognize generic objects is a hugely difficult task, complicated by variation within a single class of object (club chairs vs. office chairs vs. folding chairs) and by other objects in the computer’s field of view that can confuse the machine.

Researchers at iRobot have been working hard on object recognition. iRobot is best known as the company behind the Roomba vacuum cleaner, but they’ve been doing a lot more than just vacuuming up the dander – these guys are doing serious research and development.

At last week’s GPU Technology Conference, the team from iRobot revealed that they’ve developed the first real-time generic object recognition algorithm. It builds on the Deformable Part Model (DPM), which rests on the idea that objects are made of parts, and that the way those parts are configured in relation to one another is what distinguishes a person from a chair or a car from a boat.

The ‘deformable’ in the model’s name is key: it refers to the model’s ability to understand that the parts of the same object will look very different when viewed from various angles. One example in the academic paper “Object Detection with Discriminatively Trained Part-Based Models” (by Felzenszwalb, Girshick, McAllester, and Ramanan) shows how the model can correctly identify a bicycle whether it’s viewed head-on or from the side.
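To make the parts-plus-deformation idea concrete, here is a minimal sketch of how a DPM-style score might be assembled for one candidate object location, loosely following the formulation in the Felzenszwalb et al. paper: the root filter’s response plus, for each part, the best nearby placement after paying a quadratic deformation penalty. The struct layout, function names, and toy numbers are illustrative assumptions, not iRobot’s code, and the filter responses are assumed to be precomputed.

```cpp
// Minimal sketch of a DPM-style score for one candidate root location.
// Assumes the filter response maps have already been computed; the data
// layout and names here are illustrative only.
#include <vector>
#include <cfloat>
#include <cstdio>

struct Part {
    std::vector<float> response;  // precomputed part-filter response map, row-major
    int w, h;                     // dimensions of the response map
    int ax, ay;                   // anchor position of the part relative to the root
    float dx, dy;                 // quadratic deformation weights
};

// Score one root placement: root response plus, for each part, the best
// placement near its anchor after subtracting a quadratic deformation cost.
float scoreRoot(float rootResponse, const std::vector<Part>& parts,
                int rootX, int rootY, int search, float bias) {
    float score = rootResponse + bias;
    for (const Part& p : parts) {
        float best = -FLT_MAX;
        for (int oy = -search; oy <= search; ++oy) {
            for (int ox = -search; ox <= search; ++ox) {
                int px = rootX + p.ax + ox;
                int py = rootY + p.ay + oy;
                if (px < 0 || py < 0 || px >= p.w || py >= p.h) continue;
                float deform = p.dx * ox * ox + p.dy * oy * oy;
                float s = p.response[py * p.w + px] - deform;
                if (s > best) best = s;
            }
        }
        score += best;
    }
    return score;
}

int main() {
    // Toy example: one part whose strongest response sits slightly off its anchor.
    Part p;
    p.w = 8; p.h = 8;
    p.response.assign(p.w * p.h, 0.1f);
    p.response[3 * p.w + 4] = 2.0f;
    p.ax = 4; p.ay = 4; p.dx = 0.5f; p.dy = 0.5f;

    std::vector<Part> parts = {p};
    printf("toy DPM score: %.2f\n", scoreRoot(1.0f, parts, 0, 0, 2, -0.5f));
    return 0;
}
```

In the full detector a score like this has to be evaluated at every candidate root position and scale, which is exactly where the per-pixel arithmetic discussed below starts to pile up.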

Not surprisingly, the concepts and math underlying the DPM are highly complex. But that complexity is what makes the DPM robust in the face of varying viewing angles, different scales, cluttered fields of view, and other visual noise that would confound other models. Training consists of showing the model roughly 1,000 examples of a particular ‘class’ of object; once trained, the system can identify complex objects with a high rate of accuracy. As long as you don’t need to do it quickly, that is.
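The robustness to scale, in particular, typically comes from running the filters over an image (or feature) pyramid rather than from any single filter being scale-invariant. The sketch below builds a crude pyramid by repeated 2x box downsampling and calls a placeholder detection pass at each level; the downsampling scheme and the detectAtScale stub are stand-ins for illustration, not the actual DPM feature pyramid.

```cpp
// Sketch: multi-scale detection by running a detector over an image pyramid.
// The 2x box downsampling and the detectAtScale stub are illustrative only.
#include <vector>
#include <cstdio>

struct Image {
    int w, h;
    std::vector<float> pix;   // grayscale pixels, row-major
};

// Halve an image with 2x2 box averaging.
Image downsample2x(const Image& in) {
    Image out;
    out.w = in.w / 2;
    out.h = in.h / 2;
    out.pix.resize(out.w * out.h);
    for (int y = 0; y < out.h; ++y)
        for (int x = 0; x < out.w; ++x)
            out.pix[y * out.w + x] = 0.25f *
                (in.pix[(2 * y) * in.w + 2 * x]     + in.pix[(2 * y) * in.w + 2 * x + 1] +
                 in.pix[(2 * y + 1) * in.w + 2 * x] + in.pix[(2 * y + 1) * in.w + 2 * x + 1]);
    return out;
}

// Placeholder for the real per-scale pass (filter responses, part placement, scoring).
void detectAtScale(const Image& img, int level) {
    printf("level %d: %dx%d\n", level, img.w, img.h);
}

int main() {
    Image img;
    img.w = 640; img.h = 480;
    img.pix.assign(img.w * img.h, 0.0f);

    // Run the detector at each pyramid level until the image gets too small
    // for the smallest object model we care about.
    int level = 0;
    for (Image cur = img; cur.w >= 64 && cur.h >= 64; cur = downsample2x(cur))
        detectAtScale(cur, level++);
    return 0;
}
```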

Running DPM is highly compute-intensive. Each pixel takes around 100,000 floating point operations, with 10,000 floats loaded and 1,000 stored. At the image level, a VGA image consumes 10 billion floating point operations, with a billion floats loaded and 100 million stored. That’s 10 GFLOP of work per image, to put it in HPC performance terms. For context, using LINPACK as the yardstick, an iPad 2 can crank out 1.65 GFLOP/s, while a middling desktop i5 system can drive around 40 GFLOP/s; so while it’s compute-intensive, it’s not insurmountable.

Not insurmountable, that is, unless you’re talking about HD images, which ramp up the number of compute operations by at least a factor of four. Or unless you’re using a lot of object classes, or trying to identify many different objects in the same image. That can increase the computational load by orders of magnitude, taking it out of the realm of problems that can be easily handled by small systems.
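As a rough back-of-the-envelope check, the snippet below plugs the figures quoted above (about 10 GFLOP per VGA frame, 1.65 GFLOP/s for an iPad 2, roughly 40 GFLOP/s for a mid-range desktop i5) into a frame-time estimate, then scales it up for HD and for multiple object classes. The 20-class count and the assumption that work grows roughly linearly with the number of classes are illustrative assumptions, not figures from the talk.

```cpp
// Back-of-the-envelope frame-time estimate using the figures quoted above.
// All values are order-of-magnitude; the class count and linear scaling with
// classes are assumptions for illustration only.
#include <cstdio>

int main() {
    const double flopsPerVgaFrame   = 10e9;   // ~10 GFLOP per 640x480 frame
    const double ipad2FlopsPerSec   = 1.65e9; // iPad 2 LINPACK rate
    const double i5FlopsPerSec      = 40e9;   // mid-range desktop i5 LINPACK rate

    printf("VGA, one class:  iPad 2 ~%.1f s/frame, i5 ~%.2f s/frame\n",
           flopsPerVgaFrame / ipad2FlopsPerSec, flopsPerVgaFrame / i5FlopsPerSec);

    // HD costs at least 4x more per frame, and each additional object class
    // adds roughly its own set of filter evaluations (an assumed linear scaling).
    const double hdFactor   = 4.0;
    const int    numClasses = 20;   // example class count
    const double flopsHdMulti = flopsPerVgaFrame * hdFactor * numClasses;

    printf("HD, %d classes:  i5 ~%.0f s/frame (~%.1f TFLOP per frame)\n",
           numClasses, flopsHdMulti / i5FlopsPerSec, flopsHdMulti / 1e12);
    return 0;
}
```

At VGA with a single class, the numbers work out to roughly a quarter of a second per frame on the desktop i5; pile on HD resolution and a couple of dozen classes and the same machine is looking at tens of seconds per frame, which is why an accelerator becomes attractive.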

To speed up the process and bring it to real time, the iRobot team looked at the various accelerator options (GPUs vs. FPGAs vs. DSPs), with the best efficiency per watt, and how that efficiency improves over time, as their yardstick. They found that all three technologies were in the same ballpark, but GPUs were the easiest to develop for, since they could write C++ code and compile it for CUDA.

In their implementation, they ran almost every function on the GPU, using the CPU primarily to read in each image and transfer it, in its entirety, to the GPU. Their first attempt, covering 20 different object classes, benchmarked at 1.5 seconds, which was 40x faster than their CPU-only implementation. Still, it was a disappointment compared with what the CPU+GPU hardware could potentially deliver, achieving only 6% memory efficiency and 3% computational efficiency. So… back to the drawing board to see what could be done.
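That division of labor, with the CPU doing little more than shipping the frame to the card, is easy to picture with a stripped-down CUDA sketch: upload the image to device memory once, do the heavy per-pixel filter work in a kernel, and copy back only the resulting score map. This is a generic stand-in for that structure, not iRobot’s implementation; the 5x5 box filter is just a placeholder for the much heavier stack of DPM filter correlations.

```cpp
// Sketch of the CPU-feeds-GPU structure described above (not iRobot's code):
// the host uploads the frame once, a kernel does the per-pixel filter work,
// and only the result comes back. Build with: nvcc -o demo demo.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder per-pixel work: a 5x5 box filter standing in for the much
// larger set of DPM root/part filter correlations.
__global__ void filterResponse(const float* img, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int dy = -2; dy <= 2; ++dy)
        for (int dx = -2; dx <= 2; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);   // clamp to the image border
            int yy = min(max(y + dy, 0), h - 1);
            sum += img[yy * w + xx];
        }
    out[y * w + x] = sum / 25.0f;
}

int main() {
    const int w = 640, h = 480;
    std::vector<float> frame(w * h, 1.0f), scores(w * h, 0.0f);

    float *dImg = nullptr, *dOut = nullptr;
    cudaMalloc((void**)&dImg, w * h * sizeof(float));
    cudaMalloc((void**)&dOut, w * h * sizeof(float));

    // The CPU's main job in this structure: get the frame onto the GPU.
    cudaMemcpy(dImg, frame.data(), w * h * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    filterResponse<<<grid, block>>>(dImg, dOut, w, h);

    // Only the score map (in a real system, a short list of detections)
    // travels back to the host.
    cudaMemcpy(scores.data(), dOut, w * h * sizeof(float), cudaMemcpyDeviceToHost);
    printf("score at center: %.2f\n", scores[(h / 2) * w + w / 2]);

    cudaFree(dImg);
    cudaFree(dOut);
    return 0;
}
```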

The first task was to optimize the kernels and reduce the number of kernel calls, which paid off in a big way. Memory efficiency increased to 22% of potential, and computational efficiency more than tripled to 10%. Overall, these optimizations dropped the time to identify 20 classes of objects from 1.5 seconds to 0.6 seconds, nearly a 3x improvement over their first GPU effort and 100x faster than what they could do with CPUs alone.
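The session didn’t spell out the individual kernel changes, but ‘reduce kernel calls’ usually means fusing back-to-back kernels so intermediate values stay in registers instead of bouncing through global memory, and so each launch’s overhead is paid once instead of several times. The before/after sketch below shows that general technique on a trivial scale-then-threshold pipeline; it illustrates the idea, not the specific optimizations iRobot made.

```cpp
// General illustration of kernel fusion (not iRobot's specific change):
// two passes that each read and write global memory are merged into one,
// removing a kernel launch and a full round trip through device memory.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Unfused version: two launches, with 'tmp' written and re-read in between.
__global__ void scaleKernel(const float* in, float* tmp, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = in[i] * s;
}
__global__ void thresholdKernel(const float* tmp, float* out, float t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] > t ? tmp[i] : 0.0f;
}

// Fused version: one launch, the intermediate value never leaves a register.
__global__ void scaleThresholdKernel(const float* in, float* out,
                                     float s, float t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] * s;
        out[i] = v > t ? v : 0.0f;
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 0.5f), result(n);

    float *dIn, *dTmp, *dOut;
    cudaMalloc((void**)&dIn,  n * sizeof(float));
    cudaMalloc((void**)&dTmp, n * sizeof(float));
    cudaMalloc((void**)&dOut, n * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256, grid = (n + block - 1) / block;

    // Two launches plus an extra global-memory round trip...
    scaleKernel<<<grid, block>>>(dIn, dTmp, 2.0f, n);
    thresholdKernel<<<grid, block>>>(dTmp, dOut, 0.75f, n);

    // ...versus one launch that does the same work.
    scaleThresholdKernel<<<grid, block>>>(dIn, dOut, 2.0f, 0.75f, n);

    cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("result[0] = %.2f\n", result[0]);

    cudaFree(dIn); cudaFree(dTmp); cudaFree(dOut);
    return 0;
}
```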

To drive home the ‘real time’ part of their achievement, the iRobot folks showed a video of an experiment in which objects were placed in front of a camera attached to a laptop – not just any laptop, but an Alienware machine equipped with dual NVIDIA GeForce GTX 580M GPUs. The system worked very quickly, identifying model cars, planes, and other objects often before they had completely left the experimenter’s hand.


The system could also identify disparate items like Pringles chips and toilet paper placed against cluttered backgrounds, and could correctly identify an object despite seeing only part of it (the legs and lower body of a horse, for instance). Compound objects, like a horse and rider, were also quickly labeled.

It’s a great demonstration of just how far machine vision has come. There are many applications for this kind of technology, particularly when it can run in real time, as iRobot demonstrated in their GTC 2013 session, and those applications are far more important, and potentially life-changing, than the ability to pick out Pringles and toilet paper. Real-time recognition could increase point-of-sale accuracy and speed by identifying items that aren’t easily bar coded or scanned, or that don’t carry RFID chips. It could also enable very fast tracking or sorting of objects speeding by on a warehouse conveyor belt, while allowing haphazard item placement.
