How can I determine distance from an object in a video?

I have a video file recorded from the front of a moving vehicle. I am going to use OpenCV for object detection and recognition but I'm stuck on one aspect. How can I determine the distance from a recognized object.

I can know my current speed and real-world GPS position but that is all. I can't make any assumptions about the object I'm tracking. I am planning to use this to track and follow objects without colliding with them. Ideally I would like to use this data to derive the object's real-world position, which I could do if I could determine the distance from the camera to the object.

Your problem's quite standard in the field.

Firstly,

you need to calibrate your camera. This can be done offline (makes life much simpler) or online through self-calibration.

Calibrate it offline - please.

Secondly,

Once you have the calibration matrix of the camera K, determine the projection matrix of the camera in a successive scene (you need to use parallax as mentioned by others). This is described well in this OpenCV tutorial.

You'll have to use the GPS information to find the relative orientation between the cameras in the successive scenes (that might be problematic due to noise inherent in most GPS units), i.e. the R and t mentioned in the tutorial or the rotation and translation between the two cameras.

Once you've resolved all that, you'll have two projection matrices --- representations of the cameras at those successive scenes. Using one of these so-called camera matrices, you can "project" a 3D point M on the scene to the 2D image of the camera on to pixel coordinate m (as in the tutorial).

We will use this to triangulate the real 3D point from 2D points found in your video.

Thirdly,

use an interest point detector to track the same point in your video which lies on the object of interest. There are several detectors available, I recommend SURF since you have OpenCV which also has several other detectors like Shi-Tomasi corners, Harris, etc.

Fourthly,

Once you've tracked points of your object across the sequence and obtained the corresponding 2D pixel coordinates you must triangulate for the best fitting 3D point given your projection matrix and 2D points.

The above image nicely captures the uncertainty and how a best fitting 3D point is computed. Of course in your case, the cameras are probably in front of each other!

Finally,

Once you've obtained the 3D points on the object, you can easily compute the Euclidean distance between the camera center (which is the origin in most cases) and the point.

Note

This is obviously not easy stuff but it's not that hard either. I recommend Hartley and Zisserman's excellent book Multiple View Geometry which has described everything above in explicit detail with MATLAB code to boot.

Have fun and keep asking questions!

When you have moving video, you can use temporal parallax to determine the relative distance of objects. Parallax: (definition).

The effect would be the same we get with our eyes which which can gain depth perception by looking at the same object from slightly different angles. Since you are moving, you can use two successive video frames to get your slightly different angle.

Using parallax calculations, you can determine the relative size and distance of objects (relative to one another). But, if you want the absolute size and distance, you will need a known point of reference.

You will also need to know the speed and direction being traveled (as well as the video frame rate) in order to do the calculations. You might be able to derive the speed of the vehicle using the visual data but that adds another dimension of complexity.

The technology already exists. Satellites determine topographic prominence (height) by comparing multiple images taken over a short period of time. We use parallax to determine the distance of stars by taking photos of night sky at different points in earth's orbit around the sun. I was able to create 3-D images out of an airplane window by taking two photographs within short succession.

The exact technology and calculations (even if I knew them off the top of my head) are way outside the scope of discussing here. If I can find a decent reference, I will post it here.

You need to identify the same points in the same object on two different frames taken a known distance apart. Since you know the location of the camera in each frame, you have a baseline ( the vector between the two camera positions. Construct a triangle from the known baseline and the angles to the identified points. Trigonometry gives you the length of the unknown sides of the traingles for the known length of the baseline and the known angles between the baseline and the unknown sides.

You can use two cameras, or one camera taking successive shots. So, if your vehicle is moving a 1 m/s and you take fames every second, then successibe frames will gibe you a 1m baseline which should be good to measure the distance of objects up to, say, 5m away. If you need to range objects further away than the frames used need to be further apart - however more distant objects will in view for longer.

Observer at F1 sees target at T with angle a1 to velocity vector. Observer moves distance b to F2. Sees target at T with angle a2.

Required to find r1, range from target at F1

The trigonometric identity for cosine gives

Cos( 90 – a1 ) = x / r1 = c1

Cos( 90 - a2 ) = x / r2 = c2

Cos( a1 ) = (b + z) / r1 = c3

Cos( a2 ) = z / r2 = c4

x is distance to target orthogonal to observer’s velocity vector

z is distance from F2 to intersection with x

Solving for r1

r1 = b / ( c3 – c1 . c4 / c2 )

Two cameras so you can detect parallax. It's what humans do.

edit

Please see ravenspoint's answer for more detail. Also, keep in mind that a single camera with a splitter would probably suffice.