Saturday, April 09, 2011

Image-based rendering and some ancient history

Introduction
I've not said much about my work in visual effects in this blog. This is mainly because I try very carefully to avoid any kind of intellectual property conflict with my employer. But now that I've left the world of visual effects to work at Google, I think I might throw in the occasional article. So here's something about my own work from many years ago. In particular, back in the year 2000 my colleagues George and Kim and I received an Academy Scientific and Technical Achievement Award for "the development of a system for image-based rendering allowing choreographed camera movements through computer graphic reconstructed sets." I rarely tell people what this work was for, but now I'll have a convenient web page to which I can direct anyone who asks. I'm aiming this at people who know a little mathematics and geometry, but not necessarily anything about computer graphics.

Many years ago I worked at a company called Manex Visual Effects. One day the story of that company needs to be told. Here are some things that have been alleged about Manex, which I neither confirm nor deny: its stock was traded in one of the largest ever pump-and-dump scams in the Far East; it extracted millions of dollars from Trenton, New Jersey for a fake scheme to create an East Coast "Hollywood"; and it spent a couple of years putting out press releases about its work on the Matrix sequels despite the fact that nobody at the company was actually working on those movies, including press releases quoting me describing the ongoing work long after I had left. At one point I was apparently being trailed by a private detective, who even rowed a boat past the back of my house to get pictures of me. I haven't checked the fine print, but I think the contracts that prevented me from speaking about any of this expired long ago.

But back around 1998-2000, when Manex was doing real effects work, we developed a pipeline for a technique known as image-based rendering. We became adept at taking large numbers of photographs of locations and then reproducing those locations as photorealistic renders. When I say photorealistic here I don't mean something that's merely supposed to look real but actually looks fake; I mean renders that were indistinguishable from photography. That's commonplace now, but back in 2000 it was a challenge, especially on a large scale.

The last ten seconds of this video clip from MI:2 should give some idea of what I'm talking about. The city around Tom Cruise as he jumps from the building is entirely CGI, but using real photography:



Texturing Cities
Although MI:2 isn't set in Sydney, that is the location that was used. A team returned with many hundreds, if not thousands, of photographs of the central business district. We also managed to construct 3D models of these buildings by various means. We started by using Façade to construct the geometry, but ultimately we used a variety of methods, including buying 3D models from, I think, the city itself. In later years I completely replaced the way we built geometry so that we didn't need to use any other source.

The goal was to map the photography onto the models. Here's a picture illustrating three points being mapped onto a building. To render correctly we needed every visible point on every 3D object to be coloured (correct terminology: textured) using a pixel from one of the photographs.

The Matrix
We're ready for a theorem. Let P be a photograph of a scene S. Let (u, v) be ordinary rectilinear coordinates in P, and let (x, y, z) be 3D coordinates in the scene S. Define proj(u, v, w) = (u/w, v/w). Then there is a 3×4 matrix M such that for every point (x, y, z) visible in S, its colour is given by the colour of the point with coordinates proj(M(x, y, z, 1)^T) in P.

(This assumes a pinhole projection model for the camera. In fact, real cameras have lens distortion. I wrote software to measure and fix this.)
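
In code, the mapping in the theorem is just a matrix multiplication followed by a division. Here's a minimal sketch in Python with NumPy; the matrix values and function names are made up purely for illustration and aren't from our actual pipeline:

    import numpy as np

    def proj(p):
        # proj(u, v, w) = (u/w, v/w): the perspective division from the theorem.
        u, v, w = p
        return np.array([u / w, v / w])

    def texture_lookup_coords(M, point):
        # Map a 3D scene point to (u, v) coordinates in the photograph
        # using the camera's 3x4 matrix M.
        x, y, z = point
        return proj(M @ np.array([x, y, z, 1.0]))

    # A made-up camera matrix and scene point.
    M = np.array([[800.0,   0.0, 320.0, 0.0],
                  [  0.0, 800.0, 240.0, 0.0],
                  [  0.0,   0.0,   1.0, 0.0]])
    print(texture_lookup_coords(M, (1.0, 2.0, 10.0)))  # -> [400. 400.]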

The important point is that for each camera we just needed one matrix. We generated it by a very simple scheme: we had artists mark correspondences between points on our 3D models and points in the photography, and I implemented a least squares solver to find the matrix that best fit each view. Pity the artists. They came fresh from college with well-developed 3D modelling skills and I was responsible for reducing their careers to this simple point-and-click operation. But don't feel too bad. Most of these artists have since gone on to very successful careers in visual effects. They were a really great team.
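
For the curious, here's one standard way to do that least squares fit, the direct linear transform: each correspondence (x, y, z) ↔ (u, v) gives two linear equations in the twelve entries of M, and the best matrix (up to scale) is the right singular vector with the smallest singular value. This is an illustrative sketch rather than the solver we actually used:

    import numpy as np

    def solve_camera_matrix(points_3d, points_2d):
        # Two rows per correspondence; the twelve unknowns are the entries
        # of the 3x4 matrix M read off row by row.
        rows = []
        for (x, y, z), (u, v) in zip(points_3d, points_2d):
            X = np.array([x, y, z, 1.0])
            rows.append(np.concatenate([X, np.zeros(4), -u * X]))
            rows.append(np.concatenate([np.zeros(4), X, -v * X]))
        A = np.array(rows)           # needs at least 6 correspondences
        _, _, Vt = np.linalg.svd(A)
        return Vt[-1].reshape(3, 4)  # smallest singular vector, reshaped to M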

But just mapping a single photograph is no good. The theorem tells us what to do for each point visible in a photograph, but how do we select which photograph to use? When producing the final render for production we had the luxury of being able to allow hours to render a single frame. We could make the decision on a pixel by pixel basis by performing a test for visibility from each camera, and then using heuristics to make the selection. (My colleague George had already proved this worked for the backgrounds to most of the bullet-time shots from The Matrix.)
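
For concreteness, here's one straightforward way to test whether a point is visible from a camera: check that no triangle in the scene blocks the line of sight, using the standard Möller-Trumbore intersection test. This is an illustrative Python sketch, not the production renderer, and the function names are my own:

    import numpy as np

    def segment_blocked_by_triangle(p, q, tri, eps=1e-9):
        # Moller-Trumbore: does the open segment from p to q cross triangle tri?
        a, b, c = (np.asarray(v, dtype=float) for v in tri)
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        d = q - p
        e1, e2 = b - a, c - a
        h = np.cross(d, e2)
        det = np.dot(e1, h)
        if abs(det) < eps:
            return False                    # segment parallel to triangle plane
        f = 1.0 / det
        s = p - a
        u = f * np.dot(s, h)
        if u < 0.0 or u > 1.0:
            return False
        qv = np.cross(s, e1)
        v = f * np.dot(d, qv)
        if v < 0.0 or u + v > 1.0:
            return False
        t = f * np.dot(e2, qv)
        return eps < t < 1.0 - eps          # hit strictly between p and q

    def visible_from_camera(point, camera_pos, scene_triangles):
        # A point is visible if nothing blocks the line of sight to the camera.
        return not any(segment_blocked_by_triangle(point, camera_pos, tri)
                       for tri in scene_triangles)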

But that was no good for the artists. They needed to see their work as they were doing it. If we could get the render time down to seconds it would be good. In fact, we got it down to a fraction of a second.

The Splitter
The first thing to note is that if we render a single triangle then the mapping defined by the theorem above is provided by modern graphics hardware. Back in 2000 it wasn't universally available, but it was in the SGI hardware we used. So we only needed to solve the camera selection problem fast. The solution was simple: we'd preprocess the geometry of the scene by splitting it into pieces (reducible to triangles), each piece corresponding to a part visible from one camera. Debevec et al. had already given demos of a technique for approximately partitioning a scene between cameras like this, but it didn't have the quality we wanted.

(By the way, my career in graphics started when I figured out an algorithm for eliminating the division from the proj() function and my new boss-to-be noticed I'd posted it online.)

There were three things needed:

1. For each camera, we needed to compute the subscene that was the intersection of the scene S with the volume viewed by the camera, its frustum.

2. For each subscene we needed to remove any parts occluded from view by any other part of the scene.

3. We needed to ensure that no two subscenes overlapped, i.e. that no part of the scene was assigned to more than one camera.

It seemed like a messy computational geometry problem. But after a little thought it dawned on me that most of the geometry work, at least as measured by computation time, could be factored into one tiny geometry routine and the rest was logic. Here's a picture of the algorithm:

It takes a convex polygon and slices it using a plane (indicated in grey). Now everything else follows. For example, step 1 above can be achieved like this: the frustum associated with a camera is a 6-sided convex polyhedron (or a 4- or 5-sided one if you want it to extend all the way to infinity and/or all the way back to the camera's point of projection). We can extract the part of the scene inside the frustum by slicing every polygon with each of the 6 planes in turn and keeping the pieces that fall inside.
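
Here's a sketch of that one tiny routine, and of step 1 built on top of it, in Python with NumPy. The plane representation (a point on the plane plus a normal) and all the names are just for illustration:

    import numpy as np

    def slice_polygon(poly, plane_point, plane_normal, eps=1e-9):
        # Slice a convex polygon (a list of 3D vertices in order) with a plane.
        # Returns (front, back): the piece on the normal side and the piece on
        # the other side. Either may be empty if the polygon doesn't cross the plane.
        front, back = [], []
        n = len(poly)
        for i in range(n):
            a = np.asarray(poly[i], dtype=float)
            b = np.asarray(poly[(i + 1) % n], dtype=float)
            da = np.dot(a - plane_point, plane_normal)
            db = np.dot(b - plane_point, plane_normal)
            if da >= -eps:
                front.append(a)
            if da <= eps:
                back.append(a)
            if (da > eps and db < -eps) or (da < -eps and db > eps):
                # The edge crosses the plane: add the crossing point to both pieces.
                t = da / (da - db)
                crossing = a + t * (b - a)
                front.append(crossing)
                back.append(crossing)
        return front, back

    def clip_to_frustum(poly, frustum_planes):
        # Step 1: keep only the part of a polygon inside a camera's frustum,
        # given as a list of (point_on_plane, inward_normal) pairs.
        piece = list(poly)
        for plane_point, inward_normal in frustum_planes:
            piece, _ = slice_polygon(piece, plane_point, inward_normal)
            if len(piece) < 3:
                return []               # nothing of this polygon is in view
        return piece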

Step 2 works a little like this:
There is a 5-sided polygon lying in front of a box, casting a shadow. The shadow volume is itself a 6-sided frustum (with a seventh side "at infinity", so to speak). So to remove the parts shadowed by a polygon we use the same slicing algorithm and collect up all the pieces that fall outside of the shadow volume. To remove everything that is occluded from this camera's view we simply remove the shadow volume corresponding to every single polygon in the scene. One of the virtues of image-based rendering is that the detail in the photography makes up for the simplicity of the geometry, keeping the polygon count low, so the total number of operations might not seem too bad. Unfortunately, every time you slice through the scene you risk doubling the total number of polygons, so in principle it could have taken worse than exponential time in the number of polygons. But I took the risk, and surprisingly the time to preprocess was measured in minutes for typical scenes. The individual pieces of geometry were very intricate because occlusion from any object could carve into a building. But like a jigsaw, the union of all the pieces gave something that looked just like a real city.
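
Here, for concreteness, is a sketch of that step-2 carving, reusing slice_polygon from the earlier sketch. It assumes a convex occluding polygon and a pinhole camera at a single point; as before, the names are just illustrative:

    import numpy as np

    def subtract_shadow_volume(poly, occluder, camera_pos):
        # Split `poly` into the pieces NOT hidden behind a convex occluder
        # polygon as seen from the camera. The shadow volume is bounded by one
        # plane per occluder edge (through the camera) plus the occluder's own
        # plane; anything outside at least one of those planes is unoccluded.
        camera_pos = np.asarray(camera_pos, dtype=float)
        occ = [np.asarray(v, dtype=float) for v in occluder]
        centroid = np.mean(occ, axis=0)

        planes = []
        # Side planes through the camera and each occluder edge, oriented so
        # that the occluder's centroid lies on the inside (negative side).
        for i in range(len(occ)):
            e0, e1 = occ[i], occ[(i + 1) % len(occ)]
            normal = np.cross(e0 - camera_pos, e1 - camera_pos)
            if np.dot(centroid - camera_pos, normal) > 0:
                normal = -normal
            planes.append((camera_pos, normal))
        # The occluder's own plane, oriented so the camera is on the outside.
        cap_normal = np.cross(occ[1] - occ[0], occ[2] - occ[0])
        if np.dot(camera_pos - occ[0], cap_normal) < 0:
            cap_normal = -cap_normal
        planes.append((occ[0], cap_normal))

        visible, remaining = [], [list(poly)]
        for plane_point, normal in planes:
            still_inside = []
            for piece in remaining:
                outside, inside = slice_polygon(piece, plane_point, normal)
                if len(outside) >= 3:
                    visible.append(outside)       # outside the shadow volume
                if len(inside) >= 3:
                    still_inside.append(inside)   # possibly still in shadow
            remaining = still_inside
        # Whatever is left in `remaining` is inside every bounding plane,
        # i.e. fully in shadow, so it's discarded.
        return visible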

The last stage was ensuring that each part was assigned to just one camera. This was straightforward: because every polygon had been processed with respect to every camera, I could associate with each polygon a list of the cameras in which it was visible. At the end I could just sweep through and pick the best camera based on a heuristic. Typically we wanted to use the camera with the most undistorted detail.
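
As an illustration only (this isn't the heuristic we actually used), "most undistorted detail" could be approximated by favouring cameras that see a polygon head-on and from close by, since both give more pixels per square metre of wall. A hypothetical sketch:

    import numpy as np

    def pick_best_camera(polygon, camera_positions):
        # Return the index of the camera with the best view of the polygon,
        # scoring each camera by how head-on and how close its view is.
        verts = [np.asarray(v, dtype=float) for v in polygon]
        centroid = np.mean(verts, axis=0)
        normal = np.cross(verts[1] - verts[0], verts[2] - verts[0])
        normal /= np.linalg.norm(normal)

        def score(cam):
            to_cam = np.asarray(cam, dtype=float) - centroid
            dist = np.linalg.norm(to_cam)
            facing = abs(np.dot(to_cam / dist, normal))  # 1.0 means head-on
            return facing / (dist * dist)                # nearer is better

        return max(range(len(camera_positions)),
                   key=lambda i: score(camera_positions[i]))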

Along the way I had to apply a few more heuristics to make things manageable. Many thin slivers would appear in the slicing. If they were small enough I threw them away. I'd also sweep through from time to time and fuse together neighbouring polygons that had been sliced but ended up still being visible in the same cameras, and whose union was still convex. That would reduce the total polygon count and speed up further processing.

It worked. The artists would mark their correspondences, run the optimiser to extract the per-camera matrices, kick off the 'splitter', have a coffee, and then return to a fully interactive view of Sydney, or wherever. They could now examine it from all angles and quickly ascertain what further work was needed, e.g. whether there were holes in one of the buildings. That allowed us to develop an efficient workflow and carry out the work on the scale needed to complete movie shots. We could also interactively visualise how other objects interacted with our cities, and use the same tool to plan the photography and ensure we had all the coverage we needed.

In retrospect it all seems trivial. But at the time I don't think any other company could churn out entire city blocks the way we could. Today, Google do similar work on a much larger scale, and much more cleverly, with Street View.

And that was just one part of the technique that we marketed as "virtual cinematography" and which won George, Kim and me our award. But it's really important to remember that movie making is a big team effort. It took the work of many dozens of people to create our photorealistic cities, and the fact that I've chosen to write about my contribution doesn't mean that it was the only part that was important. It wouldn't have happened if George hadn't taught us the method originally, and if Kim hadn't believed in us and formed part of the company around our pipeline. And of course none of it would have happened if the artists hadn't politely put up with our crudely engineered tools.

Acknowledgement
The example city image above was derived from a picture by Calvin Teo on wikipedia. I don't have the original photography we used.

2 comments:

  1. Interesting! I wonder, how did you handle differences in lighting from pictures taken during the day?

  2. Sjoerd,

    At that time I think lighting variations would have been handled by the compositors 'manually' using tools like Shake (http://goo.gl/mqC5L) or Cineon (http://goo.gl/dGcKE). We developed some simple light filtering tools later, but I think we always relied mostly on human skill.

    I experimented with code to automate the process by looking at variations within polygons viewable from more than one camera but I was never able to get reliable results.
