This one isn't going to impress anyone but me, but it is something I've been wanting to try for years.
One of the motivating ideas is that a high-quality renderer can be an effective tool for image analysis as well as image generation. A Renderman-class photorealistic renderer can be thought of as a system for reasoning about the optical consequences of a particular arrangement of lights and objects. By forming a hypothesis of what the scene looks like and then rendering it and comparing to the input image, we can form some idea of how plausible our hypothesis is, taking into account shadows and indirect lighting and so forth to whatever degree of realism supported by the renderer.
Another motivating idea is that by analysing a scene in this way we can gain a reasonably accurate idea of the information content of our model, something which is otherwise difficult to do. Explaining this idea will require a detour into a tuppence lecture on the elementary application of Claude Shannon's information theory. (Since I am not a mathematician and haven't the foggiest clue about the serious mathematics of information theory, elementary applications of its simplest results are as far as I can take the discussion...) Feel free to skip this section:
Shannon defines a 'bit' to be the amount of information required to answer a yes/no question which we were initially completely uncertain about: The answer could just as easily be 'yes' as 'no', so far as we know. "Information" in this sense is a measure of the decrease in our ignorance of the world.
It is essential to distinguish between this information-theoretic meaning of "bit" and the colloquial computer meaning of "bit" as a unit of storage. They are very different, and it is most unfortunate that they wound up using the same word, but all attempts to change the terminology have failed, so we're stuck with it.
To bring the difference into clear focus, note that an empty recordable CD, containing nothing but zeros, has about six billion bits of storage but zero bits of information: Since we know ahead of time that all bits are zero, we learn nothing from examining any of them: Our ignorance of the universe is decreased not at all by reading the contents of the disk.
Shannon's information theory is usually cast in the form of an information stream much like an old-fashioned tickertape: A stream consisting of an indefinitely long sequence of symbols drawn from a fixed, finite "alphabet", emerging one at a time from some source. Depending on the application, the "alphabet" might consist of just the symbols 0 and 1, or it might be the ASCII character set encoding email in English, or it might be bytes of information being received from the Pioneer 10 spacecraft.
The basic question addressed is: How many bits of information does such a symbol carry, in general?
Shannon's answer: It carries log2( 1 / p ) bits of information, where p is the a priori probability of seeing that symbol appear at that point in the stream.
So if p is nearly 1.0 -- we were virtually certain that that symbol would appear next -- then the number of bits of information conveyed by it is nearly zero, as expected: It didn't tell us anything we didn't already know.
Conversely, if p is nearly 0.0 -- we were virtually certain that that symbol couldn't possibly appear next -- then the number of bits of information conveyed by it is enormous, as expected: The symbol told us that our understanding of the data stream pattern was very, very wrong, and thus taught us a great deal.
The final useful idea we need from elementary information theory is that such an information stream carries the maximum amount of information when the different symbols are equally probable. For example, if we are playing "Twenty Questions", trying to guess an object by asking twenty yes-or-no questions about it, then information theory tells us that we do best to ask questions which are equally likely to be answered "No" or "Yes": Otherwise, we are wasting bandwidth, learning less than we might per question, and reducing our chances of winning. (Simple arithmetic tells us that twenty such questions allow us to pick out one object from among at most a million possibilities -- suggesting that the world of the children who play this game contains roughly a million psychologically significant objects.)
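If you'd rather see those two results with numbers attached than take them on faith, a few lines of Perl will do it. (The probabilities below are arbitrary examples of mine, nothing more.)

#!/usr/bin/perl -w
# Toy illustration of Shannon's log2(1/p) measure of information.
use strict;

sub bits { my ($p) = @_; return log(1/$p) / log(2); }   # log2(1/p) bits

printf "Symbol we were 90%% sure of:   %.3f bits\n", bits(0.9);     # ~0.152 bits
printf "Fair coin flip:               %.3f bits\n", bits(0.5);     # exactly 1 bit
printf "Symbol we thought 1-in-1000:  %.3f bits\n", bits(0.001);   # ~9.966 bits

# Twenty ideally-chosen yes/no questions carry twenty bits,
# enough to single out one object from among:
printf "2**20 = %d possibilities\n", 2 ** 20;                       # 1048576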
With that in hand, the idea is to start out with our hierarchical blob, and to iteratively, experimentally reshape it to match the object in a given image. At each cycle in the iteration, we will:
 1. Slightly perturb one control point of the glob;
 2. Render the perturbed glob;
 3. Compare the rendering to the target image;
 4. Keep the perturbation if the match improved, and undo it otherwise.
Very simple-minded, but an easy toe in the water to begin experimenting with this idea.
Implementation:
First, here is a teeny Perl script to test how "close" two images are to being identical. It subtracts them using the ImageMagick 'composite' program, and then bzip2's the result (that being my notion of a quick-and-dirty way of squeezing out uninteresting redundancy and arriving at something closer to the true information content) and finally takes the length in bytes of the compressed difference file:
#!/usr/bin/perl -w
use strict;

my $picA = shift;
my $picB = shift;

`/usr/bin/composite -compose Difference $picA $picB xxx.tif`;
`bzip2 xxx.tif`;

my $ls = `ls -l xxx.tif.bz2`;
my ($len_in_bytes) = ($ls =~ /\d+\D+(\d+)/);
print "$len_in_bytes\n";

`rm xxx.tif.bz2`;
Secondly, I tweaked my 'glob' program (see basic globs) to read and write a state file containing its control points, to allow us to easily mutate a glob, using --savefile=foo.glob and --loadfile=foo.glob command-line switches.
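For what it's worth, the state file is nothing fancy: the control-point parameters appear as key=value lines, one per line. A made-up fragment to fix ideas -- the key names here are purely illustrative, the only things the optimizer script below actually relies on being the key=value layout and the radius-like parameters matching Layer<n>...r:

Layer0.point00.r=1.000000
Layer0.point00.x=0.000000
Layer0.point01.r=0.850000

(Hypothetical names, real layout.)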
Finally, I wrote a simple little Perl script to implement the above-sketched mutate-render-compare cycle:
#!/usr/bin/perl -w
use strict;

my %parameter    = ();
my $save_glob    = '';
my $glob_version = 0;
my $wins         = 0;
my $losses       = 0;
my $keys         = 0;

read_glob();
my $last_fom = figure_of_merit();

for my $layer (0..2) {
    print "\$layer $layer\n";
    optimize_layer( $layer );
}

print "Done!\n";
exit(0);

# Render the current glob and score it: the figure of merit is the
# compressed size of the difference between rendering and target.
sub figure_of_merit {
    print "fom writing glob.save...\n";
    write_glob( $glob_version );
    print "fom running glob...\n";
    `./glob --loadfile=save$glob_version.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 test.rib`;
    print "fom running rendrib...\n";
    `rendrib test.rib`;
    print "fom diffing the images...\n";
    my $fom = `../bin/pic_diff_in_bits buddha-hacked-small.tif test.tif`;
    ++$glob_version;
    print "fom: \$fom = $fom\n";
    return $fom;
}

sub optimize_parameter {
    my ($key) = @_;
    print "optimize_parameter optimizing $key\n";
    ++$keys;
    my $val = $parameter{$key};
    for my $correction (1.0, 0.5, 0.25, 0.12, 0.06, 0.03, 0.015, 0.007, 0.003, 0.001) {
        for my $pass (0..1) {
            my $newval = $val + (($pass==0) ? $correction : -$correction);
            update_glob( $key, $newval );
            my $fom   = figure_of_merit();
            my $delta = $last_fom - $fom;
            print "optimize_parameter: Changing $key from $val to $newval buys us $delta wins $wins losses $losses keys $keys\n";
            return if $delta == 0;
            if ($fom >= $last_fom) {
                update_glob( $key, $val );
                ++$losses;
            } else {
                $last_fom = $fom;
                $parameter{$key} = $newval;
                $val = $newval;
                ++$wins;
                last;
            }
        }
    }
}

sub optimize_layer {
    my ($layer) = @_;
    print "optimize_layer: \$layer $layer\n";
    for my $key (keys %parameter) {
        print "optimize_layer: \$key $key\n";
        # For now, at least skip all but the 'r' parameters:
        if ($key =~ m/^Layer$layer\..+r$/) {
            optimize_parameter( $key );
        }
    }
}

sub read_glob {
    open IN, "save.glob" or die "Couldn't read 'save.glob'";
    while (<IN>) {
        $save_glob .= $_;
        if ($_ =~ /^([^=]+)=([^=]+)\n$/) {
            $parameter{$1} = $2;
        }
    }
    close IN;
}

sub update_glob {
    my ($key,$val) = @_;
    $save_glob =~ s/$key=.*\n/$key=$val\n/;
}

sub write_glob {
    my ($i) = @_;
    open OUT, ">save$i.glob" or die "Couldn't create 'save$i.glob'";
    print OUT $save_glob;
    close OUT;
}

sub dump_parameters {
    for my $key (keys %parameter) {
        my $val = $parameter{$key};
        print "parameter $key = $val\n";
    }
}
As usual, click on the thumbnails to see the fullsize images.
As set-up, I poked around the internet and found a nice,
roundish image of a buddha face, which I cleaned up a bit
in GIMP. It had some red leaves obscuring parts of the face,
traces of which can still be seen:
[Thumbnail: the cleaned-up target image of the buddha face.]
Here is the initial glob "approximation" to the target
image, set up purely by hand. The speckling somewhat resembles
the texture on the target image, but purely by coincidence: It
is due to having the renderer under-sample (for faster rendering) the diffuse light
source I set up to model the diffuse lighting in the target image:
[Thumbnail: the initial, hand-set glob approximation.]
(./glob --loadfile=save0.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 shape-from-shading-0.rib.)
Here is the glob after tuning each of the control points in the
coarse, first-layer 4x4 bicubic mesh according to the above
procedure. This process involved 41 successful (accepted) mutations and
215 unsuccessful (rejected) mutations. If you grind through the
arithmetic, you'll find that the successes carry about two and a half
bits of information each, the rejects carry about a fifth of a
bit of information each, and that the following approximation is
based on a total of about 148 bits of information about the
shape of Buddha's face:
[Thumbnail: the glob after tuning the coarse 4x4 layer.]
(./glob --loadfile=save246.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 shape-from-shading-1.rib.)
Here is the glob after similarly tuning each of the control points
in the finer, second-layer, 8x8 bicubic mesh. At this point we
have had 89 accepted mutations and 1298 rejected mutations
carrying a total of 477 bits of information about the target image:
[Thumbnail: the glob after tuning the 8x8 layer.]
(./glob --loadfile=save1367.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 shape-from-shading-2.rib.)
Finally, here is the glob after similarly tuning each of the control points in the still finer, third-layer, 16x16 bicubic mesh. At this point we have had 129 accepted mutations and 6067 rejected mutations carrying a total of about 904.74 bits of information about the target image -- a bit less than the 'Buddha in a Kilobit' of the title. :) That is approximately the same information content as this paragraph.
(Claude Shannon did a classic "Twenty-questions" style experiment showing that English text contains somewhat less than two bits of information per letter.)
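If you want to grind through that arithmetic yourself, here is one way to do the bookkeeping in a few lines of Perl -- an after-the-fact reconstruction, not code from the run itself: treat each accepted mutation as a symbol of probability wins/(wins+losses), each rejected mutation as a symbol of probability losses/(wins+losses), and sum the log2(1/p) contributions. Fed the cumulative win/loss counts above, it reproduces the later totals almost exactly; the first-layer figure is more sensitive to rounding.

#!/usr/bin/perl -w
# Bookkeeping sketch: accepted and rejected mutations as symbols with
# probabilities wins/total and losses/total, summing log2(1/p) bits.
use strict;

sub log2 { return log($_[0]) / log(2); }

sub model_bits {
    my ($wins, $losses) = @_;
    my $total = $wins + $losses;
    return $wins   * log2( $total / $wins   )    # bits per accepted mutation
         + $losses * log2( $total / $losses );   # bits per rejected mutation
}

# Cumulative accepted/rejected counts from the three passes above:
printf "4x4   layer: %6.1f bits\n", model_bits(  41,  215 );
printf "8x8   layer: %6.1f bits\n", model_bits(  89, 1298 );   # ~477 bits
printf "16x16 layer: %6.1f bits\n", model_bits( 129, 6067 );   # ~905 bits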
I am a bit disappointed that neither the nose nor the chin has yet started
to emerge, but the general outline and the earlobes have come out
very nicely:
[Thumbnail: the glob after tuning the 16x16 layer.]
(./glob --loadfile=save6199.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 shape-from-shading-3.rib.)
In principle we could continue in this vein, going to 32x32 and then 64x64 meshes and seeing steadily more detail emerge. In practice, though, the above run took several days to complete, a 32x32 mesh run would take something like a fortnight, and a 64x64 mesh run would take about two months. I'd rather speed up the code and do a new run than sit and wait that long. :)
What are the inefficiencies making the above run take so long?
The big one is that each mutate-render-compare cycle perturbs just one control point and then judges it against the entire image, so each (expensive) rendering buys us at most a fraction of a bit of information about the model. We would do far better to perturb many control points per cycle and then accept or reject each change individually, based only on the part of the image it actually affects. One cute way to do this involves doing two renderings per cycle:
The first is the conventional image rendering as above, with a physically plausible lighting model &tc.
In the second rendering, we shade each bicubic patch in a constant, unique color. The resulting image forms a map which tells us, for each pixel in the primary image, which bicubic patch in the model we are seeing. (Conventional raytracing would be another way of obtaining the same information, but this trick is easier when one doesn't have access to the raytracer internals.)
We know the 16 control points shaping each bicubic patch, so this information narrows down the control points to blame/credit for the decrease/increase in mis/match at this pixel to 16-32 points: We can thus accept/reject the changes made to these points using local information, largely independent of everything happening at other points in the image.
The net result of this hack is that the amount of useful information we get per render cycle stays roughly proportional to the number of bicubic patches rendered as we move to steadily finer levels of detail, and thus that the asymptotic time to fit an N-bit model drops from cubic to linear, in computer jargon. In plain English, it means that instead of the procedure rapidly becoming impractical for images with significant levels of detail, it should remain equally efficient and practical to essentially arbitrary levels of image complexity.
The speedup yielded by this improved approach relative to the above approach should be at least proportional to the mesh size: x16 for a 4x4 mesh, x64 for an 8x8 mesh, and so forth, further reducing the above four-day run to a matter of minutes, or perhaps even seconds.
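To make that bookkeeping concrete, here is a rough sketch of the decode-the-ID-image step. The assumptions are all mine: that the second rendering paints patch N in the flat 24-bit color N, that the images are 8 bits per channel, and that ImageMagick's convert is available to turn the TIFFs into plain-text PPM; the file names are hypothetical.

#!/usr/bin/perl -w
# Sketch: credit per-pixel mismatch to individual bicubic patches via
# a patch-ID rendering.  Assumes patch N was rendered in the flat
# 24-bit color N, 8 bits per channel.  File names are hypothetical.
use strict;

# Read an image as plain-text PPM via ImageMagick; return (w, h, pixels).
sub read_ppm {
    my ($file) = @_;
    my $ppm = `convert $file -compress none ppm:-`;   # -compress none => ASCII PPM
    $ppm =~ s/#[^\n]*//g;                             # strip any comment lines
    my ($magic, $w, $h, $max, @pix) = split ' ', $ppm;
    return ($w, $h, \@pix);                           # @pix is r,g,b,r,g,b,...
}

my ($w,  $h,  $target) = read_ppm( 'buddha-hacked-small.tif' );  # target image
my ($rw, $rh, $render) = read_ppm( 'test.tif'     );             # primary rendering
my ($iw, $ih, $ids)    = read_ppm( 'test-ids.tif' );             # patch-ID rendering

my %mismatch;   # patch index -> summed per-pixel |difference|
for (my $i = 0;  $i < $w * $h;  ++$i) {
    my $o     = 3 * $i;
    my $patch = $ids->[$o] * 65536  +  $ids->[$o+1] * 256  +  $ids->[$o+2];
    for my $c (0 .. 2) {
        $mismatch{$patch} += abs( $target->[$o+$c] - $render->[$o+$c] );
    }
}

# Each patch (hence each little set of control points) can now be
# accepted or rejected on its own local evidence:
for my $patch (sort { $a <=> $b } keys %mismatch) {
    print "patch $patch: mismatch $mismatch{$patch}\n";
}

From there, deciding which perturbed control points to keep is mostly a matter of remembering which of them shape each patch.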
So, that's a sketch for a Shape From Shading II, when and if I get a yen for it!
There are, of course, lots and lots of cool computer vision issues completely ignored by the above little experiment and analysis. One thing at a time.
For anyone who wants to play, the source code is here.
Back to Cynbe's "Silly RenderMan Tricks" Page.