This one isn't going to impress anyone but me, but it is something I've been wanting to try for years.
One of the motivating ideas is that a high-quality renderer can be an effective tool for image analysis as well as image generation. A Renderman-class photorealistic renderer can be thought of as a system for reasoning about the optical consequences of a particular arrangement of lights and objects. By forming a hypothesis of what the scene looks like and then rendering it and comparing to the input image, we can form some idea of how plausible our hypothesis is, taking into account shadows and indirect lighting and so forth to whatever degree of realism supported by the renderer.
Another motivating idea is that by analysing a scene in this way we can gain a reasonably accurate idea of the information content of our model, something which is otherwise difficult to do. Explaining this idea will require a detour into a tuppence lecture on the elementary application of Claude Shannon's information theory. (Since I am not a mathematician and haven't the foggiest clue about the serious mathematics of information theory, elementary applications of its simplest results are as far as I can take the discussion...) Feel free to skip this section:
Shannon defines a 'bit' to be the amount of information required to answer a yes/no question which we were initially completely uncertain about: The answer could just as easily be 'yes' as 'no', so far as we know. "Information" in this sense is a measure of the decrease in our ignorance of the world.
It is essential to distinguish between this information-theoretic meaning of "bit" and the colloquial computer meaning of "bit" as a unit of storage. They are very different, and it is most unfortunate that they wound up using the same word, but all attempts to change the terminology have failed, so we're stuck with it.
To bring the difference into clear focus, note that an empty recordable CD, containing nothing but zeros, has about six billion bits of storage but zero bits of information: Since we know ahead of time that all bits are zero, we learn nothing from examining any of them: Our ignorance of the universe is decreased not at all by reading the contents of the disk.
Shannon's information theory is usually cast in the form of an information stream much like an old-fashioned tickertape: A stream consisting of an indefinitely long sequence of symbols drawn from a fixed, finite "alphabet", emerging one at a time from some source. Depending on the application, the "alphabet" might consist of just the symbols 0 and 1, or it might be the ASCII character set encoding email in English, or it might be bytes of information being received from the Pioneer 10 spacecraft.
The basic question addressed is: How many bits of information does such a symbol carry, in general?
Shannon's answer: It carries log2( 1 / p ) bits of information, where p is the a priori probability of seeing that symbol appear at that point in the stream.
So if p is nearly 1.0 -- we were virtually certain that that symbol would appear next -- then the number of bits of information conveyed by it is nearly zero, as expected: It didn't tell us anything we didn't already know.
Conversely, if p is nearly 0.0 -- we were virtually certain that that symbol couldn't possibly appear next -- then the number of bits of information conveyed by it is enormous, as expected: The symbol told us that our understanding of the data stream pattern was very, very wrong, and thus taught us a great deal.
The final useful idea we need from elementary information theory is that such an information stream carries the maximum amount of information when the different symbols are equally probable. For example, if we are playing "Twenty Questions", trying to guess an object by asking twenty yes-or-no questions about it, then information theory tells us that we do best to ask questions which are equally likely to be answered "No" or "Yes": Otherwise, we are wasting bandwidth, learning less than we might per question, and reducing our chances of winning. (Simple arithmetic tells us that twenty such questions allow us to pick out one object from among at most a million possibilities -- suggesting that the world of the children who play this game contains roughly a million psychologically significant objects.)
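If you'd rather see those two results with numbers attached than take them on faith, a few lines of Perl will do it. (The probabilities below are arbitrary examples of mine, nothing more.)

#!/usr/bin/perl -w
# Toy illustration of Shannon's log2(1/p) measure of information.
use strict;

sub bits { my ($p) = @_; return log(1/$p) / log(2); }   # log2(1/p) bits

printf "Symbol we were 90%% sure of:   %.3f bits\n", bits(0.9);     # ~0.152 bits
printf "Fair coin flip:               %.3f bits\n", bits(0.5);     # exactly 1 bit
printf "Symbol we thought 1-in-1000:  %.3f bits\n", bits(0.001);   # ~9.966 bits

# Twenty ideally-chosen yes/no questions carry twenty bits,
# enough to single out one object from among:
printf "2**20 = %d possibilities\n", 2 ** 20;                       # 1048576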
With that in hand, the idea is to start out with our hierarchical blob, and to iteratively, experimentally reshape it to match the object in a given image. At each cycle in the iteration, we will:
 1. Slightly perturb one control point of the glob;
 2. Render the perturbed glob;
 3. Compare the rendering to the target image;
 4. Keep the perturbation if the match improved, and undo it otherwise.
Very simple-minded, but an easy toe in the water to begin experimenting with this idea.
Implementation:
First, here is a teeny Perl script to test how "close" two images are to being identical. It subtracts them using the ImageMagick 'composite' program, and then bzip2's the result (that being my notion of a quick-and-dirty way of squeezing out uninteresting redundancy and arriving at something closer to the true information content) and finally takes the length in bytes of the compressed difference file:
#!/usr/bin/perl -w
use strict;

my $picA = shift;
my $picB = shift;

`/usr/bin/composite -compose Difference $picA $picB xxx.tif`;
`bzip2 xxx.tif`;

my $ls = `ls -l xxx.tif.bz2`;
my ($len_in_bytes) = ($ls =~ /\d+\D+(\d+)/);
print "$len_in_bytes\n";

`rm xxx.tif.bz2`;
Secondly, I tweaked my 'glob' program (see basic globs) to read and write a state file containing its control points, to allow us to easily mutate a glob, using --savefile=foo.glob and --loadfile=foo.glob command-line switches.
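For what it's worth, the state file is nothing fancy: the control-point parameters appear as key=value lines, one per line. A made-up fragment to fix ideas -- the key names here are purely illustrative, the only things the optimizer script below actually relies on being the key=value layout and the radius-like parameters matching Layer<n>...r:

Layer0.point00.r=1.000000
Layer0.point00.x=0.000000
Layer0.point01.r=0.850000

(Hypothetical names, real layout.)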
Finally, I wrote a simple little Perl script to implement the above-sketched mutate-render-compare cycle:
#!/usr/bin/perl -w
use strict;

my %parameter    = ();
my $save_glob    = '';
my $glob_version = 0;
my $wins         = 0;
my $losses       = 0;
my $keys         = 0;

read_glob();
my $last_fom = figure_of_merit();

for my $layer (0..2) {
    print "\$layer $layer\n";
    optimize_layer( $layer );
}

print "Done!\n";
exit(0);

# Render the current glob and score it: the figure of merit is the
# compressed size of the difference between rendering and target.
sub figure_of_merit {
    print "fom writing glob.save...\n";
    write_glob( $glob_version );
    print "fom running glob...\n";
    `./glob --loadfile=save$glob_version.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 test.rib`;
    print "fom running rendrib...\n";
    `rendrib test.rib`;
    print "fom diffing the images...\n";
    my $fom = `../bin/pic_diff_in_bits buddha-hacked-small.tif test.tif`;
    ++$glob_version;
    print "fom: \$fom = $fom\n";
    return $fom;
}

sub optimize_parameter {
    my ($key) = @_;
    print "optimize_parameter optimizing $key\n";
    ++$keys;
    my $val = $parameter{$key};
    for my $correction (1.0, 0.5, 0.25, 0.12, 0.06, 0.03, 0.015, 0.007, 0.003, 0.001) {
        for my $pass (0..1) {
            my $newval = $val + (($pass==0) ? $correction : -$correction);
            update_glob( $key, $newval );
            my $fom   = figure_of_merit();
            my $delta = $last_fom - $fom;
            print "optimize_parameter: Changing $key from $val to $newval buys us $delta wins $wins losses $losses keys $keys\n";
            return if $delta == 0;
            if ($fom >= $last_fom) {
                update_glob( $key, $val );
                ++$losses;
            } else {
                $last_fom = $fom;
                $parameter{$key} = $newval;
                $val = $newval;
                ++$wins;
                last;
            }
        }
    }
}

sub optimize_layer {
    my ($layer) = @_;
    print "optimize_layer: \$layer $layer\n";
    for my $key (keys %parameter) {
        print "optimize_layer: \$key $key\n";
        # For now, at least skip all but the 'r' parameters:
        if ($key =~ m/^Layer$layer\..+r$/) {
            optimize_parameter( $key );
        }
    }
}

sub read_glob {
    open IN, "save.glob" or die "Couldn't read 'save.glob'";
    while (<IN>) {
        $save_glob .= $_;
        if ($_ =~ /^([^=]+)=([^=]+)\n$/) {
            $parameter{$1} = $2;
        }
    }
    close IN;
}

sub update_glob {
    my ($key,$val) = @_;
    $save_glob =~ s/$key=.*\n/$key=$val\n/;
}

sub write_glob {
    my ($i) = @_;
    open OUT, ">save$i.glob" or die "Couldn't create 'save$i.glob'";
    print OUT $save_glob;
    close OUT;
}

sub dump_parameters {
    for my $key (keys %parameter) {
        my $val = $parameter{$key};
        print "parameter $key = $val\n";
    }
}
As usual, click on the thumbnails to see the fullsize images.
As set-up, I poked around the internet and found a nice,
roundish image of a buddha face, which I cleaned up a bit
in GIMP. It had some red leaves obscuring parts of the face,
traces of which can still be seen:
[Thumbnail: the cleaned-up target image of the buddha face.]
Here is the initial glob "approximation" to the target
image, set up purely by hand. The speckling somewhat resembles
the texture on the target image, but purely by coincidence: It
is due to having the renderer under-sample (for faster rendering) the diffuse light
source I set up to model the diffuse lighting in the target image:
[Thumbnail: the initial, hand-set glob approximation.]
(./glob --loadfile=save0.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 shape-from-shading-0.rib.)
Here is the glob after tuning each of the control points in the
coarse, first-layer 4x4 bicubic mesh according to the above
procedure. This process involved 41 successful (accepted) mutations and
215 unsuccessful (rejected) mutations. If you grind through the
arithmetic, you'll find that the successes carry about two and a half
bits of information each, the rejects carry about a fifth of a
bit of information each, and that the following approximation is
based on a total of about 148 bits of information about the
shape of Buddha's face:
[Thumbnail: the glob after tuning the coarse 4x4 layer.]
(./glob --loadfile=save246.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 shape-from-shading-1.rib.)
Here is the glob after similarly tuning each of the control points
in the finer, second-layer, 8x8 bicubic mesh. At this point we
have had 89 accepted mutations and 1298 rejected mutations
carrying a total of 477 bits of information about the target image:
[Thumbnail: the glob after tuning the 8x8 layer.]
(./glob --loadfile=save1367.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 shape-from-shading-2.rib.)
Finally, here is the glob after similarly tuning each of the control points in the still finer, third-layer, 16x16 bicubic mesh. At this point we have had 129 accepted mutations and 6067 rejected mutations carrying a total of about 904.74 bits of information about the target image -- a bit less than the 'Buddha in a Kilobit' of the title. :) That is approximately the same information content as this paragraph.
(Claude Shannon did a classic "Twenty-questions" style experiment showing that English text contains somewhat less than two bits of information per letter.)
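If you want to grind through that arithmetic yourself, here is one way to do the bookkeeping in a few lines of Perl -- an after-the-fact reconstruction, not code from the run itself: treat each accepted mutation as a symbol of probability wins/(wins+losses), each rejected mutation as a symbol of probability losses/(wins+losses), and sum the log2(1/p) contributions. Fed the cumulative win/loss counts above, it reproduces the later totals almost exactly; the first-layer figure is more sensitive to rounding.

#!/usr/bin/perl -w
# Bookkeeping sketch: accepted and rejected mutations as symbols with
# probabilities wins/total and losses/total, summing log2(1/p) bits.
use strict;

sub log2 { return log($_[0]) / log(2); }

sub model_bits {
    my ($wins, $losses) = @_;
    my $total = $wins + $losses;
    return $wins   * log2( $total / $wins   )    # bits per accepted mutation
         + $losses * log2( $total / $losses );   # bits per rejected mutation
}

# Cumulative accepted/rejected counts from the three passes above:
printf "4x4   layer: %6.1f bits\n", model_bits(  41,  215 );
printf "8x8   layer: %6.1f bits\n", model_bits(  89, 1298 );   # ~477 bits
printf "16x16 layer: %6.1f bits\n", model_bits( 129, 6067 );   # ~905 bits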
I am a bit disappointed that neither the nose nor the chin has yet started
to emerge, but the general outline and the earlobes have come out
very nicely:
[Thumbnail: the glob after tuning the 16x16 layer.]
(./glob --loadfile=save6199.glob --color=1.0,1.0,0.9 --matte --lumpy=0.0 --nofloor --nofill --lightradius=400.0 shape-from-shading-3.rib.)
In principle we could continue in this vein, going to 32x32 and then 64x64 meshes and seeing steadily more detail emerge. In practice, though, the above run took several days to complete, a 32x32 mesh run would take something like a fortnight, and a 64x64 mesh run would take about two months. I'd rather speed up the code and do a new run than sit and wait that long. :)
What are the inefficiencies making the above run take so long?
The big one is that each mutate-render-compare cycle perturbs just one control point and then judges it against the entire image, so each (expensive) rendering buys us at most a fraction of a bit of information about the model. We would do far better to perturb many control points per cycle and then accept or reject each change individually, based only on the part of the image it actually affects. One cute way to do this involves doing two renderings per cycle:
The first is the conventional image rendering as above, with a physically plausible lighting model &tc.
In the second rendering, we shade each bicubic patch in a constant, unique color. The resulting image forms a map which tells us, for each pixel in the primary image, which bicubic patch in the model we are seeing. (Conventional raytracing would be another way of obtaining the same information, but this trick is easier when one doesn't have access to the raytracer internals.)
We know the 16 control points shaping each bicubic patch, so this information narrows down the control points to blame/credit for the decrease/increase in mis/match at this pixel to 16-32 points: We can thus accept/reject the changes made to these points using local information, largely independent of everything happening at other points in the image.
The net result of this hack is that the amount of useful information we get per render cycle stays roughly proportional to the number of bicubic patches rendered as we move to steadily finer levels of detail, and thus that the asymptotic time to fit an N-bit model drops from cubic to linear, in computer jargon. In plain English, it means that instead of the procedure rapidly becoming impractical for images with significant levels of detail, it should remain equally efficient and practical to essentially arbitrary levels of image complexity.
The speedup yielded by this improved approach relative to the above approach should be at least proportional to the mesh size: x16 for a 4x4 mesh, x64 for an 8x8 mesh, and so forth, further reducing the above four-day run to a matter of minutes, or perhaps even seconds.
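To make that bookkeeping concrete, here is a rough sketch of the decode-the-ID-image step. The assumptions are all mine: that the second rendering paints patch N in the flat 24-bit color N, that the images are 8 bits per channel, and that ImageMagick's convert is available to turn the TIFFs into plain-text PPM; the file names are hypothetical.

#!/usr/bin/perl -w
# Sketch: credit per-pixel mismatch to individual bicubic patches via
# a patch-ID rendering.  Assumes patch N was rendered in the flat
# 24-bit color N, 8 bits per channel.  File names are hypothetical.
use strict;

# Read an image as plain-text PPM via ImageMagick; return (w, h, pixels).
sub read_ppm {
    my ($file) = @_;
    my $ppm = `convert $file -compress none ppm:-`;   # -compress none => ASCII PPM
    $ppm =~ s/#[^\n]*//g;                             # strip any comment lines
    my ($magic, $w, $h, $max, @pix) = split ' ', $ppm;
    return ($w, $h, \@pix);                           # @pix is r,g,b,r,g,b,...
}

my ($w,  $h,  $target) = read_ppm( 'buddha-hacked-small.tif' );  # target image
my ($rw, $rh, $render) = read_ppm( 'test.tif'     );             # primary rendering
my ($iw, $ih, $ids)    = read_ppm( 'test-ids.tif' );             # patch-ID rendering

my %mismatch;   # patch index -> summed per-pixel |difference|
for (my $i = 0;  $i < $w * $h;  ++$i) {
    my $o     = 3 * $i;
    my $patch = $ids->[$o] * 65536  +  $ids->[$o+1] * 256  +  $ids->[$o+2];
    for my $c (0 .. 2) {
        $mismatch{$patch} += abs( $target->[$o+$c] - $render->[$o+$c] );
    }
}

# Each patch (hence each little set of control points) can now be
# accepted or rejected on its own local evidence:
for my $patch (sort { $a <=> $b } keys %mismatch) {
    print "patch $patch: mismatch $mismatch{$patch}\n";
}

From there, deciding which perturbed control points to keep is mostly a matter of remembering which of them shape each patch.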
So, that's a sketch for a Shape From Shading II, when and if I get a yen for it!
There are, of course, lots and lots of cool computer vision issues completely ignored by the above little experiment and analysis. One thing at a time.
For anyone who wants to play, the source code is here.
Back to Cynbe's "Silly RenderMan Tricks" Page.