General jargon will be covered later, but a few words and
concepts must be mastered at the outset:
- history:
SML/NJ was originally written by David MacQueen and Andrew Appel
starting in 1986. That's just shy of twenty years ago, which is
time for considerable water to have passed under the bridge, for
two new language standards, and for various changes in direction
of the implementation. This means that the compiler contains
a fair number of historical artifacts -- things that would
probably be done differently today if one were starting from
scratch.
- static vs dynamic:
The Definition of Standard ML makes a hard and fast distinction
between compiletime and runtime semantics.
Compiletime stuff
-- mostly typechecking -- is called static and runtime stuff is called dynamic.
Maintaining this clean phase separation is central to the
design of the language and its various follow-on extensions,
in particular to the module system.
In general, the compiler uses separate code and datastructures
for the two, which means that they can be studied and understood
separately.
- Elaboration:
Type-checking in SML is called elaboration. (The intuition
is that you are taking the raw syntax tree from the parser and
adding to it elaborate decorations in the form of deduced type
information.)
Type elaboration in SML is a deductive process
based on the "unification" operation core to Prolog and logic
programming. It is much more powerful and convenient to the
application programmer than vanilla C-style typechecking. It
is also correspondingly harder to understand theoretically
and implement practically in a compiler.
- Tycon:
Object-oriented languages have runtime functions called constructors
which take a few arguments and return a newly created object. In
vaguely analogous fashion, SML has compiletime functions
called type constructors (usually called just
tycons) which take a few type arguments and return
a newly created type. For example, the tycon list,
when applied to type int returns a list-of-ints
type.
- Environments:
Unix programs have "environments" consisting of string-pairs
like "HOME=/home/cynbe" "SHELL=/bin/tcsh" and such. Such
environments are essentially simple symbol tables mapping
variables to their current values. SML/NJ uses the word
"environment" in this sense, and pervasively refers to all
symbol tables and similar datastructures as environments.
- Static vs Dynamic Environments:
Following the above-described phase distinction between
static and dynamic semantics, the SML/NJ compiler maintains
separate static environment (usually just statenv)
and dynamic environment (usually just dynenv)
symbol tables.
The statenv contains the type-checking information
and is the one by far most referenced, particularly during
elaboration.
The dynenv is not actually used at runtime, but
contains information which is conceptually used at runtime.
(If SML/NJ were an interpreter rather than a compiler, it
might actually use the dynenv at runtime.) In
particular, it contains information such as the offset
of fields within records, and the assignments of variables
to particular registers or memory locations.
- All Gaul was divided into three parts:
The compiler most generally breaks down into three phases:
- The front end, which does lexing, parsing and type elaboration.
- The "middle end", which does machine-independent optimizations.
- The back end, which does machine-dependent optimizations and
code generation.
- front end:
The modern SML/NJ front end is closer in spirit to
the original implementation than the other two parts of the
compiler. It has become gradually more purely functional over
time. It is maintained by
David MacQueen (UChicago),
Andrew Appel (Princeton),
John Reppy (UChicago)
and
Matthias Blume (TTI at UChicago), among others.
The lexer is automatically generated by ml-lex,
an ML-specific tool much like standard Unix flex.
The
parser is automatically generated by ml-yacc, a matching
ML-specific tool closely patterned on the standard Unix yacc.
It produces a concrete syntax tree of type ast (for
"abstract syntax tree", even though it is concrete).
The elaborator then converts the ast into a true
abstract syntax tree of type absyn, as well as type-checking.
- middle end:
The SML/NJ middle end has evolved into a separate project,
the Flint Project,
maintained by a Yale team led by
Zhong Shao
Most transformations are done in a CPS ("continuation passing style")
internal representation. NB: Andrew Appel, co-author of SML/NJ,
has a number of compiler books out, including Compiling
with Continuations and Modern Compiler Implementation in
ML
- back end:
The original SML/NJ back end has been replaced by a the
MLRISC
retargetable back end, which is maintained by
Lal George
and (especially recently) NYU's Allen
Leung, and also used in a number of
other compilers.
Back to