theory Foundations
imports Introduction
begin
section ‹Code generation foundations \label{sec:foundations}›
subsection ‹Code generator architecture \label{sec:architecture}›
text ‹
The code generator is actually a framework consisting of different
components which can be customised individually.

Conceptually all components operate on Isabelle's logic framework
Pure.  In practice, the object logic HOL
provides the necessary facilities to make use of the code generator,
mainly since it is an extension of Isabelle/Pure.

The constellation of the different components is visualised in the
following figure.
\begin{figure}[h]
\def\sys#1{\emph{#1}}
\begin{tikzpicture}[x = 4cm, y = 1cm]
\tikzstyle positive=[color = black, fill = white];
\tikzstyle negative=[color = white, fill = black];
\tikzstyle entity=[rounded corners, draw, thick];
\tikzstyle process=[ellipse, draw, thick];
\tikzstyle arrow=[-stealth, semithick];
\node (spec) at (0, 3) [entity, positive] {specification tools};
\node (user) at (1, 3) [entity, positive] {user proofs};
\node (spec_user_join) at (0.5, 3) [shape=coordinate] {};
\node (raw) at (0.5, 4) [entity, positive] {raw code equations};
\node (pre) at (1.5, 4) [process, positive] {preprocessing};
\node (eqn) at (2.5, 4) [entity, positive] {code equations};
\node (iml) at (0.5, 0) [entity, positive] {intermediate program};
\node (seri) at (1.5, 0) [process, positive] {serialisation};
\node (SML) at (2.5, 3) [entity, positive] {\sys{SML}};
\node (OCaml) at (2.5, 2) [entity, positive] {\sys{OCaml}};
\node (Haskell) at (2.5, 1) [entity, positive] {\sys{Haskell}};
\node (Scala) at (2.5, 0) [entity, positive] {\sys{Scala}};
\draw [semithick] (spec) -- (spec_user_join);
\draw [semithick] (user) -- (spec_user_join);
\draw [-diamond, semithick] (spec_user_join) -- (raw);
\draw [arrow] (raw) -- (pre);
\draw [arrow] (pre) -- (eqn);
\draw [arrow] (eqn) -- node (transl) [process, positive] {translation} (iml);
\draw [arrow] (iml) -- (seri);
\draw [arrow] (seri) -- (SML);
\draw [arrow] (seri) -- (OCaml);
\draw [arrow] (seri) -- (Haskell);
\draw [arrow] (seri) -- (Scala);
\end{tikzpicture}
\caption{Code generator architecture}
\label{fig:arch}
\end{figure}
\noindent Central to code generation is the notion of \emph{code
equations}.  A code equation, as a first approximation, is a theorem
of the form ‹f t⇩1 t⇩2 … t⇩n ≡ t› (an equation headed by a
constant ‹f› with arguments ‹t⇩1 t⇩2 … t⇩n› and right
hand side ‹t›).
\begin{itemize}
\item Starting point of code generation is a collection of (raw)
code equations in a theory. It is not relevant where they stem
from, but typically they were either produced by specification
tools or proved explicitly by the user.
\item These raw code equations can be subjected to theorem
transformations. This \qn{preprocessor} (see
\secref{sec:preproc}) can apply the full expressiveness of
ML-based theorem transformations to code generation. The result
of preprocessing is a structured collection of code equations.
\item These code equations are \qn{translated} to a program in an
abstract intermediate language. Think of it as a kind of
\qt{Mini-Haskell} with four \qn{statements}: ‹data› (for
datatypes), ‹fun› (stemming from code equations), also
‹class› and ‹inst› (for type classes).
\item Finally, the abstract program is \qn{serialised} into
concrete source code of a target language. This step only
produces concrete syntax but does not change the program in
essence; all conceptual transformations occur in the translation
step.
\end{itemize}
\noindent From these steps, only the last two are carried out
outside the logic; by keeping this layer as thin as possible, the
amount of code to trust is kept to a minimum.
›
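text ‹
\noindent For a first impression of the overall tool chain, code for a
constant can be serialised to any of the four target languages directly.
The following invocation is merely illustrative and uses \<^const>‹dequeue›
from the introductory example; the ‹module_name› clause only fixes the
name of the resulting module:
›

export_code %quote dequeue in Haskell module_name Example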
subsection ‹The pre- and postprocessor \label{sec:preproc}›
text ‹
Before selected function theorems are turned into abstract code, a
chain of definitional transformation steps is carried out:
\emph{preprocessing}. The preprocessor consists of two
components: a \emph{simpset} and \emph{function transformers}.

The preprocessor simpset has a counterpart, the
\emph{postprocessor simpset}.
In the theory-to-code scenario depicted in the figure
above, it plays no role.  But if generated code is used
to evaluate expressions (cf.~\secref{sec:evaluation}), the
postprocessor simpset is applied to the resulting expression before it
is turned back into the logic.

The pre- and postprocessor \emph{simpsets} can apply the
full generality of the Isabelle
simplifier. Due to the interpretation of theorems as code
equations, rewrites are applied to the right hand side and the
arguments of the left hand side of an equation, but never to the
constant heading the left hand side.

Pre- and postprocessor can be set up to transfer between
expressions suitable for logical reasoning and expressions
suitable for execution.  As an example, take list membership; logically
it is expressed as \<^term>‹x ∈ set xs›.  But for execution
the intermediate set is not desirable.  Hence the following
specification:
›
definition %quote member :: "'a list ⇒ 'a ⇒ bool"
where
[code_abbrev]: "member xs x ⟷ x ∈ set xs"
text ‹
\noindent The \emph{@{attribute code_abbrev} attribute} declares
its theorem as a rewrite rule for the postprocessor and the symmetric
variant of its theorem as a rewrite rule for the preprocessor.  Together,
this has the effect that expressions @{thm (rhs) member_def [no_vars]}
are replaced by @{thm (lhs) member_def [no_vars]} in generated code, but
are turned back into @{thm (rhs) member_def [no_vars]} if generated
code is used for evaluation.

Rewrite rules for the pre- or postprocessor may be
declared independently using \emph{@{attribute code_unfold}}
or \emph{@{attribute code_post}} respectively (see the illustration
at the end of this subsection).

\emph{Function transformers} provide a very general
interface, transforming a list of function theorems to another list
of function theorems, provided that neither the heading constant nor
its type changes.  The \<^term>‹0::nat› / \<^const>‹Suc› pattern
used in theory ‹Code_Abstract_Nat› (see \secref{abstract_nat})
uses this interface.

\noindent The current setup of the pre- and postprocessor may be inspected
using the @{command_def print_codeproc} command. @{command_def
code_thms} (see \secref{sec:equations}) provides a convenient
mechanism to inspect the impact of a preprocessor setup on code
equations. Attribute @{attribute code_preproc_trace} allows
for low-level tracing:
›
declare %quote [[code_preproc_trace]]
declare %quote [[code_preproc_trace only: distinct remdups]]
declare %quote [[code_preproc_trace off]]
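text ‹
\noindent As an illustration of a preprocessor-only rewrite rule, an
existing theorem can be declared using @{attribute code_unfold}; the
following hypothetical declaration would replace occurrences of
\<^term>‹rev (rev xs)› by \<^term>‹xs› in code equations:
›

declare %quote rev_rev_ident [code_unfold]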
subsection ‹Understanding code equations \label{sec:equations}›
text ‹
As noted in \secref{sec:principle}, the notion of code equations is
vital to code generation.  Indeed most problems which occur in
practice can be resolved by an inspection of the underlying code
equations.

It is possible to exchange the default code equations for constants
by explicitly proving alternative ones:
›
lemma %quote [code]:
  "dequeue (AQueue xs []) =
    (if xs = [] then (None, AQueue [] [])
      else dequeue (AQueue [] (rev xs)))"
  "dequeue (AQueue xs (y # ys)) =
    (Some y, AQueue xs ys)"
  by (cases xs, simp_all) (cases "rev xs", simp_all)
text ‹
\noindent The annotation ‹[code]› is an ‹attribute›
which states that the given theorems should be considered as code
equations for a ‹fun› statement -- the corresponding constant
is determined syntactically. The resulting code:
›
text %quote ‹
@{code_stmts dequeue constant: dequeue (Haskell)}
›
text ‹
\noindent You may note that the equality test \<^term>‹xs = []› has
been replaced by the predicate \<^term>‹List.null xs›. This is due
to the default setup of the \qn{preprocessor}.

This possibility to select arbitrary code equations is the key
technique for program and datatype refinement (see
\secref{sec:refinement}).

Due to the preprocessor, there is a distinction between raw code
equations (before preprocessing) and code equations (after
preprocessing).

The former can be listed (among other data) using the @{command_def
print_codesetup} command.

The code equations after preprocessing are already a blueprint of
the generated program and can be inspected using the @{command
code_thms} command:
›
code_thms %quote dequeue
text ‹
\noindent This prints a table with the code equations for \<^const>‹dequeue›,
including \emph{all} code equations which those equations depend
on recursively.  These dependencies themselves can be visualised using
the @{command_def code_deps} command.
›
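text ‹
\noindent For the running example, such an inspection could look as
follows; the command visualises the dependency graph of the code
equations for \<^const>‹dequeue›:
›

code_deps %quote dequeue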
subsection ‹Equality›
text ‹
Implementation of equality deserves some attention.  Here is an example
function involving polymorphic equality:
›
primrec %quote collect_duplicates :: "'a list ⇒ 'a list ⇒ 'a list ⇒ 'a list" where
  "collect_duplicates xs ys [] = xs"
| "collect_duplicates xs ys (z#zs) = (if z ∈ set xs
    then if z ∈ set ys
      then collect_duplicates xs ys zs
      else collect_duplicates xs (z#ys) zs
    else collect_duplicates (z#xs) (z#ys) zs)"
text ‹
\noindent During preprocessing, the membership test is rewritten,
resulting in \<^const>‹List.member›, which itself performs an explicit
equality check, as can be seen in the corresponding ‹SML› code:
›
text %quote ‹
@{code_stmts collect_duplicates (SML)}
›
text ‹
\noindent Obviously, polymorphic equality is implemented the Haskell
way using a type class. How is this achieved? HOL introduces an
explicit class \<^class>‹equal› with a corresponding operation \<^const>‹HOL.equal› such that @{thm equal [no_vars]}. The preprocessing
framework does the rest by propagating the \<^class>‹equal› constraints
through all dependent code equations. For datatypes, instances of
\<^class>‹equal› are implicitly derived when possible. For other types,
you may instantiate ‹equal› manually like any other type class.
›
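text ‹
\noindent For comparison, in the corresponding ‹Haskell› code the type
class becomes visible as an explicit ‹Eq› constraint (the default setup
maps class \<^class>‹equal› to Haskell's ‹Eq›):
›

text %quote ‹
@{code_stmts collect_duplicates (Haskell)}
›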
subsection ‹Explicit partiality \label{sec:partiality}›
text ‹
Explicit partiality is caused by missing patterns
(in contrast to partiality caused by nontermination, which is covered
in \secref{sec:partial}).

Here is an example, again for amortised queues:
›
definition %quote strict_dequeue :: "'a queue ⇒ 'a × 'a queue" where
  "strict_dequeue q = (case dequeue q
    of (Some x, q') ⇒ (x, q'))"

lemma %quote strict_dequeue_AQueue [code]:
  "strict_dequeue (AQueue xs (y # ys)) = (y, AQueue xs ys)"
  "strict_dequeue (AQueue xs []) =
    (case rev xs of y # ys ⇒ (y, AQueue [] ys))"
  by (simp_all add: strict_dequeue_def) (cases xs, simp_all split: list.split)
text ‹
\noindent In the corresponding code, there is no equation
for the pattern \<^term>‹AQueue [] []›:
›
text %quote ‹
@{code_stmts strict_dequeue constant: strict_dequeue (Haskell)}
›
text ‹
\noindent In some cases it is desirable to state this
pseudo-\qt{partiality} more explicitly, e.g.~as follows:
›
axiomatization %quote empty_queue :: 'a

definition %quote strict_dequeue' :: "'a queue ⇒ 'a × 'a queue" where
  "strict_dequeue' q = (case dequeue q of (Some x, q') ⇒ (x, q')
    | _ ⇒ empty_queue)"

lemma %quote strict_dequeue'_AQueue [code]:
  "strict_dequeue' (AQueue xs []) = (if xs = [] then empty_queue
     else strict_dequeue' (AQueue [] (rev xs)))"
  "strict_dequeue' (AQueue xs (y # ys)) =
     (y, AQueue xs ys)"
  by (simp_all add: strict_dequeue'_def split: list.splits)
text ‹
Observe that the unspecified constant \<^const>‹empty_queue› occurs on
the right hand side of the definition of \<^const>‹strict_dequeue'›.

An attempt to generate code for \<^const>‹strict_dequeue'› would
make the code generator complain that \<^const>‹empty_queue› has
no associated code equations.  In most situations unimplemented
constants indeed indicate a broken program; however such
constants can also be thought of as function definitions which always fail,
since there is never a successful pattern match on the left hand
side.  To put a constant into that category
explicitly, use the @{attribute code} attribute with
‹abort›:
›
declare %quote [[code abort: empty_queue]]
text ‹
\noindent Then the code generator will just insert an error or
exception at the appropriate position:
›
text %quote ‹
@{code_stmts strict_dequeue' constant: empty_queue strict_dequeue' (Haskell)}
›
text ‹
\noindent This feature, however, is rarely needed in practice.  Note
that the HOL default setup already includes
›
declare %quote [[code abort: undefined]]
text ‹
\noindent -- hence \<^const>‹undefined› can always be used in such
situations.
›
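text ‹
\noindent As a small illustrative sketch, consider a head function that
falls back on \<^const>‹undefined› for the empty list; the constant
‹hd'› is made up for this purpose only.  In the generated code, the
\<^const>‹undefined› branch turns into an exception:
›

definition %quote hd' :: "'a list ⇒ 'a"
  where "hd' xs = (case xs of [] ⇒ undefined | x # _ ⇒ x)"

text %quote ‹
@{code_stmts hd' (Haskell)}
›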
subsection ‹If something goes utterly wrong \label{sec:utterly_wrong}›
text ‹
Under certain circumstances, the code generator fails to produce
code entirely.  To debug such situations, the following hints may prove
helpful:
\begin{description}
\ditem{\emph{Check with a different target language}.}  Sometimes
the situation becomes clearer if you switch to another target
language; the code generated there might give some hints what
prevents the code generator from producing code for the desired
language.
\ditem{\emph{Inspect code equations}.} Code equations are the central
carrier of code generation. Most problems occurring while generating
code can be traced to single equations which are printed as part of
the error message. A closer inspection of those may offer the key
for solving issues (cf.~\secref{sec:equations}).
\ditem{\emph{Inspect preprocessor setup}.}  The preprocessor might
transform code equations unexpectedly; to understand what is going on,
an inspection of its setup is necessary (cf.~\secref{sec:preproc}).
\ditem{\emph{Generate exceptions}.}  If the code generator
complains about missing code equations, it can be helpful to
implement the offending constants as exceptions
(cf.~\secref{sec:partiality}); this allows at least for a formal
generation of code, whose inspection may then give clues as to what is
wrong.
\ditem{\emph{Remove offending code equations}.}  If code
generation is prevented by just a single equation, this can be
removed (cf.~\secref{sec:equations}) to allow formal code
generation, whose result in turn can be used to trace the
problem; a sketch follows below.  The most prominent case here is a
mismatch in type class signatures (\qt{wellsortedness error}).
\end{description}
›
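text ‹
\noindent As a purely illustrative sketch of the last hint, all code
equations of a constant can be discarded at once; here
\<^const>‹strict_dequeue'› from above merely serves as a stand-in for an
offending constant:
›

declare %quote [[code drop: strict_dequeue']]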
end