CHROME
CHROME is the native high-level programming language of ARGON. While HYDROGEN exists as the low-level language, it's not intended as a language for much programming; it mainly exists as a hardware abstraction layer for CHROME (in terms of basic execution environment and code generation for different architectures), HELIUM (in terms of interrupt handling and process state switching), TUNGSTEN (in terms of providing access to block storage devices), IRIDIUM and FLUORINE (in terms of providing access to the network hardware), NEON (in terms of providing access to user interaction devices), and so on. HYDROGEN is a deliberately simple language, with few features for large-scale programming.
Also, HYDROGEN has an unprecedented level of access to the low-level aspects of its own implementation. In HYDROGEN one can write code which writes random numbers to random locations in memory until the system falls over. Therefore, HYDROGEN code cannot be accepted from untrusted sources. Apart from low-level libraries, things involved in bootstrapping, and the occasional bit of hand-coded stuff that requires particularly high performance, as much HYDROGEN code as possible should be generated by higher-level language compilers - such as the CHROME compiler.
The design goals of CHROME are, in order of decreasing priority:
- Safety. It should be possible to run untrusted code, giving it access only to resources explicitly granted, with no way for the untrusted code to obtain access to other resources or interfere with the operation of other parts of the system (although issues such as the consumption of CPU time and memory are outside of the language's scope; HELIUM exists to control those). This implies generating thread-safe code, although that too is mainly an issue for other parts of the system.
- Ease of programming. The language should be simple to understand, and easy to reason about. There should be the minimum of surprises and gotchas. The language should be expressive enough to handle complex concepts naturally.
- Efficiency of implementation. The compiler should be able to easily convert the input source code to efficient HYDROGEN bytecode with the minimum of complex time-consuming analysis. There are two efficiencies at stake here - the efficiency of the compiler, and the efficiency of the resulting code. I'm also conflating time and space efficiency, too.
Summary description
And I hope to attain these goals with the following features, in no particular order.
- Code is data. Put another way, CHROME is a Lisp dialect, although quite an unusual one. Rather than Lisp's s-expressions, we have IRON ions.
- Limited access to the representation of internal state. Basically, you can't make up a pointer to something: you either get given pointers to things, or create your own objects. This, combined with the structure of the libraries for communicating with other parts of the system, provides the safety properties.
- No side effects (apart from the consumption of space and time, to be pedantic). In other words, referential transparency: every expression, in a given environment, has a single unique value.
- Uniqueness typing. The efficiency of destructive update can be brought back this way, without breaking referential transparency. It also allows the expression of imperative communication with other parts of the system.
- The ability to break referential transparency with side effects. What? Doesn't that contradict the above? Well, yes, but in order to break referential transparency, the programmer has to explicitly use a feature clearly marked as unsafe: basically, a way of encapsulating a linear state variable into a non-linear box, and another function that takes the nonlinear box and a function mapping the linear state variable to a new value, and applying that function to the linear value contained within the box, then destructively updating the box. This lets you implement caches (for memoising functions, for example). If you use it to break referential transparency, then you only have yourself to blame for the consequences; it won't break the safety requirements of the system, so only your code will feel the consequences, too...
- Generic functions. I think of these as a form of object-oriented programming, but the term "OO" has attained a lot of connotations that don't apply here, so I'll just stick with the term "generic functions". Or, to put it another way, if you believe OO is a good thing, then CHROME is OO. If you think OO is a bad thing, don't worry - CHROME isn't OO.
- First class functions. In this day and age, thankfully, I no longer have to justify this one; people are beginning to see the light.
- First class continuations. These are still a bit contentious, but becoming more popular; combined with macros, they let you express any control-flow construct you like.
- Hygienic macros, something like Scheme's syntactic closures, but with a few more reflective tools: the lexical environment can be examined to extract lists of bindings (be they constant value bindings (which macro bindings are just a form of), or unknown variables that bind to arguments of lambda expressions, and thus refer to a location in a run-time activation record that does not yet exist), and a function is provided to expand subforms of the macro in a given lexical environment.
- Partial evaluation, in that expressions with no unknowns in can be computed at compile time and a literal value substituted. This is a form of optimisation, but more importantly, in turn enables:
- Second class values. Things like macros need to be known at compile time but, unlike in Scheme, they can still be computed by expressions - as long as the values of those expressions are decidable at compile time. Therefore, macros and, indeed, lexical environments in general can be passed as parameters to function calls and the like. They are almost first class, with the one restriction that they all be decidable at compile time. Why is this a good thing? Because it makes the handling of environments and macros simpler. A let macro can take an environment and a body, and put the body in that environment by wrapping it in a closure and then applying it (or something more involved involving fixed-point combinators to express recursion, probably). Then one can use let with an inline literal environment value, or use an environment supplied from elsewhere. Environments can be modified to, for example, rename their symbols. Modules become easy to implement.
- No keywords. The special forms of the language (fun, if, and so on) are not magical symbols that are special-cased into the compiler. Rather, symbols like if are just bound in the default environment to special values that represent the special behaviour of quote forms. Thanks to partial evaluation, those symbols will be replaced by their special-form values, and those are what are special-cased in the compiler. This means that special forms are, themselves, second-class values. Interestingly, this can be used for optimisation: basic operations like + can be represented in the initial environment by special-form values that are also applicable as closures, so that a direct addition can be recognised as such and special-cased, while + can still be dereferenced to obtain a first-class function to pass to a higher-order function.
- Lexically scoped, but with a dynamic environment. A dynamic environment is useful for storing Scheme-style parameters, exception handlers, and so on.
- A very simple standardised core language, but a standard library of macros that provide higher-level forms in terms of the core. For the core primitives, I'm currently considering fun to create closures, ' to quote literal values, and if to express conditionals; although the names that are bound to these in the initial environment may be prefixed with something and macros provided to convert more featureful versions of them to the lower-level versions.
- As reflective as is reasonably possible, mainly to support smart debugging. Eg, a continuation should be examinable to obtain a "stack trace" and a look at the dynamic environment chain. It should be stressed that programmers shouldn't use this to produce overengineered metaprogramming, such as macros whose behaviour changes drastically depending on the context they are used in, etc; these interfaces exist to allow debugging tools access to the details of the context.
- How do we combine the above with our safety requirements? With the fact that we only actually support partial continuations, plus the dynamic-firewall function. We provide a call/cc function, but it just creates a partial continuation up to a marker stored in the dynamic environment and placed by the nearest dynamic firewall, thereby preventing multiple returns from a firewalled region of code due to continuations leaking out of it unless other continuation markers have been explicited passed in. Also, a dynamic firewall acts as a marker beyond which the continuation chain and dynamic environment stack cannot be introspected. Perhaps we should add flags to dynamic-firewall to list specific things that should be allowed through the firewall, but by default, a dynamic firewall prevents the code called within its dynamic extent from meddling with things from without by reflection.
- The controlled ability to peek inside closures. A closure closes over a number of environment bindings; closures may opt to expose some of them as an IRON record, so they can be examined, and modified (non-destructively; being a pure functional language, modification functions take a value and details of the change to make and return a new value... unless the original value was linear, in which case they implement this by mutating the existing one and returning it, safe in the knowledge that the unique type will guarantee referential transparency).
- Tail recursion, allowing efficient execution of elegantly expressed algorithms.
Implementation
I've spoken in terms of a compiler above. We might build a CHROME interpreter... but I'd rather make the compiler fast enough that one can just read input expressions, compile them, run them, and then let the compiled form be garbage collected. Making the compiler faster would be a better use of resources than writing an interpreter. Definitions given to the interpreter can be compiled and the resulting defined value placed into an environment object, while expressions can then be compiled within that environment and executed and their results displayed.
Since the core language is so simple, I propose that a compiler for it be hand-coded in HYDROGEN; a very basic non-optimising naive compiler. That compiler can then be used to compile the core library of macros and functions, then providing a base system that can be used to compile the "real" optimising compiler, which will be written in CHROME. Said compiler can then compile itself again, producing an optimised version, at which point the old compiler and the basic library it compiled can be discarded, the basic library compiled again with the optimisation compiler, then we're ready for business. This resulting compiler and basic library can be cached for future use, if the cost of re-bootstrapping from the ground up on system boot is too great, but I hope that I succeed in my goal of an efficient compiler for an efficiently compileable language sufficiently to enable a full bootstrap to occur on every boot, as it makes the bootstrap process simpler, and thus easier to audit for security.
As for the structure of the compiler: the excellent book Compiling with Continuations, especially with the more recent insights that have come up since.
Open questions
It's easy to assemble a grab-bag of your favourite programming language features ("My dad is like Spiderman but with Superman's laser eyes and Batman's car!"), but the devil is in how to fit them together. However, rest assured; I know what I'm doing... mostly. There are a few little niggling issues I still want to sit down and explore. Perhaps I'll have to do some experiments.
- Linear typing has far-reaching effects on the programming model. Functions that extract part of a compound object (eg, an element of a vector) have to return both the element *and* the original object; ideally as an easily-ignored second value, so that when dealing with non-linear data one can pretend they don't. In general, lots more functions return multiple values, requiring really easy syntax to deal with them. "Linear programs" usually have a form like a big long let*-values; lots of statements that call a function then bind the results to names that we then use in subsequent statements. In this way, it starts to resemble imperative programming somewhat - which is no surprise. We need to make the standard library of functions and macros makes parts of programs that deal with lots of linear state natural, while not penalising the more traditional parts of programs.
- We need to enforce linearity of linear-only values (such as the world) up front as much as possible, with static type checking. But Lisp being Lisp, we'll sometimes (often?) need to enforce it at run time as well.
- In particular, the way continuations (and the structures that build upon them, like exception handlers) are handled needs to take account of linear state. An exception in the middle of some code that deals with one or more linear states must wrap all those states into the exception for the exception handler to then pass on. Ok linear values that have destructor functions can be automatically destroyed (eg, things that just represent a value in memory) but objects like the "world" need to be kept track of, or they'll go missing. Doing this elegantly might be interesting. Do we need to define some automatic way of capturing linear state variables from the dynamic environment as we unwind it? That could truly suck.