The IRON Data Model Specification

Alaric Snell-Pym Ltd

Foreword

IRON is a data model for values. Although I'm still deciding how the mutable data structures like queues fit into things (specifications of them are definitely needed for TUNGSTEN, but whether they count as part of IRON or not is something I'm still debating), I think I may have settled on a basic syntax for written values.

Now, the key requirement here is that IRON is, in the manner of S-expressions, usable to express just about anything - from source code to boring data. Creating a written data syntax that's pleasant enough to use day in and day out is quite a challenge. S-expressions come pretty close, but are deficient in a few areas. YAML is pretty good, but I wouldn't want to write source code in it.

The main thing I'm adding over s-expressions is Smalltalk-like syntax, which I will explain in detail below.

Textual form

Herein I define representations of IRON values in terms of sequences of characters. How those characters are encoded into a binary stream is another matter entirely.

Atoms

Integers

IRON integers can be anything from negative to positive infinity, subject only to the constraints of implementation. An implementation must be able to support at least what you'd get from 32 two's-complement signed bits, but should be able to support arbitrary precision integers, if memory permits.

There are several representations available:

- A sequence of decimal digits, with an optional leading - symbol, represents an integer in base 10.
- The sequence 0x followed by a sequence of hexadecimal digits represents an integer in base 16. Negative integers are written with a leading -0x, not 0x-.
- The sequence 0b followed by a sequence of binary digits represents an integer in binary. Again, negative binary numbers are written starting with -0b, not 0b-.
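The three integer representations above can be sketched as a small reader. This is only an illustration, not part of the specification: the function name parse_iron_integer is hypothetical, and accepting both upper- and lower-case hexadecimal digits is an assumption, since the text does not say which are allowed.

```python
import re

# Hypothetical helper; the name and the regex are illustrative only.
# Assumption: hexadecimal digits may be upper or lower case.
_INT_RE = re.compile(r"(-?)(0x[0-9a-fA-F]+|0b[01]+|[0-9]+)\Z")


def parse_iron_integer(text: str) -> int:
    """Parse a decimal, 0x-prefixed or 0b-prefixed integer literal."""
    match = _INT_RE.match(text)
    if match is None:
        # Covers forms such as "0x-5", where the sign follows the prefix.
        raise ValueError(f"not an IRON integer: {text!r}")
    sign, body = match.groups()
    if body.startswith("0x"):
        value = int(body[2:], 16)
    elif body.startswith("0b"):
        value = int(body[2:], 2)
    else:
        value = int(body, 10)
    # Python integers are arbitrary precision, so no range check is needed.
    return -value if sign else value
```

With this sketch, parse_iron_integer("-0x1F") and parse_iron_integer("-31") yield the same value, while "0x-1F" is rejected.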
Rational numbers

Rationals can be written in decimal, hexadecimal, or binary, as per integers; they distinguish themselves from integers with the presence of a / at some point, dividing the number into an integer numerator at the start and a natural (positive integer) denominator at the end. A base prefix may occur at the start of the rational value, and applies to both numerator and denominator; the denominator may not have a base prefix of its own. For example, 5/2, 0x5/2 and 0b101/10 all represent the same number.

Floating point numbers

Floats can be written in decimal, hexadecimal, or binary, as per integers; they distinguish themselves from integers with the presence of a . at some point. They may also optionally have an exponent appended, of the form e followed by an integer, with optional leading -, in the same base as the entire float (i.e., 0x and 0b prefixes are forbidden on the exponent, and its base is inherited from the float's prefix, if any). For example, 2.5, .25e1, 0b.101e10, and 0x2.8 all represent the same number.

Rational numbers

Rational numbers consist of a signed integer known as the numerator, and a positive non-zero integer known as the denominator. They are written using any of the above integer syntaxes, with the numerator first, then a /, then the denominator.

Characters

An IRON character can be any Unicode character. I'm going to have to look in the Unicode specs for the exact terminology, but by 'character' I explicitly disallow control codes, things like BOMs, surrogates, and combining characters. Those are things used to implement characters and representational details of strings, not characters themselves.

A character can be represented in several ways:

- A sequence of the form #' followed by the character then terminated by '. Eg, #'x'.
- A sequence of the form #ucs( followed by the UCS codepoint in hexadecimal then a closing ). Eg, #ucs(12ee).
Any combining codepoints or modifiers required to build the character can be added too, as they form one logical character, by appending them after a + sign; for example, #ucs(12ee+3c04).

Booleans

Booleans may have only two values, true and false. Boolean true can be written #t and false #f.

Symbols

An IRON symbol is a complex beast in some ways, and very simple in others. At heart, it is a list of strings. s-expression symbols are just strings, but IRON symbols are contained within namespaces, which are hierarchical sequences of names. In fact, IRON symbols are names in the CARBON directory, but that's irrelevant here.

However, the internal structure of symbols is generally unimportant, except in certain low-level operations. The important operation upon symbols is testing them for equality, and as such, implementations are encouraged to "intern" symbols into a global hash table, and just use pointers to each symbol's unique representation in memory as the symbol value, so that identity comparison is just a pointer test. Two symbols should compare equal if and only if they are identical lists of identical strings (case sensitive).

Symbols may not contain whitespace characters of any kind. No component of a symbol may start with a digit.

There are a few written syntaxes for symbols:

- The full unambiguous absolute path: every element of the path, in order, separated by / characters, bracketed by < and >. Special characters (/, \, <, and >) in each path component can be escaped with prefixed \ characters.
- Relative symbols, which refer to the current namespace, are just written as a sequence of one or more path components, separated by /, with :, / and \ characters escaped with prefixed \ characters. The actual value of the symbol is found by appending the supplied path components to the current namespace. If the first character of the symbol would be !, a digit, ;, (, [, {, < or # then it must also be escaped with a prefixed \ character to avoid ambiguity.
- Prefixed symbols, which refer to a declared namespace, are written as the namespace name (which has the syntax of a symbol path component) followed by : then one or more path components, separated by /. As before, :, / and \ characters may be escaped with a prefixed \. The actual value of the symbol is found by appending the supplied path components to the namespace bound to the supplied namespace name.

The declaration of the current namespace, or named namespaces, is explained later.

The empty list

The empty list, as in s-expressions, is written as ().

Nil

Nil, a value that explicitly represents the absence of a value, is written #nil.

Compound types

Pairs

IRON has pairs, also known as cons cells, just as Lisp does. Likewise, it has Lisp's syntax for pairs (and syntactic sugar for lists). A basic pair can be written as ( followed by the first value, then ., then the second value, then ). Linked lists of pairs are written just as in s-expressions: ( followed by a space-separated list of elements, then ). Improper lists can be written as ( followed by a space-separated list of elements, then . followed by the tail, and then ).

Maps

We inherit more from YAML than s-expressions in having a map type in the core. Maps are notionally sets of pairs, with the constraint that no two pairs in the set may share the first element. They might be implemented as a hash table, but there are many situations in which they should not be. The written representation for them is { followed by zero or more elements, written as space-separated pairs of values, with : characters separating them for readability, and the map is terminated with }.

Records

Records, however, inherit more from Smalltalk, at least in syntax. The notional representation of a record is as a symbol followed by a list of values. The symbol is known as the 'type' of the record, and the list of zero or more values is known as the 'fields'.
However, the written representation is somewhat special, and attaches special meaning to colons in the last component of the type symbol.

The simplest representation is for records whose type symbol has no colons in the last component. These are written as a [ followed by the type symbol (in any of the symbol representations listed above), then some whitespace, then the space-separated list of fields, terminated with a ]. For example, [+ 1 2 3].

However, if the last component of the type symbol contains colons, then the last character of the component must itself be a colon, or else the symbol cannot be used as a record type symbol. The last component of the type symbol can be considered as a concatenated sequence of colon-terminated strings known as the field names.

The written representation of a record with such a type symbol again starts with [, but it is now followed by the type symbol, including only the first field name. This can be written in any of the symbol syntaxes listed above, but it does not necessarily represent a symbol, merely part of a symbol. After this type symbol fragment we have the value of the first field, separated by whitespace. There can then follow a whitespace-separated list of zero or more fields, each represented as the field name (written as if it were a symbol relative to a current namespace, even though it is only a fragment of a symbol) followed by the field's value. The number of fields must be the number of field names in the record's type symbol; records with more or fewer fields than they have field names are invalid syntax.

For example, [if: [= x 1] then: [' hello] else: [' goodbye]] represents a record with the type symbol if:then:else: (relative to the current namespace), and fields [= x 1], [' hello], and [' goodbye], each of which is itself a record. Alternatively, [<foo/bar/baz:> 1 bam: 2] represents a record with the type symbol foo/bar/baz:bam: and fields 1 and 2.
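The field-name rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the names field_names and check_record are hypothetical, the input is the already-unescaped last component of a type symbol, and escaped colons are not handled.

```python
from __future__ import annotations


def field_names(last_component: str) -> list[str]:
    """Split the last component of a record type symbol into field names.

    Returns an empty list for a colon-free component such as "+".
    Assumption: the component contains no escaped colons.
    """
    if ":" not in last_component:
        return []
    if not last_component.endswith(":"):
        raise ValueError("a component containing colons must end with a colon")
    # "if:then:else:" -> ["if:", "then:", "else:"]
    return [name + ":" for name in last_component[:-1].split(":")]


def check_record(last_component: str, fields: list) -> None:
    """Reject records whose field count differs from the field-name count."""
    names = field_names(last_component)
    # Colon-free type symbols take any number of fields, e.g. [+ 1 2 3].
    if names and len(names) != len(fields):
        raise ValueError(
            f"expected {len(names)} fields for {last_component!r}, "
            f"got {len(fields)}"
        )
```

Here check_record("if:then:else:", [a, b, c]) passes for any three values, while supplying two or four fields raises an error.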
It is perfectly legal to have a symbol that happens to end in a colon as a value in a record; when appearing in a value position it will always be parsed as a value, but it would be easier on the eye to quote it with a \ character.

Vectors

Vectors are pretty similar to lists - but, don't forget, IRON has no explicit list type, just pairs. And pairs are, at best, a special case of vectors.

There are two basic kinds of vectors: general and homogeneous. General vectors contain lists of arbitrary values, while homogeneous vectors place restrictions on their contents.

General vectors are written as #< followed by a whitespace-separated list of values, terminated by >. For example, #<a b 1 2 3>.

Homogeneous vectors are usually written as # followed by a type name then <, the whitespace-separated list of values, then >. Valid type names are float, symbol, u8, s8, u16, s16, u32, s32, u64, s64, or char. The u8 and friends represent particular limited integer types - unsigned or signed (two's complement) integers of the specified number of bits. For example, #u8<1 2 3 4>, #symbol<a b c>, or #char<#'a' #'b' #'c'>.

However, vectors of characters can be represented more compactly by just enclosing the verbatim character sequence in " characters, after escaping any \ or " characters in the string by prefixing them with \. Also, characters may be written as their UCS codepoints, in the form \(x) where x is the code in hex, or a series of hex codes separated with + for multi-codepoint characters. For example, the #char vector above can be written "abc".

Other types

In general, other types can be represented by introducing extra syntax of the form # followed by a name then optionally (, some content, then ). Currently used names are t, f, nil, and ucs.

The top level

Given a sequence of characters to parse, the IRON written form parser considers the sequence to be some whitespace followed by a single value.
After the value has been successfully parsed, any remaining characters will be left for later consumption if the sequence is a sequential-access stream. If the sequence is a fixed-length string or other random-access character sequence, then it is an error for anything other than whitespace to remain.

Namespaces

Symbols may be represented compactly in the IRON written form by reusing common prefixes of the symbol path, by declaring them as namespaces.

The IRON written form parser maintains a current namespace environment, consisting of a default namespace symbol and a map from namespace names to namespace symbols. At the start of parsing an IRON written form, the default namespace is provided by the caller (and should be the CARBON name of the place where the IRON document came from, where applicable) and the map is {"fe": </argon/iron>}.

Before a value is parsed, the current namespace environment is saved, and restored after the parsing of the value.

The default namespace may be changed at any point in the parsing of a value where whitespace may appear, by introducing the syntax !defns followed by actual whitespace then a symbol which thereafter becomes the new default namespace. This namespace applies until overridden with another !defns declaration, or until parsing of the current value ends and the previous namespace environment is restored.

Named namespace bindings may be created, or existing ones overridden, by introducing the syntax !ns followed by actual whitespace, then a namespace name, some more actual whitespace, then a symbol which is thereafter bound to that name in the namespace map. Again, this binding will remain in effect until overridden, or the namespace environment is restored by the end of the current value.
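The namespace environment described above might be modelled roughly as below. This is a sketch under stated assumptions, not a definitive design: symbols are modelled as tuples of path-component strings, the class name NamespaceEnv and its methods are hypothetical, and the inputs are assumed to be already split into components with escapes removed.

```python
from dataclasses import dataclass, field


@dataclass
class NamespaceEnv:
    """A default namespace plus a map from namespace names to symbols."""

    default: tuple
    # Initial map as stated in the text: "fe" is bound to /argon/iron.
    named: dict = field(default_factory=lambda: {"fe": ("argon", "iron")})

    def copy(self) -> "NamespaceEnv":
        return NamespaceEnv(self.default, dict(self.named))

    def defns(self, symbol: tuple) -> None:
        """Effect of a !defns declaration."""
        self.default = symbol

    def ns(self, name: str, symbol: tuple) -> None:
        """Effect of a !ns declaration."""
        self.named[name] = symbol

    def resolve_relative(self, components) -> tuple:
        return self.default + tuple(components)

    def resolve_prefixed(self, name: str, components) -> tuple:
        return self.named[name] + tuple(components)


def parse_value_in(env: NamespaceEnv, parse):
    """Run a value parser against a copy of the environment.

    Working on a copy has the same effect as saving the environment
    before the value and restoring it afterwards.
    """
    return parse(env.copy())
```

Any declaration made by the hypothetical parse callback is visible only while that value is being read; the caller's environment is unchanged afterwards.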
Namespace bindings in the whitespace before a value in the top-level character sequence given to the parser are considered part of the value parsed, but ones found in whitespace separating elements of a record, map, list, vector, or other compound value are considered to be part of the compound value rather than simply part of the next value, and as such only "disappear from scope" at the end of that value.

Shorthands

The IRON data model has no inherent notion of a comment, because these are just handled as record types that wrap the value they are commenting upon. A comment is represented as a record with a type symbol of /argon/iron/note::. There are two fields, note: and :. The former is the body of the comment, and the latter is the value being commented upon.

As a shorthand for this unwieldy syntax, the textual representation also supports the use of the ; character to mark comments. If the ; is the first non-whitespace character on the line, then it is considered a comment on the next value in the stream. If it is not the first non-whitespace character on the line, then it is considered a comment on the previous value in the stream. Either way, the body of the comment is taken to be the string from the ; to the end of the line. If more than one comment applies to the same value by this rule, then they are concatenated together, separated by newlines.

For example:

    ; This is me
    [person: "Alaric Snell-Pym"
     speciality: "Vapourware" ; All I do is talk about it!
    ]

is equivalent to:

    [fe:note: "This is me" : [person: "Alaric Snell-Pym" speciality: [fe:note: "All I do is talk about it!" : "Vapourware"]]]

However, comments are often used for several different things; the case we have covered is actually applying notes to values, but they are often used to temporarily disable sections of something ("commenting out"). This is handled by wrapping the value in another record, with type /argon/iron/disabled::.
As before, the first field is a string describing the reason for disablement, and the second field is the disabled value.

Often, when something is commented out, it is temporarily replaced with another value. In which case, we can make this relationship explicit by using the /argon/iron/temp:use:: record. The first field is a string giving the reason for the swap, the second field is the value to actually use, and the third field is the value to ignore.

All three of the previous record types have something in common: they are intended to wrap around an existing value. This is a common pattern with records, but it can be difficult to place the trailing ] after the wrapped value if it has a large multi-line written representation, and as such, we provide a shorthand representation.

Any record type whose last field has the minimal name : - i.e., any record whose type symbol ends in :: - may be represented as a prefix record. This consists of writing the record value, but omitting the final : field. However, the trailing ] must be replaced by ]:, which causes the next value to be read and inserted into the record as an extra field called :. For example, to disable a value, one can just prefix it with [fe:disabled: "Not yet implemented"]:, and comments can be attached to objects by prefixing them with [fe:note: "This is a silly value"]:.

Closures, Continuations, and other such tricky things

Closures and continuations are all eminently serialisable; both consist of a reference to the code, and a closed-over environment. However, how to refer to code is beyond the scope of this standard. As literal closures, continuations, and other hairy objects like handles to specific objects will rarely need to be typed in by hand, they do not need especially pleasant syntactic sugar, either. Therefore, such objects are left to higher levels to represent as records with appropriate type symbols.
Binary form

TODO: Same model as textual form, different syntax

In-memory form

TODO: Defined in terms of HYDROGEN. And we need to define garbage collection for it. And mechanisms for going to and from textual form, both as a SAX-like eventy interface (for reasons that will become clear later) and direct to/from objects in memory. Also, linearity becomes an issue in some way. Newly loaded values are linear, but there are make-shareable and make-linear operations.