TUNGSTEN
TUNGSTEN is a distributed, replicated, transactional data store.
TUNGSTEN manages entities, which are given "entity IDs" (EIDs), unique within the cluster.
Each entity contains a number of records, each of which are either assigned names when they are created, which are then used to refer to them in future, or created anonymously and assigned unique IDs by the TUNGSTEN system. The named records are to be considered global variables within that entity, the persistent state of the entity across threads within it, while the anonymous records must be reachable from the named records lest they be garbage collected (another responsibility of TUNGSTEN).
For each entity, TUNGSTEN maintains a directory of records containing information such as their name. An entity might be, from the viewpoint of TUNGSTEN, be considered like an SQL database with many tables in it.
Since each entity is mirrored onto various nodes in the system, each record within it is likewise mirrored.
TUNGSTEN uses the distributed programming primitives of WOLFRAM to implement this distributed storage. The constructs it adds on top of WOLFRAM are fairly basic; on each node, TUNGSTEN manages all the persistent storage devices (one day, even things like tape drives - backups are the responsibility of TUNGSTEN).
At the lowest level, TUNGSTEN provides, on each node, one or more "volumes". Each volume has security constraints, as mentioned here; the TUNGSTEN system uses this to decide which entities to store on which volumes.
TUNGSTEN supports transactions, so it ties into the existing WOLFRAM/MERCURY transaction system as a transaction-aware resource. All activity occurs in the context of a transaction ID, and TUNGSTEN on a node registers itself with the transaction leader for any transaction it finds itself involved in to take part in two-phase commits or rollbacks of that transaction.
Each entity is identified by a serial number. Within each entity is then something like a little filesystem. It contains named and unnamed "byte vectors". Named ones are identified by a string name, unnamed ones by a serial number unique within the entity. The system provides primitives to create and delete named and unnamed byte vectors, and to return a new serial number known to not be being used within the entity.
Each byte vector contains a string of bytes, and the system provides an interface to read substrings of this string, or to insert or delete or overwrite substrings. Unlike POSIX files, it does allow you to (in a single operation) insert some bytes into the middle of a byte vector, and have the rest of the byte vector 'shift up' to make room.
TUNGSTEN associates an IRON type with every named byte vector. When called upon to mirror an entity to a new node, it allocates a context for that entity on the node then invokes the module corresponding to the type of each named byte vector, asking it to mirror it onto the new node. This is because different protocols are used to distribute tables, atomic values, and so on; and for each instance of the type we can also specify storage details such as expected read/write ratios, so quorums can be allocated appropriately.
Only named byte vectors need copying; unnamed byte vectors are merely used as parts of the internal representations of the object graphs starting with the named vectors as the root set, so they will get copied automatically as part of the copying of the named vectors.
Likewise, when an entity is no longer mirrored on a specific node, the TUNGSTEN system will inform the module responsible for handling byte vectors of its type that the mirror is to be desroyed. The module may then decide to alter the situation on other nodes; eg, ensuring that any information which this node currently has the only up to date copy of is replicated to other nodes before the mirror is terminated.
TUNGSTEN also relies on the library of modules to trace the object graph for it, allowing it to find out which unnamed byte vectors are actually reachable from the named ones for garbage collection.
The actual implemention of TUNGSTEN's lowest level in terms of physical disks will be done according to the principles of a log structured filesystem; the disk is divided into fixed-size segments. Old data is not overwritten; when something is changed, then a copy of it in its changed state is appended to the end of the current output segment. If that segment is filled, then an empty segment is chosen as the new output segment.
This means that the data for any given entity can be spread randomly around the disk as the addition of each bit of it to the entity was logged, and that old data that has been overwritten is lurking around in the log, but dated before the up to date information. This means access is slowed down due to fragmentation, and that space is wasted.
Therefore, a background task (that becomes foreground if there is a shortage of empty segments!) called the sweeper garbage collects by picking each entity in turn, and starting with its named byte vectors, traverses the object graph, copying the most recent data for each byte vector in turn into the current output segment as a contiguous block, potentially interleaved with new log entries of changed data being produced by the running system; it then marks the scattered log entries that it gathered the entity's state from - both out of date log entries that have been superseded by more recent ones, and the most up to date ones themselves - as unused. Any totally unused segments are then put into the pool of empty segments. The sweeper will, of course, not do this to an entity that has not changed since its last sweep, and is as such already contiguous in the log and does not have any outdated information lurking around.
As such, the system tends towards compactly stored entities. The log structure means the system can efficiently store things like single integers, without needing to allocate entire blocks for them. And it makes implementing transactions a snap, since old versions of a byte vector can be accessed read-only while a write transaction is generating a new version further up the log, just be restricting ourselves to only looking at information stored before the start of the write transaction in the log. When the write transaction commits, then reads can start to be served by the new data at the head of the log. If it aborts, then any partial new information can be left to be swept away by the sweeper.
At the top level of TUNGSTEN, CHROME code running in the context of any given entity can transactionally access persistent objects. It can request an object by the name of the byte vector containing it, and then traverse the object graph from there to reach an arbitrary persistent object, accessing them as if they were any other CHROME mutable object. However, when their transaction commits, those objects get updated as byte vectors; the CHROME class is invoked to present the instance in terms of IRON types, which the TUNGSTEN system then stores, using the appropriate distributed algorithms for the type at hand.