WOLFRAM
WOLFRAM provides distributed control and data sharing primitives. It is used purely within clusters, and uses IRIDIUM as the underlying transport protocol. It uses a different UDP port to MERCURY in order to make it easy to protect a cluster's internal communications behind a firewall while still allowing MERCURY access to public services.
It is used by TUNGSTEN in order to provide the distributed access to replicated entity state, and can also be used by parallel programming toolkits one day in order to allow things like distributed large-array operations. It also provides cluster health monitoring services in order to notify various parts of the system of varying connectivity between nodes.
The core of WOLFRAM requires a list of connections between nodes within the cluster. This is stored in the cluster entity itself, since it is global configuration information for the cluster.
The format of this information is a list of the nodes themselves, including their globally accessible network addresses (eg, IP numbers), then a list of communication groups that contain a number of those nodes along with various properties of that interconnection. All nodes are automatically members of the 'Global Network' group, which is considered slow, lossy, bandwidth starved, and insecure. If a subset of the nodes happen to sit on a fast LAN together, then that LAN should be a communication group containing those nodes with appropriate properties. If two of the nodes are connected by a serial cable, then that cable should be a group with only two members - the connected nodes. The cable will be highly secure, low latency, low loss, but bandwidth starved. WOLFRAM will know to only use it for heartbeats or as an emergency interconnect should the network go down.
WOLFRAM uses this information in a myriad of ways. It uses it for routing intra-cluster traffic efficiently, particularly heartbeats. It will take a skipped heartbeat on a serial link more seriously than a skipped heartbeat across the Internet. It will use encryption when communicating across the Internet, but not across a trusted serial link or LAN. It will use the overlapping groups to construct an efficient spanning tree for heartbeats to keep all nodes within the cluster aware of the liveness of other nodes, and to reconfigure that tree if a non-leaf node drops out.
WOLFRAM is a superset of MERCURY. It provides the MERCURY protocol for inter-entity communcations within clusters; it replaces the authentication and encryption side of MERCURY with WOLFRAM's own shared secret based security, and adds some extra fields to requests so the handler runs with the priority of the caller. It doesn't need to do the stuff MERCURY does with the lists of node addresses in the cluster ID, either, since the clustering stuff is handled by WOLFRAM itself.
However, it also adds a load of distributed algorithms. By using the node database replicated to all nodes by TUNGSTEN, it can define groups of nodes to take part in an algorithm in terms such as "All nodes mirroring a particular volume", "All nodes trusted above level X", and so on.
WOLFRAM provides for a given piece of code to be run within the context of the calling entity on any subset of the nodes mirroring that entity. This is what TUNGSTEN uses, and it can be used to implement parallel algorithms, with lots of nodes performing processing on different parts of a large data set within an entity. This is a reliable multicast protocol.
WOLFRAM is what knows when a node has become isolated from the cluster due to network partitioning, or when a node is restarting and rejoining the cluster, or joining the cluster for the first time. It will broadcast these events to other node subsystems, like TUNGSTEN. TUNGSTEN in particular will need to know about this kind of stuff so it can make sure that it keeps mirrors of volumes on enough active nodes. If all the nodes mirroring a particular volume start going offline, it had better elect other nodes to start taking copies before all the originals are gone. It will also need to know when the node reconnects to the cluster (thanks to the network being fixed or a broken node rebooting) so it can bring its local data up to date.
WOLFRAM provides a cluster-wide monotonic clock, using Lamport's algorith, plus an approximate real time clock, using something like NTP running over the ping messages, which nodes in the cluster exchange anyway to measure each other's liveness.
WOLFRAM will be hard to implement. Not theoretically tricky - the relevant theory can be plucked from the literature - but those algorithms are convoluted and hard to debug. And we will need lots of them.
Some of you may be complaining that Wolfram isn't an element. It is. It's another name for Tungsten, and is the preferred name for that element according to the International Union of Pure and Applied Chemists. Note that the chemical symbol for Tungsten is W. There is clearly a strong relation between WOLFRAM and TUNGSTEN. TUNGSTEN is a great name for a distributed data store since it implies hardness and reliability in the face of adverse conditions, Tungsten having the highest melting point of any element, while WOLFRAM has the same connotations chemically - but it's a distributed programming toolkit, and there is a tradition of naming those after things involving wolves, due to a seminal work on clustering featuring dogs on the front cover. Microsoft's offering is called Wolfpack, for example.
The other users of WOLFRAM than TUNGSTEN will, one day, include a parallel programming toolkit; large array operations could be implemented by using WOLFRAM to initiate parts of the array operation on every node carrying a part of the array, and so on.
Implementation
At heart, WOLFRAM uses IRIDIUM. Because the master list of nodes is available to all nodes, in the cluster entity, and because there are cryptographic keys stored in there for each node security level, communication from node A to node B will always be encrypted with the key for the lowest security level of the two. Since every node knows the key for its own security level and all those beneath it, the recipient will always be able to decrypt what the sender has sent.
WOLFRAM builds and maintains a spanning tree from the node database in the cluster entity. This spanning tree is stored in the cluster entity.
Neighbours on the spanning tree send each other 'ping' messages at regular intervals. If no pings arrive from a peer for too long, then the tree is considered potentially broken, and the tree formation algorithm is restarted.
During tree formation, nodes ping other nodes in the cluster and establish each other's relative reachability, including gauging the quality of the link. Information about the status of the network from before the algorithm starts is used as the basis for these statistics; if a link has just been marked potentially broken, then the system will attempt to find a node that can contact the missing node, in order to restore the link over a different path because sometimes the Internet is partially partitioned, with pairs of nodes becoming isolated from each other but not from the network as a whole.
However, sometimes the tree is indeed rent asunder. In which case, both sides of the partition will find that lots of nodes now appear to be unreachable. Two small spanning trees will then form.
However, when a distributed lock is required, the quorum is still calculated base upon the actual size of the cluster; if either side of the partition is too small to establish a quorum, then both sides might become read only. To see why this is the case, consider the network being split into three partitions; can any partition tell it is not just a lonely third, with two thirds of the cluster remaining on the other side? No. So it cannot tell that it is safe for it to establish a quorum on anything.
In the event of a disaster, the system administrator may connect to one of the partitions and explicitly delete the unreachable nodes from the cluster. This is made possible by the fact that the node list does not need a lock to be modified, so will not suffer from the inability to form a quorum. At which point, the partition can now host its own quora. However, when the partition is regenerated, the orphaned partition will not be able to resynchronise, due to the nodes not being recognised by the cluster. They will need to be re-added to the cluster from scratch, and their mirrors rebuilt.
Reliable multicast is achieved by using the spanning tree, too. Partioning of the network is accounted for in the multicast protocol to provide Extended Virtual Synchrony.
In order to keep the monotonic logical clocks synchronised, every node has a 64-bit logical clock counter. This counter is incremented every time an event that needs a logical timestamp occurs on a node, and the new value used as the event's timestamp; therefore, no two events on a node get the same timestamp. In the event of a WOLFRAM message being transmitted (even a ping), the timestamp of the message send event is stored in the message. When a message is received, the timestamp of that event is one plus the maximum of the timestamp stored in the message and the current logical clock on the receiving node; the local logical clock counter is set to this value, to ensure that subsequent events on this node get later timestamps.
However, events in two different nodes may then be given the same event ID. In order to define an arbitrary ordering on these events (and it can be arbitrary, since any activity between nodes is dictated by the rules about advancing the local clock on message receipt, so between nodes all events are ordered causally), we tie break by ordering on the node ID number where the event occured if the logical timestamps of two events are the same. Therefore, whenever a logical timestamp is stored somewhere, the node ID responsible must either be stored with it, or implicit (in the case of a message coming from a node, for example).
Distributed locking
WOLFRAM must provide a distributed locking system within the cluster, for TUNGSTEN to manage the distribution of storage.
As discussed in the locking section of the MERCURY page, we want to provide both pessimistic and optimistic concurrency control, with adjustable quora sizes.
We merge both optimistic and pessimistic concurrency control into a single notion of a lock, even if that lock is quite virtual in the optimistic case! The operations on these lock objects are:
- Lock getPessimisticLock ( TransactionID transaction, Group volume, byte[] resourceId, int quorumPercent, boolean exclusive)
- Creates a pessimistic lock object, and then claims it in the name of the caller, blocking if necessary. The volume parameter refers to a group of nodes, the group which are eligible to take part in the quorum; for an exclusive lock, floor (quorumPercent * sizeof(volume) / 100) + 1 of the nodes in the volume need to be locked. For a non-exclusive lock, sizeof(volume) + 1 minus that number are required. The resourceId is not interpreted in any way by WOLFRAM, but is used to identify the resource being locked.
- Lock getOptimisticLock ( TransactionID transaction, Group volume, byte[] resourceId, int quorumPercent, boolean exclusive)
- Creates an optimistic lock object, which really just involves recording the parameters in a structure and returning that.
- void releaseLock (Lock l)
- Releases the lock. This does nothing for an optimistic lock, and for a pessimistic lock, releases the lock on all the nodes it has been claimed on.
- void checkRead (Lock lock, MonotonicTimestamp lastModified)
- For a pessimistic lock, does nothing. For an optimistic lock, checks that the last-modified timestamp of the resource the lock is associated with is less than the timestamp at which the transaction was created, and if not throws an exception to restart the transaction that locked the resource.
- void checkWrite (Lock lock, MonotonicTimestamp lastModified)
- For a pessimistic lock, throws an exception of the lock is not exclusive. For an optimistic lock, checks that the last-modified timestamp of the resource the lock is associated with is less than the timestamp at which the transaction was created, and if not throws an exception to restart the transaction that locked the resource. The caller is responsible for then setting the last modified timestamp of the resource to be the creation timestamp of the transaction.
- Group getQuorum (Lock l)
- Returns the group of nodes that the lock is actually held on.
A typical resource such as a shared data structure would handle an update request by gaining an appropriate lock object, then multicast to the obtained quorum of the lock to get them to update their local copies of the data. The handlers on each node will then register with the transaction's transaction manager so that they are told when the transaction commits or aborts, and then keep a note of the requested change against that transaction ID. Subsequent reads with the same transaction ID will return the stored "new" value, until the transaction either commits or aborts; if it commits, then the new value becomes the true value.