MERCURY
MERCURY provides an IP protocol for the invocation of ARGON services, based on IRIDIUM.
A high level overview of how entities register their external interfaces that can be invoked via MERCURY is given on the Entities page, but here we consider some detail.
Every entity is identified by an Entity ID. The ID consists of the Cluster ID, which specifies the physical network addresses of all the cluster nodes that are configured (in the cluster entity's node database) to accept MERCURY requests, along with cryptographic keys (discussed later), a version number (for the cluster ID can change as nodes add and leave the cluster, or the crypto keys are renewed), the number assigned to the volume containing the entity, and the number of the entity within that volume.
Within the entity is stored a list of protocols provided by the entity, along with information about how those protocols are to be implemented - references to the code that will handle those protocols, rules that can be used to handle requests without awakening the entity code at all, and access control lists.
However, an entity ID may also be tagged with a persona, as discussed on the Entities page. In which case, the persona field in the entity ID specifies an alternate set of protocols to provide, and an arbitrary object to be passed to the endpoints invoked via that persona-tagged entity ID along with their usual parameters.
Protocol definitions
A protocol definition document defines a protocol, for the benefit of both the client and the entity providing the protocol. The list of protocols provided by an entity specifies a reference to the definition document for each protocol.
The protocol definition document specifies zero or more endpoints within the protocol, and information about those endpoints. There are certain compulsory items of information about each endpoint, then an arbitrary dictionary of other information, for future expansion and optional values.
The structure of the document is a dictionary mapping endpoint names to endpoint descriptions.
The endpoint description constains a unique endpoint number, which is used in the wire protocol to save having to ship strings around as endpoint names, and allows efficient implementation of dispatch lookup tables rather than having to do string comparisions.
It also contains an endpoint type. The available types are:
- Asynchronous message sink, parametrised by a dictionary mapping parameter names to parameter types.
- Imperative RPC endpoints, which are parametrised by a dictionary mapping parameter names to parameter types, and a dictionary mapping return value names to return value types.
- Idempotent RPC endpoints, which are identical to the above except that they represent operations retrieving information from the entity, rather than performing an action; as such, the system is able to cache results and other such useful things. As well as the parameter and return value type dictionaries, this endpoint type can also have one or more cache rules specified; basically, patterns that match certain parameter value ranges, with associated return values to be returned if that pattern matches. This way, the system can respond to invocations of the endpoint without actually having to load and compile any entity code whatsoever; also, the cache rules can be broadcast out to proxy servers to handle requests outside the cluster itself. This is only possible if the endpoint ACL allows public access to the endpoint, or if the proxy is trusted to process the ACL faithfully; either way, the cache rules are signed by the cluster, so the client can check their authenticity. Also, when the endpoint is actually invoked, the handler may opt to return either a specific return value (applicable to those precise parameter values only), or a general rule that covers the particular parameters and others; in either case we now have an extra rule that can itself be cached by the system.
- Connection request RPCs, which have the same parameter and return value type dictionaries as per imperative RPC endpoints, but also have a reference to a connection protocol definition document. This endpoint is used much like a normal RPC, except that the caller must provide a handler object to handle all the endpoints at their end of the connection; the server may either reject the connection request, or return by specifying a handler for its end of the connection. Note that unlike TCP connection initiation, the connection setup call lets the caller send parameters with the connection setup, and the connection can be refused on the basis of incorrect parameters, or insufficient access control rights; the connection setup handshake does not need to happen before authentication. Connection request endpoints have an extra required parameter: whether the connection will require the ordering of requests to be preserved by the network.
Each endpoint also has QoS requirements listed in the protocol description. A recommended drop priority is available for asynchronous message sinks. All endpoint types can have delivery priorities. And minimum encryption levels may be specified here.
Within the entity, when the protocol is declared within the Mercury interface object, the entity may override the QoS or encryption levels from the protocol definition document; however, this may mean that clients attempt to communicate with insufficient QoS or encryption, and the server node's protocol stack has to impose a higher QoS requirement on the reply than the request, or ask the client to negotiate a larger encryption key and use a higher encryption level. The entity must also store access control lists - ACLs - for each endpoint in the interface object.
Some underlying networks (unlike the current Internet) may allow for reserving bandwidth. If bandwidth reservation is requested as part of the QoS level, then an AURUM budget object may be required, to be charged for utliisation quotas, or for actual currency to pay for network services.
Also, within the entity's interface object, it has to provide CHROME code to actually handle incoming activity on the endpoint.
Connection protocol specifications
MERCURY connections all have specified protocols that list the endpoints each end of the connection must provide to the other.
This connection protocol specification is referenced in the declaration of the endpoint that is used to request connections, so that both client and server ends of the connection are speaking the same language (although they may well speak different versions of it).
A connection protocol specification document looks a bit like a normal protocol spec doc; however, there are two seperate endpoint dictionaries, one for those provided by the end that requests the connection, and the other for the end that hosts the connection request endpoint.
Each endpoint has the usual name, unique numeric ID, type, QoS, and security level specifications, plus arbitrary other metadata.
When a handler object is created by client or server to handle each end of a connection, handlers for all of the endpoints applicable to that end of the connection must be provided. Unlike entity protocols, however, no ACL is needed; only the entity at the other end of the connection can use the endpoints, so access control is a moot point. Again, higher QoS and security requirements than specified in the connection protocol definition can be set up by the actual handler.
Bandwidth reservation
When a connection request endpoint is invoked, the caller may request a bandwidth allocation. The bandwidth required for each direction must be specified seperately!
However, note that the connection protocols may contain connection request endpoints. Connections created within connections share their parent's bandwidth reservation, even if they reserve bandwidth themselves!
Security
The cluster ID has two key hashes in it; one for a cluster master public key, and the other for the regular cluster public key.
In a single IRIDIUM message request away, any MERCURY node in the cluster can be asked for the actual public keys, for they are stored in the cluster entity, along with the regular cluster private key. The master cluster private key is stored offline; the regular cluster public key is signed with it.
If the returned regular public key does not match the hash in the cluster ID the client has, either somebody's trying to spoof the cluster, or the regular key has been changed. In which case, the master public key must be requested, and compared with the has in the cluster ID. If it doesn't match, then we must abort, since there's clearly something horribly wrong. If it matches, then the master public key is used to check the signature on the new regular public key.
The cluster public key can then be used to encrypt messages to the cluster, in order to ensure that only the cluster receives them. It can also be used to encrypt a Diffie Hellman key exchange in order to negotiate a session symmetric key for encryption of later communications.
The client may be asked by the server to sign a challenge with its own cluster's regular private key, in order to prove the identity of the client.
See my blog article on ARGON cryptography for more information on how encryption is handled.
Transactions
ARGON transactions are distributed, and nestable.
In order to implement distributed transactions, every MERCURY endpoint invocation must carry within it a list of all the nested transaction IDs that the invocation context is in. This is required so that the endpoint handler may return correct information when accessing data objects that have been modified in a transaction, but not comitted yet; and so that update operations are associated with the transaction.
Since transactions nest, there is a list of transaction IDs, in nesting order. Each transaction ID is the physical network address of the node on which that transaction was created, and a numeric transaction number. These can be used to communicate with the manager of that transaction, which is a part of the MERCURY implementation running on the node where the transaction was created, about that ttransaction.
If a transactional resource is modified in handling a request, then a local transaction is created by the resource. This local transaction is then associated with the ordered list of transaction IDs, and the resource must contact all of the nodes listed in the transaction IDs and register itself with them to take part in two phase commit.
If a transactional resource is read from in a request, then any local transaction associated with the nested transactions is looked up to see if the information being read has been modified in this transaction; in which case, the uncomitted new value is returned rather than the 'true' value.
When a transaction commits or aborts, the transaction manager on the node owning the transaction contacts all transactional resources registered with that transaction, and performs the two-phase commit protocol.
Some operations are irreversible; these operations should generally refuse to work at all if invoked within a transaction. However, this can be assessed on a case by case basis. Some operations may have to register custom rollback actions; the action of printing a log entry to a line printer can be rolled back by just printing a new line retracting the old line, in many cases.
If a nested transaction commits, then the changes in that transactions are still not made permanent; they are merely pushed up into the parent transaction. Only when a top level transaction commits is the global state actually changed.
All of this transactional resource registration and two-phase commit is provided by MERCURY on top of the IRIDIUM message transport.
Locking
Where you have transactions, you also need locking to ensure ACID properties, where they are required.
Transactional resources may require locks to access them. Depending on the resource, there are two locking techniques that can be used - optimistic and pessimistic.
Either way, locking is technically advisory; transactional resources are not bound to use struct locking. In the case of a resources like TUNGSTEN data storage, each stored object may be annotated with transactional isolation level requirements, so that locking can be disabled if required.
As well as providing isolation between transactions, locking is also necessary for TUNGSTEN to manage distributed storage. If ten copies of an object exist on ten nodes, then one must establish read and write quora to ensure that reads produce the most up to date information, while writes are not lost, unless such protection has been explicitly disabled for that object.
Pessimistic locks
If a resource is to be locked pessimistically, then before any access is allowed, an appropriate lock must be held. If the resource is already locked, then the new accessor must wait for the existing lock to be released.
This causes delays which are not always necessary, and can also cause deadlocks, where two processes each hold a resource locked, and are waiting for a lock on the other's resource. The distributed transaction manager must examine the cluster's wait-for graph in order to locate deadlocks, and then send one of the participants in the deadlock a message that causes the blocked operation to throw a transaction restart exception. This exception is caught by the statement delimiting the transaction, and causes it to roll back and start again.
The choice of which transaction to roll back should be made based upon how much work that transaction will need to redo. If a transaction has irredeemably used up some resource (printer paper, energy, etc), then it should not be restarted if some other transaction can be instead. Therefore, some integration with AURUM would be helpful in the selection of a transaction to undo.
Optimistic locks
Optimistic locking, on the other hand, allows accesses to go ahead without first claiming a lock. Instead, a cluster-wide monotonic unique timestamp (provided by TUNGSTEN) is assigned to each transaction upon creation. Every resource keeps track of the most recent transaction timestamp that has comitted a modification to the resource. If a read request comes in that has a transaction timestamp older than the most recent commit timestamp, or a transaction with an older timestamp attempts to make a change, then the transaction is restarted with an exception as above, and started again.
Systems in which there is rarely contention are more efficient with optimistic control, since there are no deadlocks and no waits for locks. However, there is a high cost in restarting a transaction whenever there IS a collision, which is why we provide both mechanisms. One size does not usefully fit all.
Quora
The objects that locks are held on are generally distributed over several nodes. As such, to meaningfully hold a lock on such an object, you must hold the lock on enough nodes to ensure that no conflicting lock can also be held.
For example, with just read and write locks and ten nodes, one could say that a read lock must be held on at least three nodes to be valid, while a write lock must be held on at least eight. That way, several read locks may happily coexist - even if the three randomly selected nodes to take each lock out on do or don't overlap. But there is no way for two processes to seperately hold a read lock and a write lock, since they would have to collide on at least one node, which would cause the lock request on that node to block.
There's nothing magic about 3 and 8 - they just have to add to one more than the number of nodes. However, the ratio should be chosen depending on the expected transaction load, with the constraint that the write quorum (since write locks are exclusive) must be one more than half the number of nodes. If we expected a lot of writes compared to reads, we could have a write quorum of 6, and a read quorum of 5; then we only need to lock 6 nodes for each write, as opposed to 8, at the cost of making reads require two more locks.
Therefore, each transactional resource needs to decide on appropriate read and write quota sizes, depending on the expected read/write ratio. And it needs to deal neatly with the number of nodes changing at run time...
Inter-cluster transactions
After much reflection, I've decided that it's probably not worthwhile allowing transactions to occur between clusters. The problems that would need handling include:
- The potential for DoS. An open transaction needs to store a log of all writes, to make the writes "actual" when the transaction commits. It could be easy to keep opening transactions, doing as many writes as the protocol allows, then holding them open for prolonged periods, to use server space. Not to mention holding locks open.
- The need for the server to maintain IRIDIUM connections with every node that is a transaction master. Since transactions can nest, this may involve registering with an unbounded number of transaction managers, while within a cluster the number of other nodes to deal with is bounded.
However, this in turn opens potential problems; interfaces that use a transactional state model then can only be used in-cluster, which damages some abstractions. Perhaps MERCURY needs to not provide transactions at all, and transactions only be provided "in-entity", for entity code to access distributed TUNGSTEN data.
Dynamic bindings
CHROME provides a dynamic binding environment, for programmer convenience. This is a bit like POSIX environments; you "export" some environment variables in a shell script, and processes inherit them. When a request is made to a MERCURY endpoint, the request contains a context ID, which consists of a node address and a context number. If a dynamic binding variable is looked up in the handler, but no binding of it is found in the environment stored there, then the lookup is passed over the network to the node specified in the context ID. The context number and the name being requested are included in the request.
When a call is made to a MERCURY endpoint, the context ID is set to reference the dynamic environment of the caller. Incoming lookups will be handled by looking the binding up in that environment.
If a handler does not add anything to the environment before calling another handler, then the existing context ID can be passed straight through, rather than creating a new one that just ends up calling the old one to find anything. Otherwise, a new local context is created and used that looks stuff up in the local bindings, and otherwise passes the request back up the call chain.
This is used to carry extra information along with requests, much like transaction IDs. In particular, it can be used to look up the identity of a user agent requesting an action (all requests coming from user agents dynamically bind UserAgent to their own entity ID) for auditing or feedback purposes; and it can be used to allow access to resource budgets.
Hash cash
Optionally, a MERCURY request may have HashCash attached.
Certificates
Optionally, a MERCURY request may be signed by a specified digital certificate.
Interaction with the scheduler
An incoming MERCURY request causes the creation of a new thread to handle the request.
As discussed in my blog article on CPU scheduling, this thread's scheduling parameters should be derived from the desired allocation of system resources to the caller. As such, a set of administrator-defined ACLs from the cluster entity is checked to see which resource group the caller falls into, and the thread parameters set as such.
Access Control Lists (ACLs)
Access control lists are used in various places. They are used to say who can and cannot invoke MERCURY endpoints. They are used to assign scheduling parameters to handler threads.
An ACL consists of an ordered list of rules, each with an associated action. If the rule matches the incoming request, then the action is performed - either ACCEPT or REJECT. If no rule matches, then the ACL's specified default action is used.
The rules can be:
- A specific client entity ID
- A specific client cluster ID
- Is an entity in the same volume as me
- Is an entity in the same cluster as me
- The CARBON name of another ACL document to check against
- The request has at least a specific number of bits of HashCash attached.
- The request is signed by a certificate. The entity ID of the certificate matches an embedded ACL. This is used by entities issuing a certificate that allows, for a limited period, another entity to act on its behalf.
Implementation
All of this messaging is acheived atop IRIDIUM. Where possible, communication is done without setting up a virtual circuit. However, if a security level is required, then a VC is established to perform the key exchange within.
Also, any actual connections set up each get their own virtual circuit - except connections opened from within connections, which share the parent connection's VC.
When attempting to contact a cluster, a node from the cluster ID is either picked at random, or the list of node IDs is sorted by some estimated metric of "network distance" and the top one picked.
If the request fails, then a different node is tried. This is repeated until all nodes have been tried. Then we give up.
In every request to a cluster, the cluster ID version number is sent in the request. If the cluster's node list has changed, then the remote node's reply includes a new updated cluster node list.