CAESIUM
CAESIUM is responsible for making sure scheduled events occur. It performs a similar job to cron on a Unix system, except that the task is somewhat more complex in a distributed environment: each scheduled handler must run on exactly one node when it is due. Not zero nodes (even though any number of nodes may be offline), and not more than one node (even though the cluster may be partitioned by a network failure), in spite of the fact that it is impossible to tell, from any given node, whether unreachable nodes are unreachable due to node failure or due to a network partition.
As such, there will be situations in which CAESIUM cannot be sure whether it should invoke a periodic handler or not. If half of the nodes in the cluster are unreachable, are they down (in which case we should run the handler somewhere within the visible cluster) or are they just partitioned off (in which case, do we run the handler, or do we hope that somebody on the other side of the partition runs it)?
These cases also vary if the handler is bound to a specific set of nodes. In addition, each handler has a special CAESIUM property stored in its metadata, which decides whether, in the event of disrupted network conditions, it is better to risk running the handler more than once or to risk not running it at all.
So the general algorithm I propose is this:
On every node, CAESIUM listens to node liveness notifications from WOLFRAM and, for each volume, keeps track of the reachable node with the highest CAESIUM priority. The priority is by default computed from the available CPU/RAM capacity of the node, but can be administratively overridden; if more than one node has the same priority, the node ID is used as a tie breaker. So every partition of every volume in the cluster has a single distinct CAESIUM master, and every node in the partition knows who it is, except while partition membership changes are being computed.
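The master selection described above can be sketched as follows. This is only an illustration under stated assumptions: the Node type, the elect_master function, the numeric priority field and the rule that the lower node ID wins a tie are all hypothetical, since the text only says that the node ID breaks ties, not in which direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str     # hypothetical identifier, compared as a string
    priority: float  # hypothetical: derived from CPU/RAM capacity or an admin override


def elect_master(reachable_nodes):
    """Return the reachable node with the highest priority, or None if there is none.

    Ties are broken by node ID; taking the smallest ID is an assumption.
    """
    if not reachable_nodes:
        return None
    # Sort key: highest priority first, then smallest node_id.
    return min(reachable_nodes, key=lambda n: (-n.priority, n.node_id))
```

Because the result depends only on the set of reachable nodes, every node in the same partition that sees the same membership arrives at the same master without any extra communication.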
The CAESIUM master is responsible for examining the index of all CAESIUM handlers in the volume and deciding when one is eligible for invocation, by comparing its schedule to the cluster-wide synchronised time available from WOLFRAM. When a handler becomes eligible, the master has to decide whether to invoke it.
The master obtains the list of reachable nodes that carry the volume and, if the handler is bound to a specific set of nodes, filters the list to include only those eligible to run the handler. It then compares the size of that list to the size of the list of nodes that could in principle execute the handler if they were all reachable, that is, the list of all nodes that carry the volume, filtered by any node-binding specified for the handler.
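As a rough sketch of how these two lists might be computed, assuming nodes are identified by plain string IDs and that a handler's binding is either absent or a set of node IDs (the function and parameter names here are hypothetical, not part of CAESIUM):

```python
def candidate_nodes(volume_nodes, reachable, binding=None):
    """Return (reachable_candidates, all_candidates) as sets of node IDs.

    volume_nodes: IDs of every node that carries the volume
    reachable:    IDs of the nodes currently reachable
    binding:      optional set of node IDs the handler is bound to
    """
    all_candidates = set(volume_nodes)
    if binding is not None:
        # Only bound nodes may run the handler.
        all_candidates &= set(binding)
    # Of those, keep the ones we can actually see right now.
    reachable_candidates = all_candidates & set(reachable)
    return reachable_candidates, all_candidates
```

For example, with a volume carried by n1 to n4, a handler bound to n2, n3 and n4, and only n1 and n2 reachable, the reachable list is just n2 out of three possible nodes.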
If more than half of those nodes are reachable, then we have a quorum, and can certainly ask LITHIUM to invoke the handler for us (which will automatically result in an eligible node being chosen).
If not, we have to consult the handler metadata once more to see whether it is an "err on the side of running too many times" or an "err on the side of running too few times" handler. If we are asked to err on the side of running it too many times, we try to run it. Otherwise, we don't, and hope that the other side of the partition has a quorum and WILL run it.
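The quorum rule and the fallback to the handler's metadata could be expressed roughly like this. The PartitionPolicy enum and should_invoke function are hypothetical names for illustration, and refusing to run when no eligible node is reachable at all is an added assumption the text does not spell out.

```python
from enum import Enum


class PartitionPolicy(Enum):
    # Hypothetical encoding of the per-handler metadata property.
    RUN_TOO_MANY = "err on the side of running too many times"
    RUN_TOO_FEW = "err on the side of running too few times"


def should_invoke(reachable_count, total_count, policy):
    """Decide whether the master should ask LITHIUM to invoke the handler."""
    if reachable_count == 0:
        # Assumption: with no eligible node reachable there is nowhere to run it.
        return False
    if 2 * reachable_count > total_count:
        # Strictly more than half are reachable: we have a quorum.
        return True
    # No quorum: fall back to the handler's stated preference.
    return policy is PartitionPolicy.RUN_TOO_MANY
```

Note that exactly half is not a quorum under this rule, so in an even split neither side runs a "too few" handler, while both sides may run a "too many" handler.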