Implementation requirements seem to come in waves: several different customers come forward with the same specific requirement. Sometimes this is driven by a government RFI, so they are working from the same blueprint. Sometimes the companies are in different industries, looking for different results, and stumble upon the same solution requirements. One example of a different-source, same-result request was “How do we enable load balancing using RTI Connext DDS?”
Several recent customers had a specific need that wasn’t directly addressed by our configuration options or documentation. Thinking about the implementation requirements afterwards, I came up with a neologism that describes what they were looking for.
New Jargon: Delegated Durability
DDS supplies a Quality of Service (QoS) parameter called Durability. Durability controls whether data persists within the system so that it can be supplied to late-joining readers (those readers that show up after the data was written). Consequently, there is no start-up order dependency in the system.
There are four levels of persistence: Volatile (not persistent), Transient Local, Transient, and Persistent. The first two levels are managed by the original DataWriter. The last two, Transient and Persistent, use a dedicated participant on the domain (the Persistence Service) that captures Durable data, stores and maintains it, and republishes it as necessary to any late-joining readers with the same (or higher) level of Durability.
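As a point of reference, the level is selected per endpoint through the Durability QoS policy. A minimal XML QoS sketch (the kind names come from the Connext XML QoS schema; the profile name is illustrative):

```xml
<!-- Illustrative profile: durability is set per DataWriter/DataReader.
     TRANSIENT and PERSISTENT additionally require a running Persistence Service. -->
<qos_profile name="DurabilityExample">
  <datawriter_qos>
    <durability>
      <kind>TRANSIENT_DURABILITY_QOS</kind>
    </durability>
  </datawriter_qos>
  <datareader_qos>
    <!-- Endpoints match only if the writer offers a durability kind
         equal to or stronger than the kind the reader requests. -->
    <durability>
      <kind>TRANSIENT_DURABILITY_QOS</kind>
    </durability>
  </datareader_qos>
</qos_profile>
```

The other kind values are VOLATILE_DURABILITY_QOS, TRANSIENT_LOCAL_DURABILITY_QOS, and PERSISTENT_DURABILITY_QOS.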
“Delegated Durability” is offloading the Durability protocol management to a Persistence Service (PS). This sounds logical, right? I mean, isn’t this what it’s there for? Well, mostly. The PS works fine and as designed when the original DataWriter is no longer available. The inefficiency occurs when the service is holding onto data from a DataWriter that is still resident and publishing.
Durability is an ordered QoS parameter: Persistent is stronger than Transient, which is stronger than Transient Local. Durable data is maintained either by the original Writer alone (Transient Local) or by both the Writer and the Persistence Service. Both will answer a late-joiner's request with the historical data they maintain. So when a Durable reader is enabled, it gets a copy of the historical data from the Writer, as well as from each correctly configured Persistence Service. It then also gets multiple copies of everything new: one directly from the Writer, plus redirected copies from any services.
What we’re looking for then, is to delegate the Durability protocol handling of historical data, for late-joiners, from the Writer to the Persistence Service. Any new published data should come from the Writer directly. Reliable data (and Durable data is by its nature also Reliable) can come from either the Writer or the Persistence Service.
To give some context on why the name Delegated Durability was chosen: we already support “Delegated Reliability,” where you offload the Reliability protocol management to a PS. Whenever a Reliable instance is missed due to a network collision, a full stack buffer, or some other constraint, that instance is repaired by the original DataWriter, unless you delegate that to a Service. Because the PS has a copy of the original instance, it can supply that repair instance to the downstream subscriber without assistance from the original DataWriter.
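For comparison, Delegated Reliability is enabled with a property on the original endpoints. A sketch, with the caveat that the property names below are my recollection of RTI's Persistence Service documentation and should be verified against your Connext version:

```xml
<!-- Sketch: enabling Delegated Reliability on the original endpoints.
     Property names are assumptions; confirm them in the Persistence
     Service documentation for your Connext release. -->
<datawriter_qos>
  <property>
    <value>
      <element>
        <name>dds.data_writer.reliability.delegate_reliability</name>
        <value>1</value>
      </element>
    </value>
  </property>
</datawriter_qos>
<datareader_qos>
  <property>
    <value>
      <element>
        <name>dds.data_reader.reliability.delegate_reliability</name>
        <value>1</value>
      </element>
    </value>
  </property>
</datareader_qos>
```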
The purpose is to allow a busy (highly loaded) machine to not worry about servicing the protocol related to reliability traffic (CPU cycles/network pipe) and still maintain the History Keep_All (memory) requirements of the strict-reliable configuration. Delegated Durability is then, by analogy, designed to allow the busy machine to not worry about servicing the protocol related to durability (historical data) traffic.
The goals are:
- On Reader enable, no bandwidth spike of multiple copies that are ultimately dropped as redundant
- No non-deterministic CPU hit on the user’s Reliable, Persistent DataWriter as cycles are burnt to coalesce the history and send it
- No injected, steady-state latency waiting for the redirected (DW->PS->DR*) traffic
While RTI doesn’t supply a single, out-of-the-box configuration for this (as it does for Delegated Reliability), we do provide a set of configuration values that directly reduce the number of duplicate instances sent to DataReaders. These values were designed for other features, so what we are doing here should be seen as a Great Big Side Effect, and YMMV.
In the Persistence Service, configure the PRSTDataWriters with the following added QoS values:
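A sketch of those two settings, expressed as a DataWriter QoS fragment (exact placement inside the Persistence Service configuration file, e.g. within a persistence group, varies by version):

```xml
<!-- Sketch: QoS added to the PRSTDataWriters in the Persistence Service
     configuration. Placement within the service's XML file may vary. -->
<datawriter_qos>
  <protocol>
    <!-- Do not push new samples to readers automatically; send data
         only in response to repair requests. -->
    <push_on_write>false</push_on_write>
  </protocol>
  <reliability>
    <kind>RELIABLE_RELIABILITY_QOS</kind>
    <!-- Expect application-level (virtual) acknowledgments from readers. -->
    <acknowledgment_kind>APPLICATION_AUTO_ACKNOWLEDGMENT_MODE</acknowledgment_kind>
  </reliability>
</datawriter_qos>
```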
Setting the push_on_write field to false stops the PRSTDataWriters from automatically forwarding available data to known DataReaders when the original DataWriter writes something new, which prevents “new” data from being automatically duplicated on the wire.
The acknowledgment_kind field tells the reliability protocol to expect “virtual acknowledgments.” The virtual ack is part of the Application-Level Acknowledgment feature added in 4.5f. Reliability by itself only tracks whether the receiving middleware has received each sent instance; there was no way to announce that the application had actually seen it. Virtual acks allow the Persistence Service (and the DataWriter) to track what the subscribing DataReaders have received, even if they themselves didn’t send the original instance.
In your own DataWriters and DataReaders, add the following QoS values.
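A sketch of the application-side QoS, mirroring the Persistence Service settings above. The reliability and samples_per_app_ack paths follow the Connext XML QoS schema; the availability element names and the PERSISTENCE_SERVICE role name are assumptions to be checked against the AVAILABILITY QosPolicy documentation:

```xml
<!-- Sketch: QoS for your own endpoints; values are illustrative. -->
<datawriter_qos>
  <reliability>
    <kind>RELIABLE_RELIABILITY_QOS</kind>
    <acknowledgment_kind>APPLICATION_AUTO_ACKNOWLEDGMENT_MODE</acknowledgment_kind>
  </reliability>
</datawriter_qos>
<datareader_qos>
  <reliability>
    <kind>RELIABLE_RELIABILITY_QOS</kind>
    <acknowledgment_kind>APPLICATION_AUTO_ACKNOWLEDGMENT_MODE</acknowledgment_kind>
  </reliability>
  <protocol>
    <rtps_reliable_reader>
      <!-- Send a virtual ack for every sample delivered to the application. -->
      <samples_per_app_ack>1</samples_per_app_ack>
    </rtps_reliable_reader>
  </protocol>
  <availability>
    <!-- Assumed element names: declare that a Persistence Service group
         must be matched (and virtually acknowledged) by this reader. -->
    <required_matched_endpoint_groups>
      <element>
        <role_name>PERSISTENCE_SERVICE</role_name>
        <quorum_count>1</quorum_count>
      </element>
    </required_matched_endpoint_groups>
  </availability>
</datareader_qos>
```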
The samples_per_app_ack field tells the DataReader to send a virtual acknowledgment for each instance presented to the application. There are two ways to trigger an automatic virtual ack: by sample count or by time. We’re showing the per-sample method here.
The availability fields tell the DataReader that there is a Persistence Service out there that it will need to send virtual acknowledgements to.
This minimizes the number of duplicates sent by the Persistence Service; it is not currently possible to entirely prevent them. It also increases the amount of steady-state “overhead” traffic, which may be problematic for some systems, but is often the preferred trade-off. One of the goals was reducing the overall latency between a write and final reception (either direct or repaired): the reader gets the original write directly from the Reliable Writer, or it gets its repair from either the Writer or the Persistence Service, whichever gets there first.
Questions? Please do ask in the comments below.