An introduction to building collaborative finite-state machines with DDS 5

If you are a software developer, you have probably used or are familiar with the concept of finite-state machines (FSMs), or state machines for short. Wikipedia has an article about it that explains the basics. The state machine can be a powerful technique to model the behavior of a software component. However, extending that concept to a distributed environment, with multiple components participating in the machine simultaneously, is not trivial. This article is the first in a series that describes how DDS can help you building collaborative FSMs in a fault-tolerant and scalable way.

The guiding example

The state diagram below shows the simple state machine that will be used to illustrate the techniques explained in this article. It represents the possible states of a task and the possible transitions between those states. Note that at this stage, the task is an abstract concept, representing any possible task that can be executed in the system.


The simple state machine for the lifecycle of a task

As the above diagram illustrates, after he task is created, it is announced to the world. Once the announced task has been executed, its state transitions to the Completed state. When the task no longer needs to exist, it will be revoked. The task will remain available for a while in the Revoked state, so others can observe the transition. Finally the task will be destroyed. The task can also be revoked when it is still in the Announced state, indicating that execution of the task is no longer needed.

If this example state diagram is implemented in a distributed fashion, there will be multiple components collaborating to achieve the different state transitions. Our example will include two participants in the state machine: a Master component and a Worker component. Their roles are indicated in the figure below:


Collaborative task state machine with two participants: a Master and a Worker

The Master component announces one or more tasks, which are completed by the Worker component. There is no restriction on what the task is actually doing. In real life, the Master could, for example, request the calculation of a CPU-intensive problem, and the Worker could be the component executing that calculation on a different machine.

Our concrete example will concern the problem of calculating whether or not an integer is a prime number. The task announcement will include the potential prime to be analyzed. The task is completed when the worker has determined whether or not that potential prime is a prime. Not displayed in the diagram are the parameters sent along with each transition; they are (in pseudo-code):

task.announce(unsigned int potentialPrime)
task.complete(bool isPrime)
task.revoke()

Note that with the task state machine example, a single state machine will be active for each outstanding task.

Introducing the building blocks

The collaborative state machine can be implemented using multiple components participating in the DDS backbone. In the spirit of DDS, these components will be decoupled and run in an autonomous fashion. This section identifies each component.

State Machine Manager

The  application managing the distributed state machine is a separate component. Its single responsibility is to manage and publish the current state of the FSM; one state per outstanding task. There are several reasons to isolate the state management in a separate component. Most prominently, it brings a good separation of concerns. None of the other participants in the state machine need to know anything about the overall state diagram; they just need to know what transitions they can initiate. In turn, the Manager component only needs to know about the possible states and their transitions, but nothing about why these transitions get initiated.

Participants

The components capable of initiating state transitions are called Participants. Our example has two: the Master Participant and the Worker Participant. These Participants are also capable of observing the state transitions managed by the State Machine Manager. As seen in the state diagram illustration above, the Worker can only initiate the “complete” transition, whereas the Master can initiate all transitions except the “complete” transition. In order to implement the collaborative state machine, it is important to identify the different Participants and their associated transitions in upfront.

Observers

In many cases, it makes sense to support the concept of an Observer. Observers cannot participate in the state machine, but only observe the current state. They can be useful for monitoring purposes. A future post will explain a more advanced, generic Master-Worker pattern in which an Observer manages a pool of Workers without knowing anything about the nature of their tasks.

The corresponding DDS datamodel and interactions

We are now ready to define the interactions of the different building blocks with the DDS backbone. The figure below illustrates the situation:


The components of our state machine and how they interact with the databus

The blocks in the middle of the above figure are DDS Topics. If you are familiar with DDS, you know that components do not interact with each other directly; they only interace with the DDS backbone. In the above figure, arrows pointing towards a Topic in the backbone indicate that the component updates (publishes) that Topic. Arrows pointing away from a Topic indicate that the component observes updates of (subscribes to) that Topic. Each of the Topics is defined in more detail in the next sections. Note that TaskState and FullTaskState have an inheritance relationship and are displayed as if polymorphism was supported natively by DDS. It is not, but it can be simulated. For the sake of simplicity, the figure shows the conceptual inheritance mechanisms without details about the underlying DDS Topic design mechanism.

State Topics: TaskState and FullTaskState

At the top level, the current state of the state machine is published via the TaskState Topic. Components subscribing to this Topic are only interested in the actual state of the task. They do not know anything about the internal state of the task that is published via the FullTaskState Topic. Components that are actively participating in the state machine need be aware of the internal state of the task in order to initiate meaningful transitions — just like an object implementing the state machine on a single machine would only have to maintain an internal state. For that purpose, Participants have to subscribe to the FullTaskState. The figure below shows what the corresponding types would look like for the example state diagram, where the task is still the prime-number analysis example:


The TaskState and its derived FullTaskState type definitions

TaskState contains the bare-bones state of the task in the state machine. The Observers interested in this state do not need to know anything about the internals of the tasks or the semantics of the task — they just observe the tasks changing state. The TaskState Topic can be considered the Task base class Topic. It is independent of the actual type of task that the state machine is used with.

The taskId attribute is the key value that uniquely identifies the task. The seqNr attribute is an index that indicates how many updates have taken place. The state attribute is an enum that indicates state the machine’s current state.

The FullTaskState contains the complete internal state of the task. This is a type that will be different if the state machine is used with a different kind of task. As you will know by now. For this particular example, the potentialPrime field contains the number that needs to be tested for being a prime. The isPrime field either contains the result True or False, or Unknown if the task has not yet completed.

Conceptually, FullTaskState is derived from TaskState. Since standard DDS does not support inheritance between Topics, this is modeled by means of the common attributes taskId and seqNr. Components that are only interested in the base class will subscribe to the TaskState Topic. Components interested in the derived FullTaskState class will subscribe to both TaskState and FullTaskState and will have to combine the two for the full information.

Transition Topics: TaskAnnounce, TaskComplete and TaskRevoke

In order to initiate state transitions, Participants of the state machine publish transition Topics. TaskAnnounce is used to announce the details of the task to be executed and will bring the task to the Announced state. TaskComplete contains information about a completed task and will bring the task to the Completed state. TaskRevoke only contains the ID of a task that needs to transition to the Revoked state. The type definitions for these Topics are seen in the figure below:


Type definitions of Topics that communicate state transitions

These types essentially are the parameters to each state transition as indicated above, in the explanation of the state machine figure. Each type has the same key attribute, taskId, that refers to the task that needs to transition to a different state.

The FullTaskState Topic is built from the information in these transition Topics (TaskAnnounce, . The State Machine Manager is typically the only component that subscribes to these transition Topics. Whenever the State Machine Manager receives a sample of one of these Topics, it updates the current state in the machine and publishes the corresponding new TaskState and FullTaskState Topic samples. Of course, since this is DDS, you are free to subscribe to those Topics (or any of our Topics for that matter) with your own component as well — they will be automatically discovered by the backbone. This allows you to peek at what is going on in the data-space, without influencing the state machine’s behavior.

Using Transition Topics

At first glance, using these Transition Topics does not look like a natural choice. Why not have the Master and Worker components directly create, update and delete the Task when needed? DDS allows for this kind of shared object management, even with multiple components managing the same object, so why not take advantage of that feature?

The situation is not as simple as it seems. As soon as multiple Participants start updating the same Task object, required knowledge about the state machine spreads through the system. Any Participant that directly updates a Task to transition from one state to another also needs to be aware of its next possible state(s). This distributed knowledge is not desired, because it increases the maintenance burden. If a new state is introduced, multiple Participants will likely need to know.

Another problem arises with regard to consistency of the state machine. What if one component decides that one state transition should take place, while another component simultaneously decides that another transition should happen? The result is hard to predict, it could be a matter of timing. In fact, different machines could receive the updates in different order, leading to an inconsistent state among them. By making one Participant responsible for the actual state management of the FSM, these consistency problems are avoided.

In a sense, the approach of using Transition Topics resembles a class design in which the internal state and its management are encapsulated in a thread-safe way. Invoking a method on the local object is similar to sending a Transition Topic sample in the distributed system. Its internal state can be viewed by any Participant that subscribes to the TaskState and FullTaskState Topics. Consequently, the advantages of encapsulation (over exposing internal Task state access to all Participants) hold for the Transition Topic approach as well.

Finally, another advantage of having one participant responsible for the Task state, is the fault tolerance achieved for the complete state machine. In a future post, some associated DDS Quality of Service settings will be explored to illustrate this.

5 comments

  1. Nice article!

    I am somewhat new to DDS but I read about its “data-centric” features which seem to provide a robust way to manage the lifecycle of data-objects. That is to say whether an object is created, updated, or deleted, without the additional need for new types and Topics.

    Given this I am wondering why you chose to implement the TaskAnnounce, TaskComplete, and TaskRevoke as separate data-types and topics instead of relying on the built-in support for ‘disposing’ or ‘unregistering’ the data-instance which would seem to do the same thing…

    Joann

    Like

  2. Pingback: Achieving fault tolerance by replicating DataWriters « Blogs from RTI

  3. Very interesting article. I do have a question about the design.

    Why keep a separate topic for each transition? In some sense the transitions are ‘commands’ being sent to the state machine manager. It seems simpler to consolidate those commands into a single topic.

    Like

    • Hello David, glad you liked it.
      Consolidating the transition Topics into a single Topic is an option. Keep in mind, though, that each of the state transitions is accompanied by a different set of arguments. The Topic used should be able to capture those different sets of arguments.
      One possibility is to store them in an XML-string, leading to a single Topic that can be used for any possible command. This would require the applications to parse those strings, which is not ideal. Additionally, this approach has the disadvantage that it does not result in a strongly typed set of commands.
      Another option is to use a union-based type where each branch contains a different (structured) argument type. This has the downside that it hinders evolution of the state diagram. Anytime the diagram is extended with new commands, a branch is added to the union, our Topic structure changes and all participating components need to be re-compiled.
      Using one Topic per transition avoids these problems. As long as the number of transitions does not become really large, it seems the best approach in most cases.
      It might be worth dedicating a separate post to this, to give you a more complete and less pragmatic answer… stay tuned.

      Like

Submit a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s