Integrating High Availability and application management for the warfighter

For designers and developers of any embedded, distributed computing system, building and maintaining a cost-effective and extensible application and management infrastructure is a major challenge. This challenge is even more acute when the system is mission-, life-, or safety-critical and carries a High Availability (HA) requirement – meaning 99.999 percent uptime. In the military environment, these kinds of systems are typically command and control systems, weapon systems, and others that require near-real-time or real-time performance.
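To put the 99.999 percent figure in concrete terms, the short calculation below converts an availability target into a yearly downtime budget. It is a back-of-the-envelope sketch that assumes a 365-day year and counts all downtime, planned or unplanned; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.2f} minutes of downtime per year")
```

At five nines the budget works out to a little over five minutes per year, which leaves almost no room for manual recovery.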

Meeting this twin challenge, an efficient application and management infrastructure that can also satisfy HA requirements, is important when building new mission-critical defense applications. It is also a major factor for designers focused on integrating different types of applications (new, legacy, and third party) into a common infrastructure. In either case, a unified and consistent infrastructure with a well-rounded set of services is crucial to meeting the High Availability challenge.

This consistent infrastructure can be broken down into three categories:

Application Management and High Availability – The core capabilities for health monitoring and fault recovery.
Management Infrastructure Services – Consistent configuration modeling, administrative support, alarms/notifications, and logging. To fully address the needs of a mission-critical system, these services should frame the solution for both the middleware infrastructure and the applications.
Application Infrastructure Services – Messaging, event distribution, and access control services.

Marrying High Availability software technology with these sophisticated application and management infrastructure capabilities is at the heart of the solution to this multipronged challenge. The Service Availability Forum (SA Forum) specifications were created to satisfy the HA requirement, and SA Forum-based technologies have been successfully deployed in aerospace and defense (A&D) and telecommunications systems. Figure 1 provides a high-level overview of these SA Forum-defined categories within the context of a simple distributed system of the kind common in mission-critical defense applications.

**Figure 1:** An overview of application and infrastructure management married with high availability

An integrated High Availability and application management framework can address the aforementioned challenges.

Application Management and HA – The core challenge

Not all projects require High Availability at the outset, but application and management infrastructure is common to all embedded, distributed systems. Because High Availability may become a requirement later, whether through changes in mission, hosting of new applications, or broader integration and dependencies within the Global Information Grid, system designers need the key architectural pieces in place so that the migration is straightforward.

Accordingly, Application Management and High Availability refers to the general ability to model and manage a set of processes, their life cycles, dependencies, and composite state that together make up a mission-critical system. This is best accomplished as a configuration rather than an implementation exercise. This kind of modeling and process management is absolutely essential to meeting the 99.999 percent uptime requirement that warfighters need in their systems.

Today’s aerospace and defense systems are often a diverse collection of third party, legacy, and new applications, with engineers focused as much on integration as on traditional application development. The power of SA Forum-based Application Management services is that they bring engineering discipline to this otherwise potentially ad hoc integration effort. These services impose modeling and architectural consistency that controls application behavior – without requiring modifications to code.

As shown in Figure 2, Application Management and High Availability services provide a consistent method of managing a diverse set of mission-critical applications. Application startup and runtime dependencies, state representation, administrative control, and health monitoring are all key to the solution.

At the heart of the solution is an application modeling framework that allows a system designer to craft an XML description of processes, called components, and their relationships and dependencies that make up a system. The application management model consists of a set of objects assembled by the designer to reflect the planned deployment, and it is used by the middleware to instantiate such a system at runtime.
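As a rough illustration of how such a model can drive instantiation, the sketch below parses a small XML description of components and derives a startup order that honors their dependencies. The element and attribute names (`system`, `component`, `dependsOn`) and the component names are hypothetical and do not follow the actual SA Forum schema; the ordering uses a standard topological sort from the Python standard library (Python 3.9 or later).

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical model: names are illustrative, not the SA Forum schema.
MODEL_XML = """
<system>
  <component name="database"/>
  <component name="messaging"/>
  <component name="tracker">
    <dependsOn>database</dependsOn>
    <dependsOn>messaging</dependsOn>
  </component>
  <component name="display">
    <dependsOn>tracker</dependsOn>
  </component>
</system>
"""


def startup_order(model_xml: str) -> list[str]:
    """Return component names ordered so each starts after its dependencies."""
    root = ET.fromstring(model_xml)
    sorter = TopologicalSorter()
    for component in root.findall("component"):
        dependencies = [dep.text.strip() for dep in component.findall("dependsOn")]
        sorter.add(component.get("name"), *dependencies)
    # static_order() raises graphlib.CycleError if the model has a circular dependency
    return list(sorter.static_order())


print(startup_order(MODEL_XML))
```

Because the order is computed from the model rather than coded into each application, changing a dependency is a configuration edit, not a software change.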

Key ingredients associated with Application Management and High Availability include:

Health monitoring policies – These serve as a means to monitor and understand component health with associated recovery policies (for example, application restart or failover). This does not require code changes for legacy or third party applications – crucial in today’s long defense program life cycles. Many defense systems are black boxes that do not allow for open access to the code, but faults must be detected and isolated to these components to maintain overall system availability.
Availability Management Framework – Provides a framework to coordinate all the redundant resources in a distributed environment with no single point of failure. This is the heart of addressing the HA requirement that is at the foundation of many defense systems.
Life-cycle policies – How, when, and where to instantiate and terminate components. This is required to honor dependencies and manage fault conditions.
Runtime state – Manages the presence, readiness, and operational state of a component.
Administrative control – An administrator can, on demand, instantiate, engage, disengage, or terminate a collection of components that makes up a service or an entire node in any combination to support maintenance scenarios.
Service location policies – A service is defined as a collection of one or more components. Service location policies specify which nodes a particular service can run on and rank them so that if a node fails, alternate nodes are already identified.
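The service location idea in the last item can be sketched in a few lines. The example below is a minimal illustration under assumed names: the policy table, the service names, and the node names are all hypothetical, and real middleware would track node health itself rather than receive it as an argument.

```python
from typing import Optional

# Hypothetical policy table: each service lists its eligible nodes, best first.
SERVICE_LOCATION_POLICY = {
    "fire-control": ["node-a", "node-b", "node-c"],
    "track-fusion": ["node-b", "node-c"],
}


def select_node(service: str, healthy_nodes: set) -> Optional[str]:
    """Return the highest-ranked healthy node for a service, or None if none remain."""
    for node in SERVICE_LOCATION_POLICY[service]:
        if node in healthy_nodes:
            return node
    return None


# Normal operation: the preferred node hosts the service.
print(select_node("fire-control", {"node-a", "node-b", "node-c"}))
# After node-a fails, the next-ranked node is already known.
print(select_node("fire-control", {"node-b", "node-c"}))
```

Because the ranking is decided at design time, no decision has to be improvised when a node is lost.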

In sum, Application Management and HA, together with the associated system modeling, give the system designer a flexible framework to express all dependencies. Because the Application Management and HA services own the life cycle of these components, they can instantiate and terminate processes as needed to suit startup, shutdown, and runtime circumstances. The system model also addresses the basic fault conditions and recovery actions that are mandatory in mission-critical defense applications. Furthermore, because all runtime actions are driven by configuration policies, the middleware behavior is highly deterministic. This determinism is critically important to the test and evaluation cycle of systems today and ensures that the system can be approved for production as early as possible – and deployed for use by our warfighters.
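One way to picture policy-driven, deterministic recovery is a rule that maps a fault history to an action. The sketch below assumes a simple escalation policy (restart a component a fixed number of times, then fail over); the class, field, and action names are hypothetical and are not taken from the SA Forum specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    """Hypothetical recovery policy read from configuration."""
    max_restarts: int  # restarts attempted before escalating to failover


def recovery_action(policy: RecoveryPolicy, previous_failures: int) -> str:
    """Return the action for the next fault, given how many faults came before it."""
    if previous_failures < policy.max_restarts:
        return "restart"
    return "failover"


policy = RecoveryPolicy(max_restarts=2)
# The same policy and fault history always produce the same sequence of actions.
print([recovery_action(policy, n) for n in range(4)])
```

Since the outcome depends only on the configured policy and the observed faults, a test team can enumerate the fault scenarios and verify each response in advance.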

Management Infrastructure Services – Simplifying maintenance and operations

Management Infrastructure Services refers to a set of services that provides a consistent approach to representing and implementing configuration, runtime state, alarm, notification, and log information, so that a deployed system can be managed either automatically by the middleware or by an operator. In a real-time-sensitive scenario such as a weapons system, these capabilities are integral to the middleware itself, ensuring operation at the 99.999 percent availability required to maintain warfighter support.

A key point about these services is that, ideally, the rest of the system will use the same services to expose its own instrumentation, log, and event information. Adopting these services in the applications and system infrastructure simplifies the integration effort, strengthens architectural consistency, and yields a more reliable design.

Configuration, notification, and log services are fully integrated with the other aspects of the middleware infrastructure. Adapters can be written to connect the Management Infrastructure Service APIs to external management entities such as a CLI or SNMP manager. With distance support becoming a significant requirement in defense systems today, remote access to the management infrastructure via SNMP, Web, and so forth is important. Notification and log services are useful for online and offline diagnostics, so that an operator can take corrective action and return a crucial defense system to health as rapidly as possible.

Key concepts and capabilities associated with SA Forum-based Management Infrastructure Services include:

An XML-based means to define new application objects and attributes, together with an Object Management API through which a client invokes requests. The API provides exclusive object access as well as transaction semantics when multiple objects must be changed in a single operation.
Successful configuration changes are persisted by the infrastructure, which can use either the default configuration or the last-known configuration at boot time.
A Notification API allows applications to generate alarms as well as state change, attribute change, and other notifications. Events are automatically logged on a log stream.
A Log API allows applications to define their own log streams or to join an existing stream, whether predefined or defined by the system designer.
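The transaction semantics mentioned in the first item can be illustrated with a toy in-memory object store. This is a sketch only: the class and method names are hypothetical rather than the SA Forum Object Management API, and persistence and exclusive access are not modeled. Changes are staged on a copy and committed only if every one of them is valid.

```python
import copy


class ConfigStore:
    """Toy configuration store with all-or-nothing updates (hypothetical API)."""

    def __init__(self, objects: dict):
        self._objects = copy.deepcopy(objects)

    def get(self, name: str, attribute: str):
        return self._objects[name][attribute]

    def apply_transaction(self, changes: list) -> None:
        """Apply (object, attribute, value) changes; commit only if all succeed."""
        staged = copy.deepcopy(self._objects)
        for name, attribute, value in changes:
            if name not in staged or attribute not in staged[name]:
                raise KeyError(f"unknown attribute {name}.{attribute}")
            staged[name][attribute] = value
        self._objects = staged  # single commit point


store = ConfigStore({"radar": {"mode": "standby"}, "datalink": {"rate_hz": 10}})
store.apply_transaction([("radar", "mode", "search"), ("datalink", "rate_hz", 50)])

try:
    # The second change is invalid, so the first must not take effect either.
    store.apply_transaction([("radar", "mode", "track"), ("datalink", "bogus", 1)])
except KeyError:
    pass

print(store.get("radar", "mode"))  # still "search"
```

Staging on a copy keeps the live configuration consistent at every instant, which matters when several related objects must change together.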

In sum, these Management Infrastructure Services not only frame the solution for the middleware itself, but can also be used across the full range of applications in a distributed, mission-critical defense system.

Application Infrastructure Services – Providing critical communication

Application Infrastructure Services refers to a set of basic communications and coordination capabilities. They are ubiquitous and universally relevant services used among cooperating, collaborating embedded distributed applications: intra-cluster messaging services, a checkpoint service to inform standby processes of the active state, and a distributed lock service to manage access to critical resources. This particular set of capabilities is crucial to addressing the 99.999 percent High Availability challenge in the highly diverse and distributed systems that are becoming more common in today’s networked battlefield.

The core of Application Infrastructure Services is a set of SA Forum-based services that work together to provide foundational capabilities. An interprocess messaging service supports point-to-point, point-to-multipoint, and publish-subscribe (many-to-many) messaging between distributed processes within the scope of a cluster. A checkpoint service allows a process in the active state to push state information to its assigned standby process, keeping the standby hot so that failovers are stateful. Finally, a critical-resource access-control service lets competing applications access shared resources methodically, one writer at a time.
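The checkpoint pattern can be sketched with an in-memory stand-in for the replicated checkpoint. Everything here is an assumption made for illustration: the `Checkpoint` and `TrackCounter` classes are hypothetical, and a real checkpoint service would replicate the data across nodes rather than share one Python object.

```python
class Checkpoint:
    """In-memory stand-in for a replicated checkpoint (hypothetical)."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key, default=None):
        return self._data.get(key, default)


class TrackCounter:
    """Toy stateful application that runs as either active or standby."""

    def __init__(self, checkpoint: Checkpoint, role: str):
        self.checkpoint = checkpoint
        self.role = role
        self.count = 0

    def process(self) -> None:
        if self.role != "active":
            raise RuntimeError("only the active instance processes work")
        self.count += 1
        self.checkpoint.write("count", self.count)  # push state after each update

    def take_over(self) -> None:
        """Promote a standby: restore the last checkpointed state, then go active."""
        self.count = self.checkpoint.read("count", 0)
        self.role = "active"


shared = Checkpoint()
active = TrackCounter(shared, "active")
standby = TrackCounter(shared, "standby")

for _ in range(3):
    active.process()

standby.take_over()  # simulated failure of the active instance
standby.process()
print(standby.count)  # continues from 3, so this prints 4
```

The standby resumes from the last checkpointed value instead of restarting from zero, which is what makes the failover stateful.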

Mission assurance is improved by this robust application infrastructure. These communication and coordination functions are increasingly important, not only to critical applications, but to managing workflow that could be jeopardized by faults throughout the execution process. The combination of these application infrastructure services with the management infrastructure services described earlier creates a powerful suite for designers of mission-critical systems.

Framework delivers HA to mission-critical apps

All the SA Forum services described in this framework concept share a common core of architectural features that will deliver benefits across an entire design. Most importantly, the risks of application downtime and critical problems are dramatically reduced for mission-critical defense applications. Furthermore, these must-have High Availability capabilities are cost prohibitive to build, test, and maintain internally, whereas a standards-based COTS solution preserves scarce resources.

The specific framework and services presented herein are included as a DoD-wide mandated standard in the DoD IT Standards Registry (DISR).

Mike Houston is Director of Marketing at GoAhead Software. He has more than 15 years of experience in the software and telecommunications industries. He can be contacted at [email protected].

GoAhead Software

425-301-5131