During the past 30 years, several major trends have shaped safety- and security-critical systems. For example, digital systems have evolved to implement and automate increasingly complex functionality. In today’s world, these systems are everywhere. From the automated flight controls of the aircraft we fly (Figure 1), to the systems that help secure our nation (such as NSA cryptography and communications monitoring), to those that protect sensitive military/business/personal information (such as battlefield command and control), these complex digital systems assist us in nearly every facet of our lives.
The functions performed by these digital systems have become increasingly software-intensive, while at the same time becoming increasingly safety- and/or security-critical. For example, a fault in an aircraft’s flight control function could lead to a catastrophic failure condition resulting in loss of human life. Similarly, a fault in a cryptographic function could lead to a breach of security resulting in a successful terrorist attack.
Due to many factors (size, weight, power, and overall cost), developers of these systems are driven to integrate disparate functions onto fewer CPUs and/or modules. In past so-called “federated” systems, each function was typically allocated to a dedicated module. In today’s so-called “integrated” systems, multiple functions of differing levels of criticality are typically allocated to the same module. For example, an aircraft’s cabin entertainment function might be hosted on the same module as its flight control function. Or a COTS network stack managing access to the Internet and nonsensitive information might be hosted on the same module as top-secret battlefield command and control functions.
As a result of these trends, it has become vital to provide brick wall partitioning between functions hosted on a single module (Figure 2). Clearly, system developers cannot allow a fault in an aircraft’s noncritical cabin entertainment function to cause a catastrophic fault in its highly critical flight control function. Similarly, one cannot allow a fault in a COTS network stack to compromise encryption of sensitive battlefield information.
Consequently, RTOSs have evolved to provide this “brick wall partitioning” and to enable these systems to reliably perform their intended functions safely and securely. On the surface, safety- and security-critical functions seem to have the same needs. However, while similar, they are distinct. Our discussion focuses on the different needs of these two system types and how RTOS technology supports them.
Safety-critical systems
In response to the trends just discussed, commercial avionics manufacturers, in conjunction with regulatory authorities such as the FAA, have developed RTCA/DO-178B guidelines for developing safety-critical software. DO-178B defines five levels of “failure conditions” to which software might contribute (catastrophic, hazardous, major, minor, and no effect) and five corresponding levels of “design assurance” (Levels A through E, with Level A corresponding to catastrophic failure conditions and Level E to no effect).
The higher the design assurance level, the more rigorous the required development and verification process, and the higher the cost: Software developed to Level A can cost 5 to 10 times more than software developed to the less-rigorous Levels D or E. Consequently, developers try to minimize the amount of software categorized at higher levels of safety criticality. And, due to the factors that are driving increasing levels of integration, they often host numerous software functions – with varying levels of criticality – on a single module.
This integration creates a special challenge: How does one prevent these different software functions – all hosted on the same CPU or module – from interfering with one another? For example, if the cabin entertainment system corrupts the flight control function’s memory containing the aircraft’s attitude (that is, orientation in space), it could lead to a “hard-over” condition of the aircraft’s ailerons, and the results could prove deadly. Process-based operating systems such as Windows and Linux are not intended for safety-critical systems, which require ironclad guarantees of noninterference and deterministic behavior.
Consequently, a new class of “partitioned RTOSs” or “p-RTOSs” has been developed in recent years. A p-RTOS guarantees brick wall partitioning of time (such as CPU bandwidth), space (such as memory), and resources (such as I/O and other physical devices), which prevents one software function from interfering with another.
To enforce time partitioning, each software function is allocated a strict budget of CPU time. Using these CPU budgets, a system timer and interrupt, and a schedule, a p-RTOS operating in privileged mode controls the order in which functions execute on the CPU and the amount of CPU time each is granted. If a function attempts to overrun its CPU budget, the timer interrupt fires, then the p-RTOS takes control of the CPU, preempts the offending function, and allows the next function in the schedule to run.
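The budget-and-preempt behavior just described can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how any particular p-RTOS is implemented: the partition names, the millisecond budgets, and the helper run_major_frame are all hypothetical, and the timer interrupt is modeled simply by capping each function's run time at its budget.

```python
from dataclasses import dataclass

@dataclass
class Window:
    partition: str   # hypothetical partition name
    budget_ms: int   # CPU time granted to the partition in this window

def run_major_frame(schedule, demand_ms):
    """Simulate one pass through a fixed, precomputed schedule.

    schedule:  ordered list of Window entries.
    demand_ms: CPU time each partition would like to use in this pass.
    Returns a list of (partition, ran_ms, preempted) tuples.
    """
    trace = []
    for window in schedule:
        wanted = demand_ms.get(window.partition, 0)
        # Model of the timer interrupt: a partition never runs longer
        # than its budget, no matter how much time it asks for.
        ran = min(wanted, window.budget_ms)
        preempted = wanted > window.budget_ms
        trace.append((window.partition, ran, preempted))
    return trace

# Assumed example values, chosen only for illustration.
schedule = [Window("flight_control", 20), Window("cabin_entertainment", 5)]

# The entertainment function misbehaves and tries to use 50 ms.
trace = run_major_frame(
    schedule, {"flight_control": 12, "cabin_entertainment": 50}
)
for name, ran, preempted in trace:
    print(f"{name}: ran {ran} ms, preempted={preempted}")
```

In this model the misbehaving function is cut off at 5 ms, and the flight control function's window is unaffected by anything its neighbor does.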
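To enforce space partitioning, each software function is allocated a strict quota of its own memory (for example, RAM and stack space). The p-RTOS uses the CPU’s Memory Management Unit (MMU) to enforce this partitioning: it maps each function’s virtual addresses only to the physical memory allocated to that function, so an access outside the quota has no valid mapping and raises a fault instead of reaching another function’s memory.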
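As a rough illustration of MMU-based space partitioning, the sketch below models each function's page table as a dictionary from virtual page to physical frame. Everything here is an assumption made for illustration: the 4 KB page size, the partition names, the frame numbers, and the helpers translate and frames_are_disjoint. A real MMU performs the translation in hardware.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only

class PartitionFault(Exception):
    """Raised when a partition touches an address outside its quota."""

def translate(page_table, virtual_addr):
    """Translate a virtual address using one partition's page table."""
    page, offset = divmod(virtual_addr, PAGE_SIZE)
    if page not in page_table:
        # No mapping exists, so the access faults instead of reaching
        # memory that belongs to another partition.
        raise PartitionFault(f"no mapping for virtual page {page}")
    return page_table[page] * PAGE_SIZE + offset

def frames_are_disjoint(page_tables):
    """Check that no physical frame is mapped into two partitions."""
    seen = set()
    for table in page_tables.values():
        frames = set(table.values())
        if seen & frames:
            return False
        seen |= frames
    return True

# Hypothetical page tables: virtual page -> physical frame.
page_tables = {
    "flight_control": {0: 10, 1: 11},
    "cabin_entertainment": {0: 20},
}

print(frames_are_disjoint(page_tables))                      # True
print(translate(page_tables["cabin_entertainment"], 100))    # 82020

try:
    # Virtual page 1 is not part of the entertainment function's quota.
    translate(page_tables["cabin_entertainment"], PAGE_SIZE)
except PartitionFault as fault:
    print("fault:", fault)
```

Because both functions use virtual page 0 but land on different physical frames, neither can name an address that resolves into the other's memory.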
Resource partitioning is achieved in a similar manner, wherein each function is granted explicit access (for example, read/write, write-only, read-only, and none) to resources (interrupts and I/O devices), and the p-RTOS enforces these access controls.
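The access controls described above amount to a table lookup with a default of "none." The following sketch shows one way such a table might look. The resource names, the Access type, and check_access are hypothetical and are not taken from any p-RTOS API.

```python
from enum import Flag

class Access(Flag):
    NONE = 0
    READ = 1
    WRITE = 2
    READ_WRITE = 3  # READ and WRITE combined

# Hypothetical access table: (partition, resource) -> granted access.
# Any pair that is not listed is treated as Access.NONE.
ACCESS_TABLE = {
    ("flight_control", "aileron_actuator"): Access.READ_WRITE,
    ("flight_control", "attitude_sensor"): Access.READ,
    ("cabin_entertainment", "seat_display"): Access.WRITE,
}

def check_access(table, partition, resource, requested):
    """Return True only if the requested access was explicitly granted."""
    granted = table.get((partition, resource), Access.NONE)
    return requested in granted

# The flight control function may drive the actuator.
print(check_access(ACCESS_TABLE, "flight_control",
                   "aileron_actuator", Access.WRITE))   # True
# The entertainment function has no entry for the actuator at all.
print(check_access(ACCESS_TABLE, "cabin_entertainment",
                   "aileron_actuator", Access.WRITE))   # False
# Read-only access does not imply write access.
print(check_access(ACCESS_TABLE, "flight_control",
                   "attitude_sensor", Access.WRITE))    # False
```

The default-deny lookup is the design point worth noting: a function gains access to a device only through an explicit entry, never by omission.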
In addition to enforcing partitioning, the p-RTOS restricts which complement of software functions may be active on a given module at any one time. Specifically, each such complement must be known ahead of time, along with its associated budgets, quotas, and access controls. Without such foreknowledge, the p-RTOS would be unable to ensure partitioning.
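One way to picture this foreknowledge is as a fixed table of configurations that is checked before anything runs. The sketch below is illustrative only: the configuration names, the 50 ms frame length, and select_configuration are assumptions, and only CPU budgets are checked (memory quotas and access controls would be validated in the same spirit).

```python
MAJOR_FRAME_MS = 50  # assumed length of one scheduling frame

# Hypothetical complements of functions, each with CPU budgets in ms.
# Only complements listed here may ever be activated.
CONFIGURATIONS = {
    "cruise": {"flight_control": 20, "cabin_entertainment": 5},
    "ground_maintenance": {"diagnostics": 30},
}

def select_configuration(name):
    """Return the budgets for a known complement, or refuse to proceed."""
    if name not in CONFIGURATIONS:
        raise ValueError(f"unknown configuration: {name!r}")
    budgets = CONFIGURATIONS[name]
    total = sum(budgets.values())
    if total > MAJOR_FRAME_MS:
        # The budgets cannot all be honored, so partitioning could not
        # be guaranteed for this complement.
        raise ValueError(f"{name!r} needs {total} ms of a {MAJOR_FRAME_MS} ms frame")
    return budgets

print(select_configuration("cruise"))

try:
    select_configuration("ad_hoc_mix")  # not defined ahead of time
except ValueError as err:
    print("rejected:", err)
```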
Security-critical systems
Like safety-critical systems, security-critical systems have evolved with increasingly complex functionality and higher degrees of criticality related to national security, critical infrastructure, and sensitive military/business/personal information.
In response to this trend, government, industry, and academia have developed the Multiple Independent Levels of Security (MILS) specification. The foundation of the MILS architecture is a Separation Kernel (SK), which is an RTOS that permits multiple software functions with different development and verification pedigrees to share common resources such as the CPU, RAM, and I/O devices, but without unwanted interference.
To evaluate such systems, ISO/IEC 15408 was created. Known as the “Common Criteria,” ISO/IEC 15408 establishes seven Evaluation Assurance Levels (EAL). EAL 7 is the most stringent and EAL 1 is the least. ISO/IEC 15408 also defines security functional and assurance requirements for security-critical systems, which lay the foundation for an SK Protection Profile (SKPP). The SKPP provides criteria for systematically assessing an SK to ensure the SK provides a level of robustness appropriate for a given EAL.
As with a p-RTOS, an SK’s primary job is to let developers establish partitions and then to enforce those partitions during runtime. In this sense, an SK and a p-RTOS are very similar. However, there are key differences:
- An SK is primarily concerned with partitioning a system’s data resources and controlling information flow between partitions, while enforcing strict adherence to data isolation, damage limitation, and information flow policies. In contrast, a p-RTOS is primarily concerned with simple enforcement of data partitioning.
- An SK operates in a much more dynamic environment in which the complement of software functions active at any time is not necessarily known ahead of time. On the other hand, a p-RTOS allows only known complements.
- An SK must protect against malicious attacks from without and within, whereas a p-RTOS is primarily concerned with preventing unintentional faults (though this focus is changing).
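The information flow control mentioned in the first difference can be pictured as a default-deny list of permitted one-way channels between partitions. The sketch below is a simplified illustration, not the MILS or SKPP policy model: the partition names, ALLOWED_FLOWS, may_send, and reachable are all hypothetical.

```python
# Hypothetical policy: each pair is a permitted one-way flow.
# Any flow that is not listed is denied.
ALLOWED_FLOWS = {
    ("network_stack", "crypto"),
    ("crypto", "command_and_control"),
}

def may_send(src, dst, flows=ALLOWED_FLOWS):
    """Return True only if the policy explicitly permits src -> dst."""
    return (src, dst) in flows

def reachable(src, flows=ALLOWED_FLOWS):
    """Return every partition that data from src could eventually reach."""
    seen = set()
    frontier = [src]
    while frontier:
        node = frontier.pop()
        for a, b in flows:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

# The network stack may hand data to the crypto partition...
print(may_send("network_stack", "crypto"))                 # True
# ...but it has no direct channel to command and control.
print(may_send("network_stack", "command_and_control"))    # False
# Nothing flows out of command and control under this policy.
print(reachable("command_and_control"))                    # set()
```

Computing reachability over the whole policy, rather than checking single hops, is a simple way to reason about damage limitation: it shows how far a compromised partition's data could spread.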
While a p-RTOS such as LynxOS-178 would meet a restricted interpretation of the SKPP, thereby supporting systems with EAL 1 through 4+ (medium assurance), it would not meet the requirements for EAL 5 through 7 (high assurance). The primary reason is that meeting the highest EAL criteria requires formal methods, wherein the SK developer must mathematically prove the security properties of the SK.
While formal methods have improved significantly from their early days, applying them quickly becomes intractable on software of even modest size or complexity (a few thousand lines, at most). Consequently, SKs such as LynxSecure, which are intended for the most security-critical applications, have fundamentally different architectures from p-RTOSs: Specifically, the SK core must be small and simple enough to be amenable to formal methods.
Adding hypervisor and software virtualization technology to the SK yields a further benefit: guest operating systems can run within a secure partition, so unsecured operating systems and applications can run in the same system as secure applications.
Similar, but different
Safety- and security-critical systems have similar but distinct needs, and RTOSs have evolved along correspondingly different paths to meet them. Both system types require strong partitioning to prevent unintended interactions. Proving an SK’s ability to enforce security-critical partitioning requires formal methods, which, in turn, drives dramatic architectural differences from its p-RTOS (safety-critical) cousins.
LynuxWorks 408-979-4630 www.lynuxworks.com