When software failure is not an option
23 Jan 2007
Safety-critical applications exist, for instance, in chemical manufacturing plants for operating valve and heating systems, or in nuclear power stations, where coolant pump and control rod drive systems are of paramount importance.
While each application will involve its own set of standard processes, from the development point of view the software acts as a framework that is similar across all process functions.
‘All software has bugs’ may be a truism, but in developing safety-critical software this adage cannot simply be accepted, and a variety of tools and methodologies are used to achieve the required degree of certainty that a particular safety function will be executed.
Cambridge Design Partnership (CDP) claims to have brought this often haphazard approach to a systematic conclusion for safety-critical software development. It has developed a methodology that demands the detailed and specific elimination of the risk of failure, coupled with the experience to identify potential pitfalls in deploying the system.
Using this process, programmers and software designers are able to ensure that the necessary level of integrity is maintained. In addition, businesses can be confident that effort and expense are not diverted needlessly into redundant systems.
Developing software for safety-critical systems is not so different from other kinds of software development: common tasks and good practice apply to all systems development. However, safety-critical software must also perform additional functions.
The foundations of functional safety rest on two simple questions: what safety functions have to be performed, and what degree of certainty is necessary that each safety function will be carried out?
The initial requirements therefore include performing a risk analysis to determine the safety integrity level. Typically, Fault Tree Analysis (FTA) is used to assess possible failures and their contributing factors. Subsequently, a Preliminary Hazard Analysis (PHA), most commonly a Failure Mode and Effects Analysis (FMEA) in which failures are assigned a numerical score based on the severity of a potential fault, is used to identify those items that exceed the safety threshold.
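As a hedged illustration, the scoring arithmetic of a conventional FMEA can be sketched in a few lines. The Risk Priority Number, the 1-to-10 scales and the threshold below follow the textbook convention rather than CDP's own figures, and the failure modes are invented:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (certain to be caught) to 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Risk Priority Number: the conventional FMEA score
        return self.severity * self.occurrence * self.detection

def over_threshold(modes, threshold=100):
    """Return the failure modes whose score exceeds the safety threshold."""
    return [m for m in modes if m.rpn > threshold]

modes = [
    FailureMode("coolant pump stalls", severity=9, occurrence=3, detection=4),  # RPN 108
    FailureMode("status LED flickers", severity=2, occurrence=5, detection=2),  # RPN 20
]
for m in over_threshold(modes):
    print(f"{m.name}: RPN {m.rpn} - requires mitigation")
```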
The next level of detail is a Design Hazard Analysis (DHA), in which different methods of developing the necessary software architecture are examined to determine the associated level of risk. An overall system risk assessment follows, drawing together the results of the PHA and DHA.
Within each area of application there are certain standards that govern industry practice; EN 62304:2006, for example, covers the life cycle requirements for medical device software and assigns software to one of three safety classes according to the harm a failure could cause, from no possible injury up to death or serious injury.
Devices in many process industries are governed by IEC 61508 on safety-critical systems, the generic standard for the functional safety of electronic systems. The standard details the requirements necessary to reach each safety integrity level (SIL), becoming more rigorous at each higher level to mitigate the greater risks of dangerous failure.
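To make the integrity levels concrete, the sketch below encodes the low-demand-mode target failure measures that IEC 61508 sets for SIL 1 to SIL 4, expressed as average probability of failure on demand; the classification helper itself is illustrative only:

```python
# Low-demand-mode bands from IEC 61508: average probability of
# failure on demand (PFDavg) for each safety integrity level.
SIL_BANDS = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]

def sil_for_pfd(pfd_avg: float) -> int:
    """Return the SIL a low-demand safety function achieves (0 if none)."""
    if pfd_avg < 1e-5:
        # Below the SIL 4 band is outside the standard's table;
        # treated here as SIL 4 for illustration.
        return 4
    for sil, lower, upper in SIL_BANDS:
        if lower <= pfd_avg < upper:
            return sil
    return 0  # not reliable enough to claim any SIL

print(sil_for_pfd(5e-4))  # a PFDavg of 5e-4 lands in the SIL 3 band
```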
Safety critical software development

In order to produce reliable software that satisfies the required integrity level, it is important to have a disciplined system, strict quality management processes and a well-defined, robust development process.
Dr Aidong Xu, a senior design consultant at CDP, explains: “The key item here is the disciplined process: the first thing is to know what it is supposed to do, and the second is to know you have done it.”
One method of imposing a defined methodology is the software lifecycle V-model, which is structured around a modular development cycle. The V-model requires each module to be written and thoroughly tested separately before the modules are integrated into the complete system and a further round of testing is conducted.
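A minimal sketch of the module-level testing stage the V-model calls for, assuming an invented relief-valve decision function (the pressure limit and behaviour are illustrative, not taken from a real system):

```python
import unittest

def vent_valve_should_open(pressure_kpa: float, limit_kpa: float = 350.0) -> bool:
    """Module under test: decide whether the relief valve must open."""
    if pressure_kpa < 0:
        raise ValueError("negative pressure reading - sensor fault")
    return pressure_kpa >= limit_kpa

class VentValveTests(unittest.TestCase):
    # Unit tests exercise the module in isolation, before integration.
    def test_opens_at_limit(self):
        self.assertTrue(vent_valve_should_open(350.0))

    def test_stays_closed_below_limit(self):
        self.assertFalse(vent_valve_should_open(349.9))

    def test_rejects_impossible_reading(self):
        with self.assertRaises(ValueError):
            vent_valve_should_open(-1.0)

if __name__ == "__main__":
    unittest.main()
```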
Maintaining traceability and a clear audit trail is another key aspect of developing such software. For critical applications, designers have to be definite in identifying the real cause of a fault, and to keep a clear trace so that it is certain the particular issue has been addressed.
A rigorous documentation procedure governing the change request process, both during the development stage and after release, is also essential, so that future revisions, which may come several years later when the original developers are no longer available, can be accurately managed.
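One way such a trace might be recorded, sketched here under an invented scheme (the identifiers and fields are hypothetical, not CDP's), is to link each safety requirement forward to the hazard it mitigates, the module that implements it, the test that demonstrates it, and the change requests raised against it:

```python
from dataclasses import dataclass, field

@dataclass
class TraceRecord:
    requirement_id: str          # e.g. "SR-012: pump shuts down on overtemperature"
    hazard_id: str               # FMEA/FTA item the requirement mitigates
    module: str                  # code unit implementing the requirement
    test_id: str                 # test demonstrating the requirement is met
    change_requests: list[str] = field(default_factory=list)  # audit trail

trace = TraceRecord(
    requirement_id="SR-012",
    hazard_id="HAZ-004",
    module="pump_control.c",
    test_id="TST-041",
)
trace.change_requests.append("CR-103: limit lowered after field report")
```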
A further software audit is required when selecting a suitable platform, particularly in complex processes controlled by real-time operating systems. There are many real-time operating systems available, most of which would probably perform adequately, but which might not guarantee the necessary level of integrity.
Similarly, there are many different programming languages, some of which are far more structured than others. For example, Ada, which is used in military applications, has built-in structure to help ensure code integrity. “When you have the right tools in place,” says Aidong, “they act as a safeguard; they prevent you from making a mistake.”
Fail safe

A thorough and structured design methodology, with rigorous testing and fault analysis coupled with a detailed audit trail, is vital in successfully developing robust safety-critical products. However, some applications demand a judgement call to balance project development time and effort against the end result.
“This is one area where experience really counts,” says Aidong, adding that it is not always appropriate to be heavy-handed in employing every defined process; some applications are less process-intensive and do not require such a rigorous approach.
Ultimately, safety-critical software must ensure that the safety integrity level is met, and this systematic approach enables the very real risks associated with system failure to be effectively managed. Aidong concludes: “Perhaps there is no such thing as bug-free software, but the important thing is that if things do fail, they fail in a safe way.”
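To make that closing point concrete, here is a minimal sketch of a fail-safe pattern, assuming an invented valve driver whose safe state is fully closed; on any unhandled fault, the control step drives the output to that safe position rather than leaving it undefined:

```python
import logging

class ValveDriver:
    """Hypothetical actuator interface; real hardware access would go here."""
    def set_open_fraction(self, fraction: float) -> None:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"invalid valve command: {fraction}")
        print(f"valve commanded to {fraction:.0%}")

def run_control_step(valve: ValveDriver, demanded: float) -> None:
    """Execute one control step; on any fault, fail to the safe state."""
    try:
        valve.set_open_fraction(demanded)
    except Exception:
        logging.exception("control fault - driving valve to safe state")
        valve.set_open_fraction(0.0)  # fully closed is the safe state here

run_control_step(ValveDriver(), demanded=0.6)  # normal operation
run_control_step(ValveDriver(), demanded=1.7)  # fault: falls back to closed
```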