A probabilistic risk assessment toolchain based on dynamic fault trees and reward event trees

Probabilistic risk assessment (PRA) is essential for keeping systems such as power plants, trains, drones, medical devices, satellites, and self-driving cars safe and operational within acceptable risk levels. PRA is required by, for example, the Federal Aviation Administration (FAA) and the Nuclear Regulatory Commission (NRC), by ISO 26262 for automotive systems, and for software development in aerospace systems (by NASA and ESA).

PRA encompasses two steps: risk quantification, through static fault tree analysis (FTA), and consequence quantification, through event tree analysis (ETA).

Although widely adopted, static fault trees are a very simple formalism, equivalent to Boolean logic, and therefore cannot faithfully model fail-operational/fault-tolerant systems with redundancies, (probabilistic) functional dependencies, and temporal ordering among failing components. Similarly, traditional event trees cannot model the decision-making that may be required to mitigate the impact of, e.g., accidental events, nor can they quantify measures such as the number of casualties or monetary loss.
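
To make the "equivalent to Boolean logic" point concrete, here is a minimal sketch of how a static fault tree is evaluated. It assumes independent basic events that each appear only once in the tree; the system (two pumps and a valve) and all probabilities are hypothetical and are not taken from SAFEST.

```python
from math import prod

def p_and(probs):
    """AND gate: fails only if every (independent) input fails."""
    return prod(probs)

def p_or(probs):
    """OR gate: fails if at least one (independent) input fails."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical system: two redundant pumps (AND) combined with one valve (OR).
p_pump_a, p_pump_b, p_valve = 0.1, 0.1, 0.02
p_top = p_or([p_and([p_pump_a, p_pump_b]), p_valve])
print(f"P(top event) = {p_top:.4f}")
```

Nothing in this calculation records when a component fails or whether a spare was already running, which is why failure ordering and spare activation cannot be expressed at this level.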

In collaboration with the University of Twente and RWTH Aachen University, we developed SAFEST, a fully automatic, scalable, state-of-the-art tool that can faithfully assess the risk of fail-operational/fault-tolerant systems. It goes beyond the capabilities of existing commercial tools by offering more flexible modeling and analyses. It has two modules:

Dynamic Fault Trees (DFTs) Analysis

SAFEST is the most complete, effective, and scalable tool for analyzing static and dynamic fault trees. Its static fault tree analysis is competitive with existing tools. It supports more reliability measures and handles larger fault trees than was previously possible. It can model systems that have:

  • Redundant components
  • (Probabilistic) Functional dependencies
  • Temporal ordering of failures

Reward Event Trees (RETs) Analysis

In SAFEST, we extend classical event trees with non-deterministic choices (decision-making) at states and allow state rewards/losses to be added. Moreover, DFTs can be embedded into RETs to provide the RETs' transition probabilities. By analyzing RETs, one can determine:

  • Expected gains or losses, such as radioactive leakage, fatalities, etc.
  • Bounds on the frequencies and probabilities of the outcomes in event trees
  • The best choices, for example to reduce unfavorable effects/outcomes
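
The idea of combining decisions, branch probabilities, and losses can be sketched as follows. This is only an illustration and not SAFEST's input format or algorithm: the node layout, the action names, and every number are hypothetical. In a real model the branch probabilities could come from embedded DFTs; here they are hard-coded.

```python
def expected_loss(node):
    """Minimal expected loss reachable from `node` in a small reward event tree."""
    kind = node["kind"]
    if kind == "outcome":
        return node["loss"]
    if kind == "chance":
        # Probabilistic branching: weight each child by its branch probability.
        return sum(p * expected_loss(child) for p, child in node["branches"])
    if kind == "decision":
        # Non-deterministic choice: assume the decision-maker minimises the loss.
        return min(expected_loss(child) for child in node["options"].values())
    raise ValueError(f"unknown node kind: {kind}")

def best_option(decision_node):
    """Name of the option with the smallest expected loss."""
    options = decision_node["options"]
    return min(options, key=lambda name: expected_loss(options[name]))

# Hypothetical tree: after an accidental event, shut down or keep operating.
tree = {
    "kind": "decision",
    "options": {
        "shutdown": {"kind": "chance", "branches": [
            (0.99, {"kind": "outcome", "loss": 10.0}),
            (0.01, {"kind": "outcome", "loss": 200.0}),
        ]},
        "continue": {"kind": "chance", "branches": [
            (0.90, {"kind": "outcome", "loss": 0.0}),
            (0.10, {"kind": "outcome", "loss": 500.0}),
        ]},
    },
}

print(best_option(tree), expected_loss(tree))
```

With these made-up numbers, shutting down has the lower expected loss, so it is the choice the sketch reports.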

Why do we need dynamic fault trees?

Because of their limited expressiveness, static fault trees (SFTs) cannot model the dynamic behaviour of systems in which:

  • Components are redundant (cold, warm, or hot spares)
  • Component functionality is dependent
    • two sources generating energy in parallel (hot spare)
  • Component failures have a temporal ordering
    • component A fails first, followed by components B and C
  • Component relationships, responsibilities, or priorities change over time
    • a switch automatically changes the energy source, e.g. from the main grid to generators
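
A small simulation shows why the kind of redundancy matters. It is a sketch under simplifying assumptions: two identical units with exponentially distributed lifetimes, a perfect switch, and a cold spare that cannot fail while dormant. The function name and the rate are hypothetical.

```python
import random

def simulate_mttf(n_runs, rate, cold, seed=0):
    """Monte Carlo estimate of the mean time to failure of a primary plus one spare."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        primary = rng.expovariate(rate)
        spare = rng.expovariate(rate)
        if cold:
            # Cold spare: it only starts ageing once the primary has failed.
            total += primary + spare
        else:
            # Hot spare: both units age from time 0; the system fails with the last one.
            total += max(primary, spare)
    return total / n_runs

hot = simulate_mttf(100_000, rate=1.0, cold=False)
cold = simulate_mttf(100_000, rate=1.0, cold=True)
print(f"hot spare MTTF ~ {hot:.2f}, cold spare MTTF ~ {cold:.2f}")
```

For a failure rate of 1, the exact values are 1.5 (hot) and 2.0 (cold). A static AND gate over the two units corresponds to the hot-spare case only, so the cold-spare behaviour needs a dynamic construct.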

    ISO 26262:2011 demands rigorous risk assessment in the automotive industry:

    • “metrics are verifiable and precise enough to differentiate between different system architectures”
    • “[for systems where the] concept is based on redundant safety mechanisms, multiple-point failures of a higher order than two are considered in the analysis”

    The rapidly increasing use of AI components in modern systems necessitates rigorous risk assessment:

    • Neural networks typically come with weak statistical guarantees only (if at all). Small perturbations of their inputs may lead to misclassifications that can be catastrophic in safety-critical applications
    • The use of AI components also implies the need for reliability metrics that go beyond the standard metrics of reliability and availability. In particular, there is a need to specify various levels of degradation (given the uncertainty of the AI components) and to be able to analyse questions such as “what is the probability that the system will come to a halt without going through degradation level A first?”.
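
A query of this kind can be sketched on a small discrete-time Markov chain. The states, the transition probabilities, and the helper `prob_reach_avoiding` are hypothetical and are not part of SAFEST; the sketch only illustrates how "reach halt while avoiding level A" becomes a reachability computation.

```python
def prob_reach_avoiding(transitions, target, avoid, start, sweeps=1000):
    """Probability of reaching `target` from `start` without visiting `avoid` first."""
    value = {state: 0.0 for state in transitions}
    value[target] = 1.0  # already halted
    # The avoided state keeps value 0: passing through it does not count.
    for _ in range(sweeps):
        for state, successors in transitions.items():
            if state in (target, avoid):
                continue
            value[state] = sum(p * value[nxt] for nxt, p in successors.items())
    return value[start]

# Hypothetical degradation model with made-up probabilities.
chain = {
    "ok":         {"degraded_A": 0.6, "degraded_B": 0.3, "halt": 0.1},
    "degraded_B": {"ok": 0.5, "halt": 0.5},
    "degraded_A": {"halt": 1.0},
    "halt":       {"halt": 1.0},
}

p = prob_reach_avoiding(chain, target="halt", avoid="degraded_A", start="ok")
print(f"P(halt without passing degradation level A) = {p:.4f}")
```

The repeated sweeps converge to the solution of the linear equations x_ok = 0.1 + 0.3 x_B and x_B = 0.5 x_ok + 0.5, which gives x_ok = 0.25 / 0.85.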

    FTA focuses on computing various dependability metrics, i.e. key performance indicators that measure how well a system performs. Standard metrics of a system are:

    • Reliability: the probability that no failure occurs up to time T
    • Conditional reliability: the probability that no failure occurs up to time T, given that a component has already failed
    • Availability: the long-run fraction of time during which the system is operational
    • Mean time to failure (MTTF): the expected time until the system fails
    • Criticality of components: the extent to which a component failure contributes to a system failure

    Our tool also handles various extensions that include the cost and impact of failures.
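
For intuition, the standard metrics can be computed by hand in the simplest case. The sketch assumes independent components with constant (exponential) failure rates and two identical hot-redundant units; the rates, mission time, and repair time are hypothetical.

```python
from math import exp

def reliability(rate, t):
    """P(no failure up to time t) for a constant failure rate."""
    return exp(-rate * t)

def mean_time_to_failure(rate):
    """Expected time until the first failure of one exponential component."""
    return 1.0 / rate

def steady_state_availability(mean_up, mean_repair):
    """Long-run fraction of time a repairable component is operational."""
    return mean_up / (mean_up + mean_repair)

def parallel_reliability(rate, t):
    """Two independent, identical hot-redundant units: fails only if both fail."""
    return 1.0 - (1.0 - reliability(rate, t)) ** 2

# Hypothetical numbers, chosen only for illustration.
RATE = 1e-3       # failures per hour
MISSION = 1000.0  # mission time T in hours
REPAIR = 10.0     # mean repair time in hours

r_single = reliability(RATE, MISSION)
r_system = parallel_reliability(RATE, MISSION)
# If one of the two units is known to have failed at time 0, the system
# survives the mission only if the remaining unit does.
r_conditional = reliability(RATE, MISSION)
mttf = mean_time_to_failure(RATE)
availability = steady_state_availability(mttf, REPAIR)

print(f"R_single={r_single:.4f}  R_system={r_system:.4f}  "
      f"R_conditional={r_conditional:.4f}")
print(f"MTTF={mttf:.0f} h  availability={availability:.4f}")
```

The gap between the system reliability and the conditional reliability gives a rough sense of how much the redundancy contributes in this toy setting.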

    Application Domains

    Industrial Partners

    • 13+ Partners
    • 15+ Projects
    • 4+ Happy Clients
    • 30+ Meetings