In this class, we take a broad view of safety. An accident includes any undesirable loss—this could mean stakeholder losses like loss of life, mission failure, environmental damage, or the loss of critical protected information. This broad definition might be wider than you're used to, and the scope of safety can be customized to the types of losses your stakeholders care about. For instance, in the nuclear industry, safety might mean preventing damage to nuclear fuel rods, while in another field it could mean protecting confidential data or maintaining the integrity of a space mission.
When we use terms like hazardous or unsafe, we are referring to conditions that lead to any form of stakeholder loss—not just loss of life. It is important to establish from the outset that our concern is with a wide spectrum of potential losses, depending on the context of the system and the values of the stakeholders involved.
We face a major problem in engineering complex systems, which has been recognized for decades. The issue lies in when and where defects are introduced versus when they are detected. Approximately 70% of defects are introduced during the requirements and design phases. Only about 20% are introduced during the software coding or hardware detailed design phases. However, we tend to discover these defects much later, particularly during testing and integration phases.
This delay drives significant cost increases. Fixing a defect found during testing can cost 21 to 80 times more than fixing the same defect during the requirements phase, and some data suggests the penalty for catching a defect late can reach four orders of magnitude. These are not hypothetical figures; they are based on real industry data.
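To make those multipliers concrete, here is a quick back-of-the-envelope sketch. Only the 21x to 80x range and the four-orders-of-magnitude figure come from the data above; the $1,000 baseline is an arbitrary assumption for illustration.

```python
# Back-of-the-envelope defect-cost escalation.
# The $1,000 baseline is an arbitrary assumption; the multipliers come from the
# industry figures quoted above (21x-80x, and up to four orders of magnitude).
baseline_cost = 1_000            # assumed cost to fix during requirements

low_multiplier, high_multiplier = 21, 80
worst_case_multiplier = 10 ** 4  # "four orders of magnitude"

print(f"Fixed during requirements: ${baseline_cost:,}")
print(f"Fixed during testing:      ${baseline_cost * low_multiplier:,} to ${baseline_cost * high_multiplier:,}")
print(f"Worst case reported:       ${baseline_cost * worst_case_multiplier:,}")
```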
NASA's 2015 data showed that although only 8% of project cost is spent during the concept phase, nearly 45% of the total cost is committed in that phase because of the long-term consequences of early decisions and assumptions. In essence, early-stage decisions lock in a significant portion of future costs.
The Government Accountability Office (GAO) in the U.S. conducted a study of government-funded programs and found that the number one driver of cost and schedule overruns was inadequate systems engineering. Programs that suffered from these overruns had poor upfront investments in activities such as requirements analysis, concept analysis, and design processes.
In this course, most examples and exercises will focus on the early stages of a project—concept, requirements, and early design—because that's where we have the greatest opportunity to influence outcomes at the lowest cost.
System safety and systems engineering grew up separately. About a hundred years ago, the foundation of system safety was rooted in the idea that human error caused accidents. This perspective led to models like the Domino Model and eventually the Swiss Cheese Model (circa 1990), all of which emphasized human error as the root cause.
In the 1950s and 60s, analytical techniques such as Failure Modes and Effects Analysis (FMEA) and Fault Tree Analysis (FTA) were developed. These methods were particularly effective for mechanical and electromechanical systems of that era. Functional FMEA emerged as well, focusing not only on hardware but also on system functions.
In the 1980s, a significant shift occurred—software began to control safety-critical systems. Despite this, safety methodologies didn't evolve in parallel. Most techniques still in use today are incremental updates of methods from the 50s and 60s. While they remain useful, they are ill-equipped to handle the complexities introduced by modern software-driven systems.
Research has introduced alternatives, but few have gained wide industry adoption. One notable exception is Systems-Theoretic Process Analysis (STPA), which has been codified into standards and used widely across industries.
STPA was created to address new types of causes that older methods tend to miss. Unlike FMEA and fault trees, STPA focuses on system interactions, control structures, and process models. It helps identify unsafe control actions and emergent behaviors.
This course is structured around this progression:
The goal is not to declare STPA a silver bullet. Every method has limitations. What's critical is understanding when a method applies, when it doesn't, and what type of problem it's best suited to solve.
One problem in industry standards and training courses is that they rarely highlight method limitations. Engineers are taught how to follow the steps of a method, but not when not to use it. That's a gap this course aims to fill. We'll openly discuss the strengths and shortcomings of each approach.
A common misconception in safety is the overreliance on probability as a universal tool for risk assessment. While probability is useful in many scenarios—particularly when estimating hardware failure rates or known risks—it is inadequate for a wide range of system and human-centric errors.
For example, what is the probability that a requirement in a system is incorrect? Or that a function has been completely missed? What about the probability that the design of a control algorithm is unsafe, even if it meets all documented requirements? These are not quantifiable in the same way hardware failures are.
Industry standards often prohibit using probability to estimate software design flaws, human errors, or organizational failures, because these are systematic rather than random and cannot be meaningfully quantified from failure-rate data. Probabilistic analysis also struggles with risks rooted in flawed assumptions, architectural defects, and missing requirements, exactly the types of errors that dominate modern safety incidents.
Historically, systems have failed because flawed probabilistic reasoning gave a false sense of security. Designers calculated extremely low risks from fault trees and numeric estimates, but failed to account for interaction-level or assumption-level vulnerabilities.
We will explore several real-world examples that reveal how probabilistic models can be off by multiple orders of magnitude—and how such errors contributed to catastrophic system failures.
Traditional safety techniques, especially those originating in the 1950s and 60s, are focused on component failure losses. These occur when a component does not perform according to its written requirements. The consensus across various industry standards, such as IEC 61508, is that a failure is when a component either stops providing a required function or behaves in a way that contradicts its specifications.
Classic examples:
All of these result from a deviation from written requirements, and they're the foundation for reliability engineering. Solutions usually involve redundancy (multiple valves or pumps), preventive maintenance, or fail-safe design (e.g., a spring-loaded valve that fails open).
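As a minimal sketch of why redundancy helps against component failure losses: the failure probability below is invented, and the two valves are assumed to fail independently, an assumption that common-cause failures frequently violate.

```python
# Illustrative only: the per-demand failure probability is invented, and the two
# valves are assumed to fail independently (common-cause failures would break this).
p_valve = 0.01                    # assumed probability a single valve fails on demand

p_single_valve_system = p_valve   # one valve: the system fails if it fails
p_redundant_pair = p_valve ** 2   # two valves in parallel: both must fail

print(f"Single valve:   {p_single_valve_system:.4f}")   # 0.0100
print(f"Redundant pair: {p_redundant_pair:.6f}")        # 0.000100 under the independence assumption
```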
Modern safety must also address component interaction losses—situations where all components operate as designed, yet the system produces unsafe behavior due to unanticipated interactions.
This shift began to be recognized in the 2000s, giving rise to what's now called the new view of system safety:
These are often emergent behaviors not explicitly specified anywhere. They occur because of flawed assumptions, incomplete requirements, or overlooked dependencies between software, hardware, and human actions.
The Mars Polar Lander was an unmanned spacecraft designed to land on the Martian surface. During its descent, the lander deployed a parachute to slow down in Mars' thin atmosphere. It then jettisoned a heat shield and deployed three landing legs equipped with vibration sensors—also known as touchdown sensors. These sensors were designed to detect when the spacecraft touched the Martian surface, signaling the computer to cut off the descent thrusters.
During descent, the sequence of events executed exactly as planned: the parachute deployed in the thin atmosphere, the heat shield was jettisoned, and the three landing legs swung into position. The shock of leg deployment produced a brief but real vibration in the legs.
These sensors were doing their job correctly—they registered the vibration and sent a signal to the flight computer. The software, interpreting these simultaneous signals from all three legs, concluded that the lander had touched down on Mars. Following its programming, it immediately shut off the descent thrusters.
But the lander hadn't touched down. It was still 40 meters in the air. With the thrusters disabled, it entered free fall and crashed onto the Martian surface at high speed. The $110 million mission was lost.
Who was at fault? It's complicated: the sensor engineers could point out that the touchdown sensors had detected a real vibration and reported it exactly as specified, and the software engineers could point out that the flight software had done exactly what its documented requirements said.
And they were both right. Every component operated correctly, in accordance with its written requirements.
This was not a component failure.
This was an interaction failure.
The engineers had focused extensively on failure modes. Thousands of potential hardware and software failures were analyzed. But no one had asked: What if everything works as designed—but the system logic is flawed? That question never made it into any fault tree, any FMEA, or any requirements review.
The critical flaw here was in the interaction between correct behaviors. The landing leg sensors responded to a legitimate vibration (from leg deployment), but the system treated that vibration as a landing event. That interpretation occurred too early—when the spacecraft was still airborne.
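The flawed logic can be caricatured in a few lines. This is not the actual flight software; the function names, the altitude guard, and the one-metre threshold below are all invented for illustration. The point is only that a rule that is locally correct ("cut the thrusters when the touchdown sensors fire") becomes unsafe when the controller lacks context such as altitude:

```python
# Caricature of the landing logic -- not the real Mars Polar Lander flight code.
# All names, thresholds, and values are invented for illustration.

def thrusters_should_stay_on(touchdown_signal: bool, thrusters_on: bool) -> bool:
    """As-flown sketch: any touchdown signal cuts the thrusters immediately,
    even if the signal is only the transient from landing-leg deployment."""
    if touchdown_signal:
        return False                      # thrusters off -> free fall if still airborne
    return thrusters_on

def thrusters_should_stay_on_guarded(touchdown_signal: bool, altitude_m: float,
                                     thrusters_on: bool) -> bool:
    """Guarded sketch: the touchdown signal is only believed near the surface."""
    if touchdown_signal and altitude_m < 1.0:
        return False
    return thrusters_on

# Leg deployment at roughly 40 m produces a spurious touchdown signal:
print(thrusters_should_stay_on(True, True))                # False -> thrusters cut in mid-air
print(thrusters_should_stay_on_guarded(True, 40.0, True))  # True  -> keep descending under power
```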
This highlights an important lesson:
Emergent behavior can arise even when individual components behave correctly.
Failure-based methods like FMEA and Fault Tree Analysis are excellent at identifying what might go wrong with a component. But they assume the system works as designed. They rarely consider what happens when components interact in unanticipated—but technically correct—ways.
In this case, the sensors did not fail, and the software did not fail.
What failed was the system's model of the world—its belief that it was already on the surface of Mars.
This is precisely the kind of hazard that STPA is designed to uncover.
Hitomi was a Japanese X-ray astronomy satellite launched by JAXA in 2016. It was designed to observe high-energy phenomena like black holes and galaxy clusters, offering deep insights into the structure of the universe. The spacecraft carried extremely sensitive instruments and had an ambitious mission profile. Everything worked perfectly after launch, and early operations were promising.
But then, just over a month into the mission, Hitomi suddenly went silent. Ground control lost contact. Within hours, telescopes tracking the spacecraft observed that it had broken into pieces. A $273 million mission was lost.
The sequence of failure began with a faulty reading from one of the spacecraft's inertial reference units (IRUs). It falsely reported that the satellite was slowly rotating. This was not true—the spacecraft was stable. However, the onboard flight software responded by commanding the reaction wheels to counter the perceived spin.
The spacecraft had multiple attitude sensors, including star trackers that could independently verify orientation by capturing images of the sky. These star trackers disagreed with the IRU and correctly reported that the satellite was not spinning. But the system had been configured to trust the IRUs more than the star trackers. The disagreement led to the star tracker data being disregarded.
As the IRU continued to falsely indicate increasing rotation, the system kept spinning the reaction wheels faster to compensate. Eventually, the wheels hit their speed limits. At this point, the system triggered an automatic failover: it switched to thrusters to stabilize the spacecraft.
This was a catastrophic misstep. Since the satellite wasn't rotating in the first place, the thruster burst imparted real spin to a previously stable spacecraft. It began to tumble uncontrollably. The spin caused structural elements like solar arrays and instruments to break off, leading to complete disintegration.
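The chain of decisions can be sketched as a simple arbitration-and-failover routine. This is an illustrative reconstruction, not JAXA's software; the names, rate threshold, and wheel limit are invented. The point is how a fixed trust ordering plus an automatic failover can turn one bad sensor reading into a destructive control action:

```python
# Illustrative reconstruction of the Hitomi decision chain -- not the actual
# flight software. All names, units, and thresholds are invented.

WHEEL_SPEED_LIMIT = 6000.0   # assumed reaction-wheel saturation speed (rpm)

def attitude_response(iru_rate: float, star_tracker_rate: float,
                      wheel_speed: float) -> str:
    """Decide how to counter the rotation the controller believes it sees."""
    # Fixed trust ordering: the IRU is always believed; when the star trackers
    # disagree, their data is simply discarded (so star_tracker_rate is ignored).
    believed_rate = iru_rate

    if abs(believed_rate) < 0.01:
        return "hold attitude"                 # no perceived rotation
    if wheel_speed < WHEEL_SPEED_LIMIT:
        return "spin reaction wheels faster"   # counter a rotation that is not real
    return "fire thrusters"                    # automatic failover: imparts *real* spin

# The IRU falsely reports rotation; the star trackers correctly report ~0.
print(attitude_response(iru_rate=0.5, star_tracker_rate=0.0, wheel_speed=5000.0))
print(attitude_response(iru_rate=0.5, star_tracker_rate=0.0, wheel_speed=6000.0))  # -> thrusters
```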
Every component in the Hitomi satellite worked as designed:
The issue was a classic interaction failure:
This was not a hardware malfunction. It was a failure in the system's control logic and trust architecture. The satellite trusted the wrong data, and when components followed through with their responses, the system spiraled into destruction.
A systems-level safety analysis method like STPA could have identified the unsafe control actions that led to this outcome:
Question 4: What critical design flaw led to the destruction of the Hitomi satellite?
Incorrect. The star trackers actually worked correctly and reported that the satellite was stable. The problem was that the system was designed to trust the IRU data more, even when the star trackers contradicted it.
Correct! The critical flaw was in how the system handled sensor disagreement. When the IRU incorrectly indicated rotation while the star trackers correctly showed stability, the system was designed to trust the IRU more. There was no mechanism to resolve this conflict or detect that the IRU might be providing false data. This demonstrates a process model flaw in the control system—it believed the satellite was rotating when it wasn't.
Incorrect. The reaction wheels functioned as designed, spinning faster in response to the IRU data. They hit their speed limits because they were continually trying to counter a rotation that didn't actually exist.
Incorrect. The software executed exactly as it was programmed to do—the issue wasn't bugs in the code. The problem was the design of the system architecture and trust relationships between sensors, which is a higher-level design decision rather than a coding error.
On March 18, 2018, an autonomous vehicle operated by Uber struck and killed a pedestrian in Tempe, Arizona. This was the first known fatality involving a self-driving car and a pedestrian.
The vehicle involved was a modified Volvo XC90 SUV equipped with Uber's experimental autonomous driving system. Although the car had a safety driver sitting behind the wheel, the vehicle was in full autonomous mode at the time of the crash.
The victim, a woman named Elaine Herzberg, was walking her bicycle across a darkened stretch of road at night. There was no crosswalk at that location. The road was wide, visibility was poor, and she emerged from the shadows into the vehicle's path.
The self-driving system had nearly 6 seconds to react.
Uber's software stack had a complex system for perception and tracking. The system worked roughly like this:
In Herzberg's case, the system detected her. Repeatedly. But over the span of several seconds, it kept changing its classification: first an unknown object, then a vehicle, then a bicycle, then back to an unknown object.
Each classification had different default behavioral predictions. For instance:
Because the classification kept changing, the predicted behavior of the object also changed constantly.
Crucially, Uber's software stack did not maintain state continuity. Every cycle, it restarted the object classification and prediction process from scratch, so the object's identity and behavior were reinterpreted every 100 milliseconds or so. As a result, the system never built a stable picture of what the object was or where it was heading, and never gained enough confidence to act.
Only 1.2 seconds before impact, the system finally recognized that the object was a pedestrian and that a collision was imminent. But by design, it was too late.
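A minimal sketch of the difference between a memoryless perception loop and one that carries a track across cycles is shown below. This is not Uber's actual stack; the labels, the confidence counter, and the three-cycle threshold are all invented to illustrate why confidence never accumulates when each cycle starts from scratch:

```python
# Minimal sketch of memoryless vs. tracked perception -- not Uber's actual stack.
# Labels, the confidence counter, and the threshold are invented for illustration.
from typing import List, Optional

CYCLES_NEEDED_TO_BRAKE = 3   # assumed: act only after this many consistent cycles

def memoryless_loop(detections: List[str]) -> Optional[int]:
    """Each ~100 ms cycle starts from scratch: no identity is carried over."""
    for cycle, label in enumerate(detections):
        confidence = 1                        # reset every cycle -- never accumulates
        if label == "pedestrian" and confidence >= CYCLES_NEEDED_TO_BRAKE:
            return cycle                      # would trigger emergency braking
    return None                               # never confident enough to brake

def tracking_loop(detections: List[str]) -> Optional[int]:
    """Same detections, but identity persists, so confidence can build."""
    confidence = 0
    for cycle, label in enumerate(detections):
        if cycle > 0 and label == detections[cycle - 1]:
            confidence += 1                   # same identity as last cycle
        else:
            confidence = 1                    # classification changed: new track
        if label == "pedestrian" and confidence >= CYCLES_NEEDED_TO_BRAKE:
            return cycle
    return None

frames = ["unknown", "vehicle", "bicycle", "pedestrian", "pedestrian", "pedestrian"]
print(memoryless_loop(frames))   # None -- the memoryless loop never acts
print(tracking_loop(frames))     # 5 -- acts once the track has been stable long enough
```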
The emergency braking system built into the Volvo was disabled during autonomous operation to avoid conflicting with Uber's own braking software. And Uber's own software had a built-in delay in taking emergency actions—partly to reduce false positives.
The system never activated the brakes.
The car struck Herzberg at approximately 40 mph. She died from her injuries.
This was not a failure of perception hardware: the LIDAR and radar were functioning properly and detected her repeatedly.
Instead, this was a failure of perception logic and interaction assumptions.
Key design flaws included the lack of state continuity in object tracking, a built-in delay before emergency action intended to reduce false positives, and the decision to disable the Volvo's own emergency braking system during autonomous operation.
This was yet another case where components worked individually, but the interactions between subsystems failed to deliver safety.
This is a textbook example of system-level failure due to interaction complexity and flawed assumptions, not component malfunction.
Question 5: In the Uber self-driving car accident, what design flaw in the perception system contributed most directly to the fatality?
Incorrect. The sensors actually detected the pedestrian successfully multiple times. The LIDAR and radar systems were functioning properly. The issue was with how the system processed and interpreted this sensor data over time.
Correct! The perception system kept reclassifying the pedestrian (first as an unknown object, then a vehicle, then a bicycle, then back to unknown) and didn't maintain state continuity. Every 100 milliseconds, it restarted classification from scratch without "remembering" what it had detected previously. This lack of temporal memory meant the system never gained enough confidence about the object's identity or trajectory to trigger emergency braking.
Incorrect. The software was not buggy—it executed as designed. The issue was the design itself, particularly how the perception stack processed information across time and made decisions about when to apply braking based on classification confidence.
Incorrect. While the safety driver's distraction was a contributing factor, this quiz question asks specifically about the design flaw in the perception system that contributed to the accident. The system's design flaws would have been problematic regardless of driver attention.
The Boeing 787 Dreamliner was a revolutionary aircraft in many ways. It was Boeing's most advanced passenger airplane at the time, featuring cutting-edge composite materials, improved fuel efficiency, and state-of-the-art avionics. But one of its boldest design decisions was the use of lithium-ion batteries—the same type used in laptops and smartphones—to provide electrical power for many onboard systems.
This decision was driven by the need for weight reduction and improved power efficiency. However, lithium-ion batteries come with well-known risks: if overcharged, overheated, or physically damaged, they can catch fire or explode.
To mitigate this, Boeing built multiple layers of redundancy and protection around the battery system. They installed fire-resistant boxes, implemented smoke sensors, and engineered software logic to monitor and shut down systems under abnormal conditions.
Despite these precautions, two serious battery fires occurred within months of the aircraft entering service, leading to the grounding of the entire 787 fleet worldwide in January 2013.
On January 7, 2013, a Japan Airlines 787 was parked at the gate at Boston Logan International Airport. With no passengers onboard, the auxiliary power unit (APU) was running to keep systems operational. Then, unexpectedly, smoke was seen coming from the battery compartment.
Firefighters were called. When they opened the battery enclosure, they discovered thermal runaway—a battery cell had overheated, caught fire, and triggered adjacent cells to ignite. It took hours to fully cool the compartment.
Luckily, no one was injured. But it raised alarm bells.
Just over a week later, an All Nippon Airways (ANA) 787 had to make an emergency landing in Japan when pilots received warnings about battery problems and detected a burning smell in the cockpit. Passengers were evacuated, and once again, thermal runaway in the battery system was the cause.
This second event confirmed that the first fire wasn't an isolated manufacturing defect. Something systemic was wrong.
The NTSB (National Transportation Safety Board) launched an investigation. They examined battery enclosures, circuit boards, fire containment systems, and logs of system behavior.
What they found was not a single failed component—but a collection of interacting design assumptions that failed under real-world conditions.
Key findings included thermal runaway that cascaded from one cell to adjacent cells despite the assumption that a cell failure would remain isolated, a containment enclosure that could not fully prevent heat propagation, and power management software that, in one case, shut down the very fans that were supposed to vent smoke.
So once again: every protection worked according to its individual specification, yet the system as a whole was not safe.
This wasn't just a battery issue—it was a system-level issue:
The 787 battery fires provide a clear example of modern aerospace systems operating in tightly coupled, highly interactive environments—where failure often comes not from a broken part, but from broken assumptions about how parts behave together.
These incidents forced Boeing to redesign the battery enclosure, add stronger fire containment, modify software behavior, and review all power management assumptions. It was a sobering reminder that:
"Safe components do not guarantee a safe system."
This incident emphasizes the value of methods like STPA, which explicitly examine unsafe control actions and flawed mental models—not just component reliability.
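To make the interaction point concrete, here is a hedged sketch of how two locally correct rules can combine unsafely, loosely modeled on the fan-and-smoke interaction noted in the findings. The rule wording, load names, and logic are invented; this is not Boeing's design:

```python
# Two locally "correct" rules interacting unsafely -- an invented sketch loosely
# modeled on the fan/smoke-venting interaction, not Boeing's actual logic.

def power_management(battery_fault: bool, powered_loads: set) -> set:
    """Rule 1: on a battery fault, shed loads classified as non-essential."""
    if battery_fault:
        return powered_loads - {"cabin_fans"}    # fans assumed non-essential here
    return powered_loads

def smoke_is_vented(smoke_detected: bool, powered_loads: set) -> bool:
    """Rule 2: vent smoke using the cabin fans (implicitly assumes they have power)."""
    return smoke_detected and "cabin_fans" in powered_loads

loads = {"cabin_fans", "avionics", "lighting"}
loads = power_management(battery_fault=True, powered_loads=loads)  # rule 1 works as specified
print(smoke_is_vented(smoke_detected=True, powered_loads=loads))   # False: smoke is not vented
```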
Question 6: Which statement best describes why the Boeing 787 battery fires represented a system-level rather than a component-level failure?
Incorrect. While manufacturing defects in one cell may have contributed to the initial short circuit, the case study emphasizes that the key issue was how the entire system responded to this event. Individual component quality was not the primary system safety concern.
Incorrect. The engineers did follow standard safety protocols of the time and implemented multiple layers of protection. The issue was that their design assumptions about how these various protection systems would interact were flawed.
Correct! This is the essence of a system-level failure. The smoke detection system worked, the battery monitoring system worked, the fire containment enclosure worked—all according to their individual specifications. However, their interactions created unsafe conditions: the power management software shut down fans that were supposed to vent smoke, thermal runaway cascaded across cells despite assumptions that failures would be isolated, and the fire containment couldn't fully prevent heat propagation. These interaction failures and flawed assumptions about system behavior are hallmarks of modern system safety issues.
Incorrect. The smoke detection system actually worked as designed and detected the smoke. The problem was that in one case, the power management software shut down the fans that were meant to vent the smoke, showing an interaction problem rather than a component failure.
As we've seen from previous case studies—Mars Polar Lander, Hitomi, Uber's autonomous vehicle, and the 787 battery fires—failures today often stem from correctly working components interacting in unsafe ways.
To understand and prevent these kinds of losses, we need a model that allows us to:
This is where the Control Loop Framework comes in—a foundational concept used by STPA.
At the core of any system involving decisions, commands, and actions lies a control loop. This framework is composed of a controller, which contains a control algorithm and a process model (its internal belief about the state of the controlled process); control actions, issued through actuators; the controlled process itself; and feedback, returned through sensors.
If a controller's process model is flawed, it may make decisions that are entirely reasonable—given the model—but completely unsafe in reality.
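A minimal control-loop skeleton makes this concrete. The class and field names below are invented (this is a sketch of the framework, not an STPA artifact or any real flight software), and the touchdown-signal example deliberately echoes the lander case revisited next. The key point is that the controller decides from its process model, not from reality:

```python
# Generic control-loop skeleton -- a sketch of the framework, not real flight code.
# All class, field, and function names are invented for illustration.
from dataclasses import dataclass

@dataclass
class ProcessModel:
    """The controller's belief about the state of the controlled process."""
    on_surface: bool = False

class Controller:
    def __init__(self) -> None:
        self.model = ProcessModel()

    def update_model(self, feedback: dict) -> None:
        # Feedback from sensors is the controller's only window onto reality.
        # If this mapping is wrong, the model -- and every later decision -- is wrong.
        if feedback.get("touchdown_signal"):
            self.model.on_surface = True

    def decide(self) -> str:
        # Control actions are chosen from the model, not from the real world.
        return "cut_thrusters" if self.model.on_surface else "keep_descending"

controller = Controller()
controller.update_model({"touchdown_signal": True})   # spurious signal, e.g. leg deployment
print(controller.decide())   # "cut_thrusters": reasonable given the model, unsafe in reality
```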
Let's revisit the Mars Polar Lander through the lens of a control loop:
The process model was incomplete. It failed to distinguish between a vibration due to landing gear deployment and an actual surface landing.
STPA focuses on identifying Unsafe Control Actions (UCAs): cases where a control command is not provided when it is needed, is provided when it creates a hazard, is provided too early, too late, or in the wrong order, or is stopped too soon or applied for too long.
These unsafe actions often arise not because of bad code or hardware failure, but because of incorrect assumptions within the controller's process model.
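One common way to record UCAs is a simple table of a control action against those four types. The example below is illustrative only; the wording is mine rather than an official analysis, and it uses the descent-thrust command from the lander case for concreteness:

```python
# Illustrative UCA table for a single control action, organized by the four UCA
# types. The wording is invented for illustration, not an official analysis.

control_action = "provide descent thrust"

unsafe_control_actions = {
    "not provided when needed":
        "no thrust is provided during terminal descent, so the lander cannot slow down",
    "provided when it causes a hazard":
        "thrust continues after a genuine touchdown, destabilizing the lander",
    "provided too early, too late, or out of order":
        "thrust begins too late in the descent to shed enough velocity",
    "stopped too soon or applied too long":
        "thrust is cut at ~40 m when a spurious touchdown signal arrives (the actual loss)",
}

for uca_type, example in unsafe_control_actions.items():
    print(f"{control_action} -- {uca_type}: {example}")
```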
The framework is not limited to software systems. In fact, it works especially well for human-machine interactions.
Take an air traffic controller:
If the controller misinterprets radar signals (e.g., due to signal lag), they may issue a command that leads to a near miss or collision—even though their actions were reasonable based on their belief.
Across systems, common sources of unsafe control include:
All of these are prime targets for analysis within STPA.
Most traditional safety tools focus on what might fail. Control loop modeling focuses on why correct actions in context may still lead to loss.
It gives us the vocabulary to talk about:
… all of which are now primary contributors to modern accidents.
Below is an interactive diagram editor that shows the control loop framework discussed in the previous section: