Fault Trees and Reliability: Why Network and Autonomous Systems Need Redundancy


2026-03-11

Learn fault tree analysis and reliability math with Verizon and Tesla case examples. Practical redundancy exercises and MTBF calculations for engineers.

When networks and self-driving systems fail, people notice — and regulators do too. If you felt stranded during the last Verizon outage or uneasy after headlines about Tesla’s FSD investigations, you’re not alone. This guide teaches fault tree analysis and reliability math step-by-step, using those real-world episodes to show how redundancy, MTBF, and probability reduce risk.

Start here: the most actionable ideas first. If you design, operate, or audit network or autonomous systems, you’ll learn how to build a fault tree, compute top-event probabilities, pick redundancy strategies (active-active vs. diverse backups), and calculate availability from MTBF and MTTR. By the end you’ll be able to produce minimal cut sets, estimate the value of adding a redundant component, and create exercises your team can run in 30–60 minutes.

Why redundancy matters now (2026 context)

Two short vignettes from late 2025 set the scene:

  • Regulators reopened and expanded probes into Tesla’s Full Self-Driving (FSD) systems after reports that some cars ignored red lights and made unsafe maneuvers. NHTSA asked Tesla for fleet and incident data — a clear signal that regulators now demand traceable safety evidence from AI-enabled systems.
  • Major telecom outages (e.g., a high-profile Verizon disruption that drew public refunds and credit offers) showed that large providers still face single points of failure in routing, signaling, and service orchestration — with real economic and social impact.

These events illustrate two pressures shaping reliability engineering in 2026: stronger regulatory scrutiny (especially for AI/automation) and the high societal cost of outages. Organizations are responding with layered redundancy, probabilistic safety cases, AI-based monitoring, and investments in edge/cloud diversity.

Core reliability math you must master

Before you build fault trees, understand these core metrics. I use simple, repeatable formulas you can implement in spreadsheets or code.

MTBF (Mean Time Between Failures)

MTBF is the expected uptime between failures for a repairable component. It’s often measured in hours. For constant failure rate systems, failure rate lambda = 1/MTBF.

MTTR (Mean Time To Repair)

MTTR is the average time to restore a failed component to service.

Availability

For steady-state systems: Availability A = MTBF / (MTBF + MTTR). Example: if MTBF = 10,000 hours and MTTR = 10 hours, A = 10,000 / 10,010 = 0.999 = 99.9%.

Failure rate and reliability over time

For constant failure rate: lambda = 1/MTBF. Reliability R(t) = exp(-lambda * t). That gives the probability a component survives time t without failing.
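These formulas translate directly into spreadsheet cells or code. A minimal Python sketch (function names are my own, chosen for illustration):

```python
import math

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def reliability(mtbf_hours: float, t_hours: float) -> float:
    """Survival probability R(t) = exp(-lambda * t), constant failure rate."""
    lam = 1.0 / mtbf_hours  # failure rate lambda = 1/MTBF
    return math.exp(-lam * t_hours)

# Example from the text: MTBF = 10,000 h, MTTR = 10 h -> A ~ 99.9%
print(f"{availability(10_000, 10):.4f}")   # 0.9990
# Probability a component with MTBF = 100,000 h survives 24 h
print(f"{reliability(100_000, 24):.5f}")   # 0.99976
```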

Fault Tree Analysis (FTA) — a step-by-step tutorial

Fault Tree Analysis is a top-down way to model how component failures combine to cause an undesired top event (like a regional outage or an unsafe autonomous maneuver).

Step 1 — Define the top event

Be precise. Top event examples: "Loss of voice and data service in metro region" or "FSD ignores red traffic lights."

Step 2 — Identify immediate contributing faults

List causes and group them with logic gates: OR (any cause alone can trigger top event) or AND (need all causes together). Use intermediate events where appropriate.

Step 3 — Decompose to component level

Break each contributing fault into hardware, software, operations, and external factors (power, fiber cut, GPS spoofing). This is where MTBF/MTTR and field failure data enter the analysis.

Step 4 — Quantify probabilities

Assign probabilities using MTBF and R(t) or historical incidence rates. For independent events connected by:

  • OR gate: P(OR) = 1 - Π (1 - P_i)
  • AND gate: P(AND) = Π P_i
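Both gate formulas fit in a few lines of Python (a sketch; names are illustrative, and independence of the input events is assumed throughout):

```python
from math import prod

def p_or_gate(probs):
    """OR gate: top event occurs if any input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)

def p_and_gate(probs):
    """AND gate: top event occurs only if all input events occur."""
    return prod(probs)

print(f"{p_or_gate([0.01, 0.02]):.4f}")   # 0.0298
print(f"{p_and_gate([0.01, 0.02]):.4f}")  # 0.0002
```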

Step 5 — Find minimal cut sets

Minimal cut sets are smallest combinations of basic events that cause the top event. They help prioritize mitigation.

Worked example 1: Simplified fault tree for a Verizon-style service outage

Top event: Regional Service Outage

Immediate causes (OR):

  • Core router failure
  • DNS/service control plane outage
  • Major fiber cut + transport failure
  • Power failure at data center (with UPS failing)

Model assumptions (illustrative, simple numbers):

  • Core router MTBF = 100,000 hours -> lambda = 1e-5 /hr -> probability of failure over 24 hours: P_r = 1 - exp(-24/100000) ≈ 0.00024
  • DNS/service control plane historic outage probability over 24 hours: P_dns = 0.0005
  • Major fiber cut probability in region over 24 hours: P_fiber = 0.0002
  • Data center power outage with UPS failure in 24 hours: P_power = 0.00005

Top-event probability (OR): P_outage = 1 - (1-P_r)*(1-P_dns)*(1-P_fiber)*(1-P_power)

Numeric: P_outage ≈ 1 - (1-0.00024)*(1-0.0005)*(1-0.0002)*(1-0.00005) ≈ 0.00099 (≈ 0.099% chance of outage that day under these assumptions).
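The OR-gate arithmetic is easy to verify in code (a sketch using the illustrative probabilities above):

```python
import math

# Illustrative 24-hour failure probabilities from the example
p_router = 1 - math.exp(-24 / 100_000)   # ~0.00024
p_dns    = 0.0005
p_fiber  = 0.0002
p_power  = 0.00005

# OR gate over independent basic events
p_outage = 1 - (1 - p_router) * (1 - p_dns) * (1 - p_fiber) * (1 - p_power)
print(f"{p_outage:.6f}")   # ~0.000990
```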

Minimal cut sets (single-component causing top event): {Core router}, {DNS}, {Fiber cut}, {Power}. If you add redundancy to the router (two routers in parallel), you change the cut sets and decrease probability.

How much does redundancy help? — Router redundancy example

Two identical routers in parallel (independent failures) — system fails only if both fail (AND):

R_single (24h survival) = 1 - P_r = 0.99976

R_parallel = 1 - (1 - R_single)^2 = 1 - (P_r)^2 ≈ 1 - (0.00024)^2 ≈ 0.9999999424

Probability both fail in 24 hours ≈ (0.00024)^2 = 5.76e-8, i.e., dramatically reduced.
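The parallel (AND-gate) calculation, as a quick sketch with the same illustrative numbers:

```python
import math

p_r = 1 - math.exp(-24 / 100_000)   # single-router 24 h failure probability
r_single = 1 - p_r                  # ~0.99976
r_parallel = 1 - p_r ** 2           # system fails only if both fail (AND gate)
print(f"{r_single:.5f} {p_r ** 2:.3g}")   # 0.99976 5.76e-08
```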

Worked example 2: Fault tree for Tesla FSD ignoring a red light

Top event: Vehicle fails to stop at a red light (unsafe action)

Contributing factors (OR):

  • Perception subsystem fails to detect the light
  • Decision/planning logic incorrectly classifies action
  • Actuator failure prevents braking
  • Localization/mapping error places vehicle off-lane

Decompose perception failure further (OR):

  • Camera failure
  • Lidar/radar failure
  • Perception software model misclassification (e.g., due to edge-case or training data drift)

Assume independent probabilities (illustrative):

  • Camera fail in 24h: 0.0001
  • Lidar fail in 24h: 0.00005
  • Perception model misclassify red-light event (edge-case): 0.001
  • Decision logic flaw causing no-stop: 0.0002
  • Brake actuator fail (no braking): 0.00002

Perception failure probability: P_perc = 1 - (1-0.0001)*(1-0.00005)*(1-0.001) ≈ 0.0011499

Top event P_top = 1 - (1 - P_perc)*(1 - P_decision)*(1 - P_actuator)*(1 - P_localization) with P_localization assumed 0.0001.

Numeric: P_top ≈ 1 - (1-0.0011499)*(1-0.0002)*(1-0.00002)*(1-0.0001) ≈ 0.00147 ≈ 0.147% per 24 hours under these crude assumptions.
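The two-level OR computation can be reproduced in a few lines (a sketch using the illustrative probabilities above):

```python
def p_or(probs):
    """OR gate: P = 1 - product(1 - P_i), independence assumed."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

# Perception sub-tree: camera, lidar/radar, model misclassification
p_perc = p_or([0.0001, 0.00005, 0.001])            # ~0.00115
# Top event: perception, decision logic, actuator, localization
p_top = p_or([p_perc, 0.0002, 0.00002, 0.0001])    # ~0.00147
print(f"{p_perc:.5f} {p_top:.5f}")
```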

Key insight: reducing the perception model error from 0.001 to 0.0001 (by better training, validation, OOD detection) cuts the top-event probability substantially. So does adding diversity (multiple sensor types, independent perception stacks), or safe-design mitigations like minimum-risk maneuvers when the model is uncertain.

Redundancy design principles and pitfalls

Redundancy helps but can be misapplied. Here are practical principles:

  • Diversity over duplication: Identical systems can share common-mode failures (software bug, manufacturing defect). Use diverse stacks (different vendors, different algorithms) to reduce common-cause risk.
  • Independent failure assumptions: Probabilistic math often assumes independence. Check for coupling: shared power, shared cooling, or shared software updates break independence.
  • Common Cause Factor (beta): When modeling parallel redundancy, include a beta factor for common-cause: P_system_fail ≈ beta * P + (1 - beta) * P^2 (simplified form for two components), where beta ∈ [0,1]. Beta near 1 means high common cause risk and little redundancy benefit.
  • Active-active vs active-passive: Active-active often reduces switchover risk but increases complexity; active-passive simplifies failover at cost of underused capacity.
  • Operational readiness and testing: Redundancy that fails on failover is useless. Test failover modes regularly and monitor MTTR.
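The beta-factor bullet above can be sketched numerically; beta ∈ [0,1] interpolates between the full redundancy benefit (P^2) and none at all (P):

```python
def p_fail_redundant_pair(p: float, beta: float) -> float:
    """Simplified two-unit beta-factor model from the text:
    a fraction beta of failures is common-cause and takes out both units."""
    return beta * p + (1 - beta) * p ** 2

p = 0.001
for beta in (0.0, 0.01, 0.1, 1.0):
    # Even beta = 0.01 makes the common-cause term dominate p**2
    print(f"beta={beta}: {p_fail_redundant_pair(p, beta):.2g}")
```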

Exercise 1 — Router redundancy (hands-on)

Scenario: A core router has MTBF = 200,000 hours and MTTR = 4 hours. You are considering adding a hot standby router (identical model). Calculate availability with single router and with two in parallel. Assume independent failures.

Solution (step-by-step):

  1. Compute single-router availability from MTBF and MTTR: A_single = MTBF / (MTBF + MTTR) = 200,000 / (200,000 + 4) ≈ 0.99998 (≈ 99.998%).
  2. Probability of failure at any random instant: P_fail ≈ 1 - A_single = 0.00002.
  3. For two routers in parallel (system fails only if both fail simultaneously): P_system_fail ≈ P_fail^2 = 4e-10. So availability ≈ 1 - 4e-10 ≈ 0.9999999996 (effectively 99.99999996%).
  4. Interpretation: redundancy dramatically increases availability, but test for shared dependencies (power, software updates). If beta (common cause factor) = 0.01, approximate system fail = beta * P_fail + (1 - beta) * P_fail^2 ≈ 0.01*0.00002 + (0.99)*4e-10 ≈ 2e-7 + ~4e-10 ≈ 2.000004e-7, i.e., benefits reduce when beta > 0.
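The four steps above can be checked in a few lines (illustrative numbers; independence assumed except where beta is applied):

```python
mtbf, mttr = 200_000.0, 4.0
a_single = mtbf / (mtbf + mttr)       # step 1: single-router availability
p_fail = 1 - a_single                 # step 2: ~2e-5
p_both = p_fail ** 2                  # step 3: both routers down, ~4e-10
beta = 0.01                           # step 4: common-cause factor
p_beta = beta * p_fail + (1 - beta) * p_fail ** 2   # ~2e-7
print(f"{a_single:.5f} {p_both:.2g} {p_beta:.2g}")
```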

Exercise 2 — Sensor fusion MTBF for safe stop

Scenario: An autonomous vehicle uses three independent cameras to detect traffic lights; decision uses majority voting (at least 2 of 3 must see red to command stop). Each camera has MTBF = 500,000 hours. Over a 12-hour drive day, what is the probability the voting fails to see a red light caused solely by camera hardware failure (ignore perception software)?

Solution:

  1. Failure probability for one camera in 12 hours: P_cam = 1 - exp(-12/500000) ≈ 2.4e-5.
  2. Voting fails only if at least two cameras fail: P_fail_vote = C(3,2)*(P_cam^2)*(1-P_cam) + P_cam^3 ≈ 3*(2.4e-5)^2*(1-2.4e-5) + (2.4e-5)^3 ≈ 1.728e-9 + ~1.38e-14 ≈ 1.73e-9.
  3. So hardware-only voting failure probability over 12 hours ≈ 1.7e-9 (extremely small). But include software/model errors and environmental effects (glare) for real-world assessments.
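The 2-of-3 voting failure is a binomial tail; a short sketch (function name is my own):

```python
import math

def p_vote_fail(p: float, n: int = 3, k: int = 2) -> float:
    """Probability at least k of n independent units fail (voting quorum lost)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

p_cam = 1 - math.exp(-12 / 500_000)   # ~2.4e-5 per 12-hour drive day
print(f"{p_vote_fail(p_cam):.3g}")    # ~1.73e-09
```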

Trends shaping reliability engineering in 2026

As of 2026, three trends are reshaping reliability and fault analysis:

  • AI-driven predictive maintenance: Operators use sensor telemetry and ML models to predict imminent failures, increasing effective MTBF by reducing unplanned failure windows.
  • Digital twins & scenario simulation: Digital replicas of networks and vehicle fleets enable probabilistic what-if analyses before deploying updates, improving change management safety.
  • Regulatory focus on AI assurance: Agencies like NHTSA have expanded data demands and may require probabilistic safety cases for automated driving deployments. Telecom regulators and consumer protection bodies increasingly expect demonstrable resilience (e.g., post-outage credits and remediation).

Combine these with classic engineering: redundancy, diversity, monitoring, and rigorous FTA. For automotive teams, add software assurance practices: continuous validation on edge cases, explainable model outputs, and shadow mode testing.

Practical rule: redundancy is necessary but not sufficient. Design for independent failure modes, monitor for common cause, and test failover often.

Practical checklist for teams (actionable)

  • Run an FTA on every critical top event at least annually and after major architecture changes.
  • Collect field MTBF data and update lambda estimates monthly; use Bayesian updates if data is sparse.
  • Model common-cause factors explicitly (beta) rather than assuming independence.
  • Prefer diversity (different vendors, algorithms) when possible, not just duplicates.
  • Implement automated failover test harnesses and simulate outages during maintenance windows.
  • For AI systems, instrument uncertainty estimates (confidence, OOD detection) into safety decisions.

Exercises and answers (downloadable worksheet idea)

Below are two short exercises you can paste into a spreadsheet or assign to a study group.

Exercise A (network):

Given: Two ISP upstream links (A and B) in parallel. Link A yearly outage probability 0.02 (2%), Link B yearly outage 0.015 (1.5%). Assume independence. What's the annual probability both links are down?

Answer: P_both = 0.02 * 0.015 = 0.0003 (0.03%). Router-level failover reduces single upstream risk dramatically; if beta = 0.1 (shared fiber bundle), approx P_both ≈ beta * min(P) + (1-beta)*product ≈ 0.1*0.015 + 0.9*0.0003 ≈ 0.0015 + 0.00027 = 0.00177 (0.177%).
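The answer, including the rough common-cause adjustment, as a sketch:

```python
p_a, p_b = 0.02, 0.015

# Independent case: both upstream links down in the same year
p_both = p_a * p_b                                     # 0.0003

# Rough common-cause adjustment from the text (shared fiber bundle)
beta = 0.1
p_both_cc = beta * min(p_a, p_b) + (1 - beta) * p_a * p_b
print(f"{p_both:.4f} {p_both_cc:.5f}")                 # 0.0003 0.00177
```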

Exercise B (autonomy):

Design a minimal redundancy scheme that reduces top-event P_top (from earlier Tesla example) by at least 50% without changing hardware MTBF. What approaches can you take?

Sample answer approaches:

  • Introduce perception diversity: add radar/lidar fusion with independent processing to reduce perception model misclassification probability.
  • Introduce uncertainty-aware decision logic: if perception confidence below threshold, execute safe-stop fallback, reducing P_decision-induced top events.
  • Improve monitoring and rollback for OTA updates to cut probability of a decision logic regression.

Quantitatively, note that for small probabilities an OR gate behaves almost additively, so halving the perception misclassification term alone cuts P_top by only about 35–40% with the numbers above — not more than half. To reach a 50% reduction, combine perception diversity with decision-logic mitigations, or cut the perception model error by close to an order of magnitude (as in the earlier key insight).
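A quick sensitivity check with the illustrative FSD numbers shows how far halving the perception model error actually moves the top event:

```python
def p_or(probs):
    """OR gate over independent events."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

others = [0.0002, 0.00002, 0.0001]   # decision, actuator, localization
base   = p_or([p_or([0.0001, 0.00005, 0.001])]  + others)
halved = p_or([p_or([0.0001, 0.00005, 0.0005])] + others)  # model error halved
print(f"relative reduction: {(base - halved) / base:.1%}")  # ~34%
```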

Conclusion: Make redundancy purposeful

Redundancy is a powerful lever to improve system reliability — but only when applied with awareness of independence, diversity, and operational readiness. Fault tree analysis gives you the language to map causes, quantify risk, and prioritize mitigations. In 2026, with regulators pressing for transparent safety evidence (as in the Tesla FSD inquiries) and network outages drawing public scrutiny (as in high-profile Verizon incidents), teams that can show quantified, tested resilience gain a competitive and compliance advantage.

Actionable next steps: Pick one critical top event in your system, run a one-hour FTA with your ops and dev leads, compute MTBF-derived probabilities for each basic event, and prototype one redundant or diverse mitigation to test in staging.

Call to action

Want a ready-to-use FTA worksheet and MTBF calculator? Download the free reliability toolkit at equations.top, or sign up for a guided FTA coaching session. Strengthen your systems before the next outage or investigation — design, measure, and prove resilience.
