Learn from Failures

This article looks at two past incidents in which mistakes in a few simple lines of code led to catastrophic failures. Incidents like these show how important it is to verify the systems we build.

Ariane 5 Launch Failure

The Ariane project was run by the European Space Agency with the aim of launching commercial payloads, such as communication satellites, into Earth orbit. Ariane 5 was the successor to Ariane 4, which had been a successful project. Ariane 4 could carry a payload of 5000 kg, and it had four engines providing the thrust to leave Earth. It was the agency's standard launch vehicle until the more powerful Ariane 5 was built. Ariane 5 was designed to carry a payload of 9600 kg, and most of its systems were taken from its predecessor with further improvements. The maiden flight of Ariane 5 took place on June 4, 1996. The weather on the day before was not suitable for a launch but, surprisingly, launch day was clear. Approximately 37 seconds after launch, Ariane 5 lost control and, soon after, self-destructed. The failure was caused by incorrect control signals sent to the engines. Those wrong instructions swiveled the engine nozzles, and the unexpected stresses this imposed on the rocket caused the structure to fail. As a failsafe mechanism, Ariane 5 then entered self-destruction mode.


AI-generated image (credit: DeepAI)

Why were incorrect control signals sent to the engines?

The engines are controlled according to the current attitude and trajectory of the rocket, so that they can maintain the required angle to the vertical. These measurements were taken by a computer-based inertial reference system (IRS). The problem lay in the software running on the IRS: it was a measurement validation failure. The hardware had been designed with a lot of redundancy in case of failure. There were two inertial reference systems, one acting as the main system and the other as the backup. However, both ran the same software, because the redundancy had been added to guard against hardware failures, not software faults.

What was the software problem?

Although the engineers upgraded the rocket's performance when building Ariane 5, they reused the software from Ariane 4. They made no modifications, reasoning that changes might introduce new problems and that the software had already been tested on Ariane 4. The problem occurred when the IRS attempted to convert a 64-bit floating-point number to a signed 16-bit integer to represent the velocity. The maximum positive value that a signed 16-bit integer can hold is 32767; a larger value cannot be represented and, if the conversion is left unchecked, can wrap around to a negative number. This conversion had been examined during Ariane 4 development, and the engineers left it without an exception handler because the value never exceeded the 16-bit limit on Ariane 4 and because omitting the handler reduced the processor workload. Extra processor capacity was not needed for Ariane 4, and for Ariane 5 the engineers thought that changing it would introduce more dependability issues. Also, because Ariane 4 had flown successfully, the inertial reference system code was not reviewed again. The default exception-handling scheme was to shut down the system. This caused the inertial reference system to fail, so the engines received incorrect control signals.

The shutdown of the inertial reference system caused the rocket to lose control and deviate from its intended flight path. To prevent the rocket from veering into a potentially populated area, the self-destruct sequence was triggered, causing the rocket to explode.
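The conversion problem described above can be sketched in a few lines of Python. This is only an illustration, not the flight software: the function names are hypothetical, and the three functions simply contrast an unprotected conversion that wraps around, a checked conversion that raises an error, and a saturating conversion that clamps the value to the 16-bit range.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(value: float) -> int:
    """Keep only the low 16 bits, as an unprotected conversion might (hypothetical)."""
    low_bits = int(value) & 0xFFFF
    # Reinterpret the 16-bit pattern as a signed (two's complement) number.
    return low_bits - 0x10000 if low_bits >= 0x8000 else low_bits


def checked_to_int16(value: float) -> int:
    """Raise an error instead of silently producing a wrong number."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return int(value)


def saturating_to_int16(value: float) -> int:
    """Clamp out-of-range values to the nearest representable limit."""
    return int(max(INT16_MIN, min(INT16_MAX, value)))


if __name__ == "__main__":
    for sample in (1200.0, 32767.0, 32768.0, 40000.0):
        print(sample, unchecked_to_int16(sample), saturating_to_int16(sample))
```

Whichever policy is chosen, the point is that an out-of-range input needs an explicit decision. A check that raises an error only helps if something sensible handles that error; shutting the whole unit down, as the default scheme did, turns a single bad value into a system failure.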

The Patriot Failure

Anti-missile systems are used in war to defend against enemy missile attacks, and they were used successfully during the Gulf War. Patriot was a missile interception system used by the American military. It had an accuracy of around 98% until February 25, 1991. On that day, an Iraqi Scud missile was approaching Saudi Arabia, and a software failure in the Patriot system caused the interception to fail entirely.

What is a missile interception system?

A missile interception system must calculate the firing angle and the time to fire in order to destroy an incoming missile while it is still in the air. To hit the target, these calculations must be extremely precise and accurate.


AI-generated image (credit: DeepAI)

What caused the Patriot missile system failure?

This failure occurred due to an inaccurate calculation of time caused by computer arithmetic errors. The internal clock in the Patriot's embedded system counted time in tenths of a second. To obtain the time in seconds, the programmers multiplied the clock value by $1/10$. This calculation was done using a 24-bit fixed-point register. Because $1/10$ has a non-terminating binary expansion, it was truncated to fit the register. When this truncated value is multiplied by a large number, it introduces a significant error into the system.
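To see why $1/10$ cannot be stored exactly, the truncation can be reproduced with exact rational arithmetic. The sketch below is an illustration under a stated assumption: it keeps 23 bits after the radix point, because that is the choice that reproduces the error of about $0.000000095$ usually quoted for this incident. The real layout of the 24-bit register is not described here, and the helper names `chop` and `binary_digits` are hypothetical.

```python
import math
from fractions import Fraction

# Assumption: 23 fractional bits, chosen to match the commonly quoted error.
FRACTION_BITS = 23


def chop(value: Fraction, fraction_bits: int) -> Fraction:
    """Truncate a non-negative value to the given number of binary fraction bits."""
    scale = 2 ** fraction_bits
    return Fraction(math.floor(value * scale), scale)


def binary_digits(value: Fraction, count: int) -> str:
    """Return the first `count` binary digits after the radix point (0 <= value < 1)."""
    digits = []
    for _ in range(count):
        value *= 2
        bit = math.floor(value)
        digits.append(str(bit))
        value -= bit
    return "0." + "".join(digits)


tenth = Fraction(1, 10)
stored = chop(tenth, FRACTION_BITS)  # what fits in the register
error = tenth - stored               # what was lost by truncation

if __name__ == "__main__":
    print("exact 1/10:", binary_digits(tenth, 31))
    print("stored    :", binary_digits(stored, 31))
    print("error     :", binary_digits(error, 31))
    print("error as a decimal:", float(error))  # roughly 9.5e-08
```

Using `Fraction` keeps every step exact, so the only error in the result is the one introduced deliberately by `chop`.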

At the time of the failure, the system had been running for about 100 hours, and the error accumulated over that time was about 0.34 seconds. The binary expansion of $1/10$ is $0.0001100110011001100110011001100\ldots$, and the 24-bit register stored the value with an error of $0.0000000000000000000000011001100\ldots$ in binary, or about $0.000000095$ in decimal. Over 100 hours, counted in tenths of a second, this error adds up to $0.000000095 \times 100 \times 60 \times 60 \times 10 \approx 0.34$ seconds. A Scud missile travels at about 1,676 meters per second, so in 0.34 seconds it covers more than half a kilometer. This was enough for the Scud to escape the Patriot, causing great destruction, killing 28 soldiers, and injuring around 100 other people.
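The arithmetic above can be checked with a short script that shows how the drift grows with uptime. This is a rough sketch, not a model of the real tracking software: the per-tick error is taken as $0.1 \times 2^{-20}$, an assumed exact value that rounds to the $0.000000095$ used in the calculation, and the function names are hypothetical.

```python
from fractions import Fraction

TICKS_PER_SECOND = 10                      # the clock counts tenths of a second
ERROR_PER_TICK = Fraction(1, 10) / 2**20   # assumed exact value, about 0.000000095 s
SCUD_SPEED_M_PER_S = 1676                  # speed used in the article


def clock_drift_seconds(hours_running: int) -> float:
    """Accumulated timing error after the given number of hours of uptime."""
    ticks = hours_running * 3600 * TICKS_PER_SECOND
    return float(ticks * ERROR_PER_TICK)


def tracking_offset_meters(hours_running: int) -> float:
    """Distance a Scud covers during the accumulated timing error."""
    return clock_drift_seconds(hours_running) * SCUD_SPEED_M_PER_S


if __name__ == "__main__":
    for hours in (1, 8, 20, 48, 72, 100):
        drift = clock_drift_seconds(hours)
        offset = tracking_offset_meters(hours)
        print(f"{hours:>4} h  drift {drift:.4f} s  offset {offset:.0f} m")
```

The drift is proportional to uptime, which is why a system that behaves well after a few hours can be badly off after several days without a restart.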

Indeed, we cannot verify every scenario that a system may face over its entire lifetime. But there are cases where a simple test could catch an issue that would otherwise cause a catastrophic failure. In conclusion, learning from these mistakes can help us build better and safer systems in the future.