Lessons in risk management from the Boeing 737 MAX tragedies
The Boeing 737 MAX disasters critically shook the trust of airlines, passengers and regulators globally. Although the lessons learned have been the subject of numerous government reports and much investigative journalism, little has been said about metrics that could have provided early warnings of risk, and thus potentially averted disaster. This viewpoint draws on the lessons of the MAX groundings to illustrate the importance of selecting, monitoring, and acting upon risk indicators to preemptively manage risk, as well as to provide opportunities to reduce total cost of risk, improve financial performance, and assure the board that risk is being addressed on a controlled and informed basis.
The human and commercial toll
Advertised as reliable, efficient and “a pilot’s best friend”, the 737 MAX launched in 2017. It quickly became Boeing’s fastest-selling aircraft ever. However, in October 2018, Lion Air Flight 610 crashed into the Java Sea, killing all on board. Questions were raised about the design and, in particular, the software in use on the MAX, but Boeing assured customers and passengers that it was safe. Then, in March 2019, less than five months later, Ethiopian Airlines Flight 302 crashed shortly after take-off. Every crew member and passenger died. There were clear similarities between these accidents, and within days the 737 MAX was grounded worldwide, affecting 387 planes from 59 airlines. The cost to Boeing so far, including over 800 canceled orders and an ever-growing number of lawsuits, is estimated to be in excess of $18 billion, in addition to considerable reputational damage. The 737 MAX, the aircraft intended to be Boeing’s leading weapon in the ongoing battle with rival Airbus, became a serious liability.
The crashes of the two 737 MAX aircraft led to global scrutiny of Boeing’s practices and culture. There have been several investigations into the catastrophic failures that led to the loss of 346 lives. These identified several critical factors:
- Production pressure – The unexpected unveiling of the Airbus A320neo forced Boeing into a race against time to produce a model that was competitive, but similar enough to earlier 737 aircraft that it could be flown by the same pilots with minimal training, and thus be awarded an amended type certificate by the Federal Aviation Administration (FAA).
- The software “band-aid” – Design constraints meant the MAX had undesirable aerodynamic characteristics. Boeing attempted to correct the dynamic instability with software: the Maneuvering Characteristics Augmentation System, or MCAS. The software, which had not existed in previous 737 models, was designed to automatically push the plane’s nose down in certain conditions. Boeing assumed pilots, who were unaware of MCAS, would be able to mitigate malfunctions. However, in reality, the software could overrule the pilot based on one measurement from a potentially faulty sensor.
- Lack of redundancy on safety-critical components – Boeing permitted the software to depend on a single angle of attack (AOA) sensor, despite there being two installed on each aircraft. The AOA disagree alert, designed to warn pilots of contrasting readings, was not available on some versions of the MAX.
- Insufficient training – The training procedures on the MAX, which were greenlighted by the FAA, left pilots underprepared to deal with MCAS malfunctioning. Pilots weren’t told about the MCAS software, and it was not mentioned in the amended type certificate. Lack of need for simulator training was a key feature of Boeing’s marketing strategy; a two-hour, computer-based course was all that was required for a 737 pilot to be permitted to fly the MAX.
- Delegated regulatory authority – Lack of resources meant that it was common practice for the FAA to delegate safety certification of Boeing products to Boeing itself. This was carried out by “authorized representatives”, who were employed by Boeing but represented the interests of the FAA. By 2018, the FAA was allowing Boeing to certify 96 percent of its own work. Delegating this responsibility left Boeing with a conflict of interest that, when coupled with miscommunications, impacted the independence of the FAA.
- Poor post-incident response – The Lion Air crash could have been an opportunity for Boeing to identify its own problem and rectify it. However, Boeing looked to absolve itself of any responsibility, questioned the abilities of the pilots, and was slow to fully acknowledge and comprehend the threat of MCAS.
The myriad failures that contributed to the crashes underscore serious systemic shortcomings in Boeing’s ability to manage risk. In addition to the organizational, cultural and technical failures, there was apparent negligence in the management of risk. Could Boeing have recognized early signs that warned of potential failure? Is it possible that there were missed opportunities to control this risk?
Key risk indicators
Key risk indicators (KRIs) are increasingly powerful tools in the management of operational risk. When selected appropriately, they can be used to provide foresight of potential risk before it is too late to take corrective action. In this way, an organization can reduce the likelihood of potentially catastrophic events, and thus mitigate serious safety, financial and reputational losses before they occur. KRIs may be used to assess:
- Current risk exposure levels, by linking real-time data to potential loss events.
- Emerging risks (via leading indicators) that may need to be addressed now or in the future.
- Risk events that have happened and may happen again (via lagging indicators).
Leading indicators are particularly effective in enabling an organization to move towards a predictive early-warning system. In a climate of increasing risk complexity, hindsight is no longer sufficient; risk management systems must be able to keep pace and respond to emerging threats. With the support of artificial intelligence and machine learning, it is possible to monitor risk exposure and make decisions in real time.
However, this data-driven, forward-looking approach can be challenging for many organizations to implement, particularly with the lack of guidance that may be available to them.
Effective KRI selection
For a KRI framework to be successful, it is vital that KRIs are well selected so they can support the business and its objectives within the risk appetite. The indicators must be relevant to the risk environment in which the organization operates, with explicit links to available data, which can come from many internal and external sources.
Choice of indicator may be drawn from risk assessment; a bowtie analysis may be particularly advantageous here. Through this method, potential underlying causes of risk events are identified, which are evaluated to determine predictive measures that could serve as KRIs. These must be quantifiable to allow for objectivity and comparability.
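As a rough sketch of how a bowtie analysis might feed KRI selection, the Python below records the causes on one side of a risk event and keeps only those with a quantifiable predictive measure. The class names, the top event wording, and the example metrics are hypothetical illustrations based loosely on themes in this viewpoint, not a real Boeing risk register.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cause:
    """One potential underlying cause on the left-hand side of the bowtie."""
    description: str
    candidate_metric: Optional[str] = None  # measure that might predict the cause
    quantifiable: bool = False              # can it be measured objectively?


@dataclass
class Bowtie:
    """Causes leading to a risk event, and the consequences that follow it."""
    top_event: str
    causes: List[Cause] = field(default_factory=list)
    consequences: List[str] = field(default_factory=list)

    def candidate_kris(self) -> List[str]:
        """Keep only causes that have a metric and can be quantified."""
        return [c.candidate_metric for c in self.causes
                if c.candidate_metric and c.quantifiable]


# Illustrative entries only; descriptions and metrics are assumptions.
bowtie = Bowtie(
    top_event="Flight-control software acts on faulty sensor data",
    causes=[
        Cause("Schedule pressure on engineering staff",
              "pct_staff_reporting_undue_pressure", quantifiable=True),
        Cause("Drawings released incomplete or with errors",
              "pct_drawings_released_with_errors", quantifiable=True),
        Cause("Inaccurate assumptions about pilot response"),  # no metric yet
    ],
    consequences=["Loss of life", "Fleet grounding"],
)

for kri in bowtie.candidate_kris():
    print(kri)
```

The third cause is filtered out because no objective measure has been attached to it yet, which is itself useful information: it marks a cause that the framework cannot currently monitor.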
It is also possible to establish KRIs by looking to historical data and events both inside and outside the organization. With hindsight, it is often easier to identify precursors to events that could have acted as early warnings. Consider Air France Flight 447, for example, which crashed into the Atlantic Ocean in 2009, killing 228 people. As in the MAX crashes, the pilots struggled to respond when the software on the plane began to behave unexpectedly. This exposed KRIs relating to human interaction with software systems that Boeing could have learned from: the level of control that software can exert, insufficient pilot training and experience, and inaccurate assumptions about human behavior, particularly in crisis situations.
Indicator data should be simple, cost-effective to collect and report, and easy to interpret. It is a huge advantage if this data can be provided in real time (or close to real time) digitally, without relying on manual data entry, which slows down KRI reporting and is prone to error. For Boeing, whose management may have become disconnected from the day-to-day engineering challenges following an organizational restructuring in 2001,¹ a simple KRI dashboard could have provided senior decision-makers with a reliable means to track risk exposure.
Extracting value from KRIs
In order to make effective use of KRI data, an organization needs to establish “escalation triggers”. These are thresholds that, if breached by an indicator, provoke actions from specified parties. Such thresholds and limits are intrinsically linked to the organization’s risk appetite and can be established from historical performance data and typical ranges. They can be monitored and adjusted as required to reflect changes in the risk landscape.
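A minimal sketch of an escalation trigger, assuming a two-level (amber and red) scheme, might look like the Python below. The indicator name, the threshold values, and the roles that are notified are hypothetical and chosen only for illustration; in practice they would follow from the organization's own risk appetite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationTrigger:
    """Thresholds for one KRI; all values used here are illustrative assumptions."""
    indicator: str
    amber: float       # breach prompts review by the risk owner
    red: float         # breach prompts escalation to senior management
    amber_owner: str
    red_owner: str

    def evaluate(self, value: float) -> Optional[str]:
        """Return the party that must act, or None if no threshold is breached."""
        if value >= self.red:
            return self.red_owner
        if value >= self.amber:
            return self.amber_owner
        return None


# Hypothetical trigger: share of staff reporting undue schedule pressure.
pressure_trigger = EscalationTrigger(
    indicator="pct_staff_reporting_undue_pressure",
    amber=20.0,
    red=35.0,
    amber_owner="program risk owner",
    red_owner="executive risk committee",
)

for reading in (12.0, 24.0, 39.0):
    owner = pressure_trigger.evaluate(reading)
    action = f"escalate to {owner}" if owner else "no action"
    print(f"{pressure_trigger.indicator} = {reading:.0f}%: {action}")
```

Naming the responsible party inside the trigger keeps the link between a breach and the required action explicit, so a reading above a threshold cannot be logged without someone being assigned to respond.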
There are a number of factors that could have incited Boeing to take action if thresholds and monitoring systems had been in place. These include:
- Staff complaints – A growing body of complaints and concerns was being raised and documented.
- Staff under pressure – A study found 39 percent of Boeing employees felt they were under “undue pressure”.
- Simulator testing results – Pilots reported the MCAS system was “running rampant”. The first officer in the Lion Air crash, who had performed poorly in training, struggled to run through a list of procedures that he should have memorized.
- Errors made by engineers – Blueprints were being produced at double the normal rate, and were often delivered to the factory floor incomplete or with errors.
It is important to use appropriately pitched threshold levels. If these are too loose, actions will not be triggered in a timely manner. If they are too strict, there may be a series of false alarms, which could result in a “boy who cried wolf” scenario. It is likely, for example, that in the product development process for the 737 MAX, there would have been unsuccessful simulator tests. However, the value lies in determining at what frequency this becomes “unacceptable” and knowing what action is required.
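The trade-off can be illustrated with a short Python sketch. It assumes one simple heuristic, a limit set k standard deviations above the historical mean, and compares how many alerts a strict, a moderate, and a loose setting would raise. The monthly counts of unsuccessful simulator tests are invented for illustration, and the heuristic is only one of many ways a threshold could be derived.

```python
from statistics import mean, stdev

# Hypothetical monthly counts of unsuccessful simulator tests (illustrative only).
history = [3, 4, 2, 5, 3, 4, 3, 6, 4, 3, 5, 4]  # assumed "normal" baseline year
recent = [4, 6, 7, 9]                           # assumed later, rising trend


def threshold(baseline, k):
    """Limit set k sample standard deviations above the baseline mean."""
    return mean(baseline) + k * stdev(baseline)


def count_alerts(values, limit):
    """Number of periods in which the indicator exceeds the limit."""
    return sum(1 for v in values if v > limit)


# Small k = strict threshold, large k = loose threshold.
for k in (0.5, 2.0, 5.0):
    limit = threshold(history, k)
    print(f"k={k}: limit={limit:.1f}, "
          f"alerts in baseline={count_alerts(history, limit)}, "
          f"alerts in recent={count_alerts(recent, limit)}")
```

With these made-up numbers, the strictest setting raises alerts even during the baseline period (false alarms), while the loosest setting stays silent through the rising trend. The middle setting is the only one that is quiet in the baseline yet reacts to the later months.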
Often overlooked are the consequences of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” In essence, the focus shifts to managing the indicator rather than the underlying risk. Where the organizational culture prioritizes improving the numbers, efforts to address the emerging risk trends may be misplaced. This effect is demonstrated by the Volkswagen emissions scandal. Regulation sought to limit the toxic gases produced by vehicle engines. However, instead of designing the cleaner and greener cars that the regulation intended, Volkswagen engineered an exhaust system that could limit emissions during testing and thus sidestep the restrictions.
Balancing KPIs with KRIs
The MAX crashes draw many parallels with the 1986 Space Shuttle Challenger disaster. NASA leaders, like Boeing, were incentivized by speed and power, and the results were telling. NASA also repeatedly ignored safety warnings by its own engineers that intervention was needed to prevent a disaster during the launch. Achieving performance targets outweighed safety concerns.
As with NASA, one of the principal issues that Boeing had to deal with was the inherent conflict between KRIs and key performance indicators (KPIs). KRIs differ from KPIs in that KPIs track historical performance, while KRIs provide foresight of future threats. However, KRIs also share some parallels with KPIs: both are linked to a company’s strategic priorities, and both take the form of metrics that flag when a business is off track in meeting its objectives.
Boeing exists in a commercial climate with only one major competitor, Airbus. The MAX was born out of commercial necessity due to their aggressive rivalry. If Boeing did not hit its KPIs, it risked losing market share to its main competitor, which could have been very difficult to recover. However, a good set of KRIs, based on a robust cause-and-effect analysis supplied with good data, could have provided warning to Boeing’s management to slow down on the development of the 737 MAX. KPIs alone could not have provided this early warning so effectively.
Schedule and cost are, of course, critical drivers for any engineering and manufacturing company to stay commercially competitive. However, after many disasters, safety and quality were found to have been undermined or compromised by such commercial drivers. In the case of Boeing, KPIs would have pointed to increased growth rates for decades, but could use of KRIs have painted a different picture? With a more forward-looking risk capability, Boeing might have been more alert to safety as a key business function.
Insight for executives
Development of safety-critical products always brings risks that must be managed. In the case of the MAX, production pressures meant Boeing struggled to balance risk management and business performance.
Modern companies are awash with vast quantities of data, but our research shows that few companies are good at exploiting this data to provide a wide range of effective indicators of potential risk. The key skill is in making use of this data and establishing how this can be used prior to events unfolding, rather than in hindsight, when it is too late.
KRIs provide this capability, based on robust cause-and-effect analysis, ingestion of real-time data, and warnings against carefully set thresholds.
Although the MAX planes are now starting to fly again, they were grounded worldwide for 20 months, a period for Boeing and the FAA to reflect on and correct both known and newly identified risks. How many of these risks might have been predicted by KRIs?
The risk lessons learned can be applied across the travel and transport industry: we can scrutinize our own practices and select KRIs to stop accidents such as those with the MAX from happening again.