Managing Reliability: Increasingly a Matter of Business Survival with Darcy Brooker

Reliability is important for its effect on capability – that is, our ability to safely perform an activity or achieve an outcome – such that the capability is achieved with the lowest total ownership cost. So, despite often being viewed as a technical field of engineering or logistic support analysis, reliability management is also a facet of business or operational decision making.

‘The best reliability management decisions are based on business value’

Like all good business decisions, the best reliability management decisions are based on business value. Most often, such decisions involve some form of trade-off among scarce resources, including time and personnel. But as with any competent decision-making, optimum reliability management decisions are made with the fullest knowledge of the likely and potential benefits, detrimental risks, and short-term and long-term costs of all the decision alternatives (including the decision to do nothing). Sometimes reliability management decisions (or the failure to make decisions) occur without such an appreciation, particularly of the potential benefits. This article outlines, in general terms, the business benefits of managing reliability.

Reliability might be considered a synonym for survival and an antonym for failure. When people ask “What is the value of managing reliability when other issues are much more important?”, the complementary question is “What are the consequences of failure?” Managing reliability is managing the risk of the capability or asset failing. If asset failure consequences are relatively unimportant, there is little value in managing reliability – but then why do we have the assets at all? Reliability is important whenever the risks or costs of failure are significant.

About the ‘total ownership cost iceberg’

But looking specifically at the financial aspect, one simple way to consider the business impacts of reliability performance is to consider that, in general, purchase price (acquisition cost) may comprise around 30% of total ownership cost. More than double the purchase price is spent over the period of use (the operations and support period) in direct operating costs (e.g. operator salaries and fuel), but other sustainment costs which relate directly to asset failures and failure prevention can predominate. These costs include support equipment, maintainers and logisticians, spare parts, warehousing, transportation, and training relating to sustainment. Collectively, these costs are sometimes described as the total ownership cost iceberg, where only the acquisition cost is visible above the waterline.
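As a rough illustration of the iceberg arithmetic, the sketch below scales a purchase price up to an estimated total ownership cost, using the approximate 30% acquisition share quoted above. The function name, the purchase price and the treatment of the share as a fixed fraction are assumptions made for illustration only; real shares vary by asset type and usage.

```python
def total_ownership_cost(acquisition_cost, acquisition_share=0.30):
    """Estimate total ownership cost from the purchase price.

    acquisition_share is the fraction of total ownership cost taken up by
    acquisition (about 0.30 in the text above; an assumption, not a constant).
    """
    if not 0 < acquisition_share <= 1:
        raise ValueError("acquisition_share must be in (0, 1]")
    total = acquisition_cost / acquisition_share
    return {
        "acquisition": acquisition_cost,
        "operations_and_support": total - acquisition_cost,
        "total": total,
    }


if __name__ == "__main__":
    # Hypothetical purchase price of 3 million (any currency).
    costs = total_ownership_cost(3_000_000)
    for name, value in costs.items():
        print(f"{name}: {value:,.0f}")
```

With a 30% share, the operations and support portion comes out at a little over twice the purchase price, which matches the "more than double" figure in the paragraph above.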

The business decision is often: how much of the acquisition cost are we prepared to devote to minimising sustainment costs? While a dollar now is preferred to a dollar in the future, what about a dollar now for ten dollars in the future? Reliability management can provide returns on investment (ROI) of 10:1 or greater. While there is always a point where further increases in reliability will provide negligible sustainment cost savings relative to the investment, most organisations are a long, long way from this position. Another way to think about this is to ask how much might be saved, or how far our capability might be improved, if our systems or products never failed in unpredictable ways, and what it is worth to get closer to this state.
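The "dollar now versus dollars later" trade-off can be sketched as a simple present-value calculation. The example below is a minimal sketch, not a method taken from this article: the function names, the 7% discount rate, the 15-year period and the cash amounts are all hypothetical, and it assumes equal savings arriving at the end of each year.

```python
def present_value(annual_saving, years, discount_rate):
    """Present value of equal end-of-year savings over a number of years."""
    return sum(
        annual_saving / (1 + discount_rate) ** year
        for year in range(1, years + 1)
    )


def discounted_roi(investment, annual_saving, years, discount_rate):
    """Ratio of discounted sustainment savings to the up-front investment."""
    if investment <= 0:
        raise ValueError("investment must be positive")
    return present_value(annual_saving, years, discount_rate) / investment


if __name__ == "__main__":
    # Hypothetical figures: spend 100 during acquisition to save 100 per year
    # in sustainment over a 15-year service life, discounted at 7% per year.
    ratio = discounted_roi(investment=100, annual_saving=100,
                           years=15, discount_rate=0.07)
    print(f"Discounted return per dollar invested: {ratio:.1f}")
```

Discounting lowers the ratio relative to the undiscounted sum of savings, which is why the comparison should be made on present values rather than on raw future dollars.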

The performance of many products and systems has improved dramatically over the years across many areas, but reliability performance has not always kept pace. For example, the Defense Science Board (2000) found that 80% of US Army systems failed to achieve even half of their reliability (MTBF) requirements in operational test. Reliability problems degrade capabilities and drive up costs, creating significant budget impacts. However, this need not happen; in many areas, products with increased performance and increased system complexity are also more reliable.

Likely reasons for under-performing reliability include insufficient attention or priority given to operations and support costs before purchase or acquisition, that is, an over-emphasis on technical performance. Technical performance considerations are often marred by poorly defined or unrealistic reliability requirements, as well as by too little planning of reliability-focused design activities and too few incentives to improve reliability during the production or acquisition of an asset or capability. The consequence is reduced capability and spending too much on activities such as system redesign, spares management, and maintenance.

About comprehensive reliability management programs

Comprehensive reliability management programs can be both feasible and effective. The key to acquiring high reliability products and systems is to recognize reliability as an integral performance characteristic and to systematically eliminate failures and ways of failing. Different activities and techniques to achieve this are needed at different times. For example, before acquiring new capabilities, we might review the reliability performance of existing capabilities and its effects, and set realistic reliability goals based on intended use conditions and the latest technologies, along with a broad, integrated plan or program of work required to achieve those goals. The design process should integrate design-for-reliability practices (such as derating and other design guidelines; reliability modelling, allocation and prediction; and failure modes, effects and criticality analysis). Prototyping might include highly accelerated life testing, and the start of a failure reporting and corrective action system. And so on, into production, and then into operations and support.

A truism in reliability engineering is that everything fails (eventually). Managing reliability, then, is about cost-effectively trading off some acquisition cost for substantially more in-use savings, designing in reliability, predicting lifetimes and failure onset, extending life (ideally to when the item is no longer required or else is superseded), preventing imminent failures through maintenance, and continuously improving performance in all these areas.

But there is no single ‘silver bullet’ solution set. Processes need to be tailored to suit the application and the organization, noting that problems are resolved most cost-effectively by finding and fixing them as close as possible to when they are created. The cost to correct an error typically multiplies 2 to 10 times in each subsequent ‘life-cycle phase’, and similar factors apply to reliability costs. For example, manufacturers with no reliability program in place may have warranty costs as high as 10% or more of their gross sales, whereas manufacturers who implement well-considered, targeted reliability processes can see warranty costs diminish to below 1% of gross sales. On top of these tangible benefits of improved reliability come many intangible benefits, including reputation (reliability increases perceived value, and retaining existing customers costs considerably less than acquiring new ones), lower production costs through higher first pass yield in test, and lower risk of product recall and engineering changes. Similarly, the total cost of an engineering change can increase by several orders of magnitude when it is made late in the product development cycle. The greatest opportunity to gain the benefits of improved reliability is as early as possible.
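To see how quickly a 2 to 10 times multiplier per phase compounds, the sketch below applies it to an error introduced at design time. The phase names and the base cost of 100 are hypothetical, and a constant multiplier across phases is a simplifying assumption rather than something this article states.

```python
# Hypothetical life-cycle phases, in order; the article does not name them.
PHASES = ("design", "prototype", "production", "operations")


def cost_to_correct(base_cost, phases_late, multiplier):
    """Cost to fix an error found phases_late phases after it was created."""
    if phases_late < 0:
        raise ValueError("phases_late must be non-negative")
    return base_cost * multiplier ** phases_late


if __name__ == "__main__":
    base_cost = 100  # hypothetical cost to fix the error in the design phase
    for phases_late, phase in enumerate(PHASES):
        low = cost_to_correct(base_cost, phases_late, 2)
        high = cost_to_correct(base_cost, phases_late, 10)
        print(f"found in {phase}: {low:,} to {high:,}")
```

Under these assumptions, an error left until three phases after its creation costs between 8 and 1,000 times as much to correct, consistent with the "several orders of magnitude" noted for late engineering changes.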

The importance of reliability management

Importantly though, reliability improvement cannot be switched on in an ad-hoc way for overnight results. It is the result of a top-level management commitment to provide the necessary resources, along with a credible plan. For some, it is a paradigm change. The greatest internal barriers to improving reliability are a poor understanding of what reliability actually is and represents, and resistance to change. For example, pressure to shorten product development cycles or to recover from schedule blowouts is often cited as the biggest reason that activities specifically required for reliability are skipped. The irony is that some reliability activities focus on standardizing proven high reliability design modules, which can simplify manufacturing and shorten product development time. Indeed, reduced development time for product introduction can be a major advantage of holistic reliability approaches.

Reliability management of modern systems can be particularly difficult, given that such systems comprise not just hardware, but also software, organizational and human elements. For example, many distributed service systems (network systems) now exist that show emergent behaviour and hence cannot be understood and properly described by looking solely at their constituent parts. Industrial accidents in the last few decades have clearly shown that organizational and human factors play a significant role in the risk of system failures and accidents. And when developing models and methods for the analysis of software failures, the physical hardware concept of failure mode does not apply, since software does not follow physical laws of degradation or failure. But despite the challenges, these first principles continue to apply.

Reliability is a fundamental performance characteristic of any modern technological system. Reliability deserves more attention, not only because of its importance to providing the capabilities we seek, but because of the large ROI opportunities.

Even for those convinced of the capability and business sense of reliability management, the statistical analysis of reliability data can be discouraging. Such analysis is often required because reliability is influenced by variability. The book below and its included tools allow newcomers to extract the information from reliability data needed to make sound decisions. The tools presented can be applied across most technologies and industries, and the information and insights gained by doing so can help reduce total ownership costs and improve capability through better reliability management.


Practical Reliability Data Analysis for Non-Reliability Engineers

Copyright: 2020
Pages: 157
ISBN: 9781630818272

This practical resource presents basic probabilistic and statistical methods or tools used to extract the information from reliability data to make sound decisions. It consolidates and condenses the reliability data analysis methods most often used in everyday practice into an easy-to-follow guide, while also providing a solid foundation from which to explore more complex methods if desired.

The book provides mathematical and Excel spreadsheet formulas to estimate parameters and confidence bounds (uncertainty) for the most common probability distributions used in reliability analysis. Several other Excel tools are provided to aid users without access to expensive, dedicated, commercial tools. This book and tools were developed by the authors after many years of teaching the fundamentals of reliability data analysis to a broad range of technical and non-technical military and civilian personnel, making it useful for both novice and experienced engineers.


Author bio:

Darcy Brooker is an L2 Director of Program Integration and Interoperability, CASG Joint Systems Division Communication Systems Branch, Department of Defence, Australia. He holds a Master of Engineering in reliability engineering, a Master of Science in information technology, a Master of Science in computer science, a Master of Engineering Science, a Master of Management Studies in project management, a Master of Business Administration in technology management and a Master of Science in operations research/analysis.
