Cover Page

Improving Product Reliability and Software Quality

Strategies, Tools, Process and Implementation

Second Edition

Mark A. Levin

Teradyne, Inc.
California, USA

 

Ted T. Kalal

Retired
Texas, USA

 

Jonathan Rodin

Teradyne, Inc.
California, USA


About the Authors

Mark A. Levin is the reliability manager at Teradyne, Inc. and is based in Agoura Hills, California. He received his bachelor of science degree in Electrical Engineering (1982) from the University of Arizona, a master of science degree in Technology Management (1999) from Pepperdine University, a master of science in Reliability Engineering (2009) from the University of Maryland, and all but dissertation for a PhD in Reliability Engineering from the University of Maryland. He has more than 36 years of electronics experience spanning the aerospace, defense, consumer, and medical electronics industries. He has held several management and research positions at Hughes Aircraft Missiles Systems Group, Hughes Aircraft Microwave Products Division, General Medical Company, and Medical Data Electronics. His experience is diverse, having worked in manufacturing, design, and research and development. He has developed manufacturing and reliability design guidelines, reliability training classes, workmanship standards, quality programs, JIT manufacturing, and ESD safe work environments, and has established a surface mount production facility.

(Mark.levin@Teradyne.com)

Ted T. Kalal is a reliability engineer (now retired) who has gained much of his understanding of reliability from hands‐on experience and from many great mentors. He is a graduate of the University of Wisconsin (1981) in Business Administration after completing much preliminary study in mathematics, physics, and electronics. He has held many positions as a contract engineer and as a consultant, where he was able to focus on design, quality, and reliability tasks. He has authored several papers on electronic circuitry and holds a patent in the field of power electronics. With two partners, he started a small manufacturing company that makes high‐tech power supplies and other scientific apparatus for the bioresearch community.

Jonathan Rodin is a software engineering manager at Teradyne, Inc. A graduate of Columbia University (1981), Jon has 39 years of experience developing software, both working as a programmer and managing software development projects. His experience spans companies of many sizes, ranging from early stage startups to companies of greater than 100 000 employees. Prior to joining Teradyne, Jon held executive engineering management positions at FTP Software, NaviSite, and Percussion Software. He has led software process reengineering projects numerous times, most recently driving the effort to bring Teradyne's Semiconductor Test Division to CMMI Level 3.

List of Figures

    1. Figure 1.1 Product cost is determined early in development.
    2. Figure 1.2 Cost to fix a design increases an order of magnitude with each subse...
    3. Figure 1.3 The reliability process reduces the number of ECOs required after pr...
    4. Figure 1.4 Including reliability in concurrent engineering reduces time to mark...
    5. Figure 1.5 Product introduction relative to competitors.
    6. Figure 1.6 The ICM process.
    1. Figure 2.1 Overcoming reliability hurdles brings significant rewards.
    1. Figure 5.1 The six phases of the product life cycle.
    2. Figure 5.2 The ICM process.
    3. Figure 5.3 A risk mitigation program (ICM) needs to address risk issues in all ...
    1. Figure 6.1 The bathtub curve (timescale is logarithmic).
    2. Figure 6.2 Cumulative failure curve.
    3. Figure 6.3 Light bulb theoretical example.
    4. Figure 6.4 Availability as a function of MTBF and MTTR. Note: The curve has a s...
    5. Figure 6.5 Design maturity testing – accept/reject criteria.
    6. Figure 6.6 Number of fan failures vs. run time.
    7. Figure 6.7 Mechanism that can cause degradation and failure.
    8. Figure 6.8 PHM data collection and processing to detect degradation.
    1. Figure 7.1 Functional block diagram.
    2. Figure 7.2 Filled‐out functional block diagram.
    3. Figure 7.3 Schematic diagram of a flashlight.
    4. Figure 7.4 Functional block diagram of a flashlight.
    5. Figure 7.5 Functional block diagram of a flashlight using Post‐its.
    6. Figure 7.6 Fault tree logic symbols.
    7. Figure 7.7 Fault tree diagram for flashlight using Post‐its.
    8. Figure 7.8 Logic flow diagram.
    9. Figure 7.9 Fault tree logic diagram.
    10. Figure 7.10 Flashlight fault tree logic diagram.
    11. Figure 7.11 Functional block diagram for the flashlight process.
    12. Figure 7.12 Example of a SFTA for an execution flow failure.
    1. Figure 8.1 Pareto of failures.
    2. Figure 8.2 HALT failure percentage by stress type.
    3. Figure 8.3 Product design specification limits.
    4. Figure 8.4 Design margin.
    5. Figure 8.5 Some products fail product spec.
    6. Figure 8.6 HALT increases design margin.
    7. Figure 8.7 Soft and hard failures.
    8. Figure 8.8 Impact of HALT on design margins.
    9. Figure 8.9 Two heat exchangers placed in front of the chamber's forced air.
    10. Figure 8.10 Test setup profile to check out connections and functionality.
    11. Figure 8.11 Temperature step stress with power cycle at the end of each step.
    12. Figure 8.12 Vibration step stress.
    13. Figure 8.13 Temperature and vibration step stress.
    14. Figure 8.14 Rapid thermal cycling.
    15. Figure 8.15 Slow temperature ramp.
    16. Figure 8.16 Slow temperature ramp with constantly varying vibration level.
    17. Figure 8.17 HASS stress levels.
    18. Figure 8.18 The bathtub curve.
    19. Figure 8.19 HASA plan.
    20. Figure 8.20 A HALT chamber has six simultaneous degrees of freedom (movement).
    21. Figure 8.21 ARG process flow.
    22. Figure 8.22 Accelerated reliability growth.
    23. Figure 8.23 ARG and ELT acceleration test plans.
    24. Figure 8.24 Selective process control.
    1. Figure 9.1 Quality ROI chart (financial impact of escapes is low).
    2. Figure 9.2 Quality ROI chart (financial impact of escapes is high).
    3. Figure 9.3 Sample line counts.
    4. Figure 9.4 Defect run chart 1.
    5. Figure 9.5 Defect run chart 2.
    6. Figure 9.6 Comparative escape rates.
    1. Figure 10.1 Generic fishbone diagram.
    2. Figure 10.2 Sample fishbone diagram.
    3. Figure 10.3 Sample Pareto chart.
    4. Figure 10.4 Code review root cause Pareto.
    5. Figure 10.5 Try‐catch code example.
    1. Figure 11.1 Waterfall life cycle.
    2. Figure 11.2 Quality processes in a waterfall life cycle.
    3. Figure 11.3 Sprint activities.
    4. Figure 11.4 Sprint activities in an epic.
    1. Figure 12.1 Sample requirements.
    2. Figure 12.2 Sample user stories.
    3. Figure 12.3 Code comments example.
    4. Figure 12.4 Sample UART HAL code.
    1. Figure 15.1 ESPEC/Qualmark HALT chamber.
    1. Figure 17.1 The six phases of the product life cycle.
    2. Figure 17.2 The hardware reliability process.
    3. Figure 17.3 Proactive activities in the product life cycle.
    1. Figure 18.1 Product concept phase risk mitigation form.
    2. Figure 18.2 Risk severity scale.
    3. Figure 18.3 ICM sign‐off required before proceeding to design concept.
    1. Figure 19.1 Opportunity to affect product cost.
    2. Figure 19.2 The bathtub curve.
    3. Figure 19.3 System MTBF requirement.
    4. Figure 19.4 Subsystem MTBF requirement.
    5. Figure 19.5 180° of reliability risk mitigation.
    6. Figure 19.6 Where to look for new reliability risks.
    7. Figure 19.7 The reliability risk mitigation process.
    8. Figure 19.8 The ICM is an effective gate to determine if the project should pro...
    1. Figure 20.1 The first phase of the product life cycle.
    2. Figure 20.2 Looking forward to identify risk issues.
    3. Figure 20.3 Risk mitigation strategies for reliability and performance.
    4. Figure 20.4 Risk growth curve shows the rate at which risk issues are identifie...
    5. Figure 20.5 DFR guideline for electrolytic capacitor usage.
    6. Figure 20.6 HALT planning flow.
    7. Figure 20.7 HALT planning checklist.
    8. Figure 20.8 HALT development phase.
    1. Figure 21.1 Reliability activities in the validation phase.
    2. Figure 21.2 HALT process flow.
    3. Figure 21.3 HALT test setup verification test.
    4. Figure 21.4 Temperature step stress.
    5. Figure 21.5 Vibration step stress.
    6. Figure 21.6 Temperature and vibration step stress.
    7. Figure 21.7 Rapid thermal cycling (60 °C min⁻¹).
    8. Figure 21.8 Slow temperature ramp.
    9. Figure 21.9 Slow temperature ramp and sinusoidal amplitude vibration.
    10. Figure 21.10 HALT form to log failures.
    11. Figure 21.11 HALT graph paper for documenting test.
    12. Figure 21.12 HASS stress levels.
    13. Figure 21.13 HASS profile.
    1. Figure 22.1 Assert functions can be used with an appropriate header.
    2. Figure 22.2 Sample test plan.
    3. Figure 22.3 Sample log code.
    4. Figure 22.4 Example log file extract.
    1. Figure 24.1 Achieving quality in the production phase.
    2. Figure 24.2 Design issue tracking chart.
    3. Figure 24.3 Reliability growth chart.
    4. Figure 24.4 Reliability growth chart versus predicted.
    5. Figure 24.5 Duane curve.
    6. Figure 24.6 Phase 5 ARG process flow.
    7. Figure 24.7 Typical SPC chart.

List of Tables

  1. Table 5.1 Functional activities for cross‐functional integration of reliability.
  2. Table 6.1 Failures in the warranty period with different MTBFs.
  3. Table 6.2 Advantages of proactive reliability growth.
  4. Table 6.3 RDT multiplier for failure‐free runtime.
  5. Table 6.4 FMMEA for fan bearings (detection omitted).
  6. Table 6.5 Sensors to monitor for overstress in wearout degradation.
  7. Table 6.6 Sensors to monitor bearing degradation.
  8. Table 6.7 Component grade temperature classifications.
  9. Table 7.1 The FMEA spreadsheet.
  10. Table 7.2 RPN ranking table.
  11. Table 7.3 FMEA parking lot for important issues that are not part of the FMEA.
  12. Table 7.4 Common software failure modes.
  13. Table 7.5 Common causes for software failure.
  14. Table 7.6 Failure modes and associated possible causes.
  15. Table 8.1 Agreed upon HALT limits.
  16. Table 8.2 HALT profile for test setup checkout.
  17. Table 8.3 Temperature step stress with power cycle at the end of each step.
  18. Table 8.4 Vibration step stress.
  19. Table 8.5 Temperature and vibration step stress.
  20. Table 8.6 Rapid thermal cycling.
  21. Table 8.7 Slow temperature ramp.
  22. Table 8.8 Slow temperature ramp with constantly varying vibration level.
  23. Table 11.1 CMMI process areas.
  24. Table 11.2 CMMI maturity levels.
  25. Table 11.3 Life cycle comparison.
  26. Table 14.1 Industry standards for managing counterfeit material risk.
  27. Table 15.1 Annual sales dollars relative to typical warranty costs.
  28. Table 15.2 HALT facility decision guide.
  29. Table 15.3 HALT machine decision matrix.
  30. Table 16.1 Reliability skill set for various positions.
  31. Table 17.1 Reliability activities for each phase of the product life cycle.
  32. Table 17.2 Reliability activities – what's required, recommended, and nice to have.
  33. Table 18.1 Product concept phase reliability activities.
  34. Table 19.1 Design concept phase reliability activities.
  35. Table 20.1 Reliability activities for the product design phase.
  36. Table 20.2 Common accelerated life test stresses.
  37. Table 20.3 Environmental stress tests.
  38. Table 21.1 Reliability activities in the design validation phase.
  39. Table 21.2 HALT Profile test limits and test times.
  40. Table 24.1 Reliability activities in the production ramp Phase 5.
  41. Table 24.2 Reliability activities in the production release Phase 6.
  42. Table B.1 Conversion tables for FIT to MTBF and PPM.
  43. Table B.2 Factorials.
  44. Table B.3 Repairable versus nonrepairable systems still operating (in MTBF time units).

Series Editor's Foreword

Engineering systems are becoming more and more complex, with added functions, expanded capabilities, and increasingly complex system architectures. Systems modeling, performance assessment, risk analysis, and reliability prediction present increasingly challenging tasks. Continuously growing computing power shifts more and more functions to software, placing greater pressure on delivering faultless hardware‐software interaction. The rapid development of autonomous vehicles and the growing attention to functional safety bring quality and reliability to the forefront of the product development cycle.

The book you are about to read presents a comprehensive and practical approach to reliability engineering as an integral part of the product design process. Various pieces of the puzzle, such as hardware reliability, physics of failure, FMEA, product validation and test planning, reliability growth, software quality, lifecycle engineering approach, supplier management and others fit nicely into a comprehensive picture of a successful reliability program.

Despite its obvious importance, quality and reliability education is paradoxically lacking in today's engineering curriculum. Few engineering schools offer degree programs, or even a sufficient variety of courses, in quality or reliability methods. Therefore, a majority of quality and reliability practitioners receive their professional training from colleagues, engineering seminars, publications, and technical books. The lack of formal education opportunities in this field greatly emphasizes the importance of technical publications, such as this one, for professional development.

We are confident that this book, as well as the whole series, will continue Wiley's tradition of excellence in technical publishing and provide a lasting and positive contribution to the teaching and practice of engineering.

Dr. Andre Kleyner

Editor of the Wiley Series in Quality & Reliability Engineering

Series Foreword Second Edition

There is a popular saying, “If you fail to plan, you are planning to fail.” I don't know of another discipline in complex product development where this is more true than in designing for product reliability. When products are simple, it is possible to achieve high reliability by observing good design practices, but as products become more complex, and include thousands of components and hundreds of thousands of lines of software, a systematic approach is required.

This has played out inside Teradyne over the last decade through two product lines in our Semiconductor Test Division. One product line, the UltraFLEX Test System, was designed internally. Another, the ETS‐800 Test System, was designed in a company that Teradyne acquired in 2008.

The UltraFLEX platform was designed using Teradyne's internal Design for Reliability standards. The principles embodied in those standards are described by the authors. We religiously used an approved parts list of qualified components and suppliers, we analyzed the electrical stress on every circuit, and we calculated predicted reliability for every instrument and the whole system. Once the system was fielded, we tracked MTBF and executed our failure reporting, analysis, and corrective action system (FRACAS) on repeat failure modes. The result is that the UltraFLEX platform, our most complex product, has a field reliability about three times higher than prior‐generation products. What makes this more remarkable is that the UltraFLEX can test two or even four times as many semiconductor devices in parallel as prior testers.

During the development of the UltraFLEX and over the past decade, we also began to deploy, and came to rely upon, more formal methods to improve software reliability. To be frank, our organizational maturity in software reliability lagged behind our hardware best practices. But through the application of tools like defect models, and especially tracking the reliability of deployed software through automated quality monitors, we were able to improve both the quality of the deployed product and our development methods. A key tool we use to evaluate software reliability is a metric we call clean sessions. A clean session is a session in which an operator starts up the tester, loads a program, executes a task like developing tests, debugging, or just testing devices, finishes the task, and then unloads the program, without encountering any anomalous behavior. When we started tracking this metric at the launch of the UltraFLEX, only about half of the sessions were clean. It took us nearly five years to get to 95% clean sessions, and this has set a benchmark that our competitors struggle to reach. Through the learning achieved in this long struggle, we were able to reach 95% clean sessions within three months of the release of our next‐generation product.
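The clean-sessions metric described above reduces to a simple ratio: sessions completed with no anomalous behavior divided by total sessions. A minimal sketch of that computation (the session record format and field names here are hypothetical, not Teradyne's actual monitoring schema):

```python
# Illustrative sketch of the "clean sessions" metric: a session counts as
# clean only if it completed with zero anomalous events. The dict-based
# session record and the "anomalies" field are assumptions for this example.

def clean_session_rate(sessions):
    """Return the fraction of sessions that logged no anomalies."""
    if not sessions:
        return 0.0
    clean = sum(1 for s in sessions if s.get("anomalies", 0) == 0)
    return clean / len(sessions)

# Example: 19 clean sessions out of 20 gives the 95% benchmark cited above.
sessions = [{"anomalies": 0}] * 19 + [{"anomalies": 2}]
print(f"{clean_session_rate(sessions):.0%}")  # prints 95%
```

Tracked as a run chart over releases, a ratio like this makes software reliability trends visible in the same way MTBF tracking does for hardware.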

The ETS‐800 is the next‐generation version of a successful tester for mixed‐signal and power devices. When Teradyne acquired the business in 2008, there was no formal reliability program in place, but the products were well regarded in the marketplace and reasonably reliable. The ETS‐800 was a big step up in capability from the prior generation. The instruments were two to four times as dense, and the system could support almost twice as many instruments. Further, the tester included a promising new feature that would greatly simplify customer test programs by providing the switching needed to share tester resources between different device pins.

From a functional and performance perspective, the ETS‐800 was a fantastic success. A single ETS‐800 could replace up to eight prior‐generation testers. But we found out the hard way that the informal approach to reliability that worked for simple products did not work for more complex ones. When we initially fielded the ETS‐800, it was not a reliable tester. The weak link in the design was the inclusion of thousands of mechanical relays. These relays provided superior electrical performance, but they are challenging to use from a reliability perspective. Mechanical relays are highly reliable if they are not hot switched, that is, switched while a current is flowing through the contacts. A hot‐switching event causes an arc across the contacts that rapidly degrades the contact surface and shortens the life of the relay. Had the relays been designed for reliability, hot switching could have been avoided. The ETS‐800's reliability was an order of magnitude below that of the much more complex UltraFLEX platform, and this put a blemish on the reputation we had worked hard to build for delivering highly reliable products.

We worked for a long time to improve the robustness of the relays and reduce the occurrence of hot switching, without making much progress. Ultimately we decided to redesign all of the instrumentation using guidelines from the Teradyne reliability system. We are just beginning the deployment of the redesigned instruments, but in side‐by‐side testing they are demonstrating about 100 times higher reliability than the instruments they replace. It was a hard but effective lesson that a systematic approach to hardware reliability and software quality, as the authors have described, is the best way to achieve both high customer satisfaction and good profits.

Gregory S. Smith

President, Semiconductor Test Division

Teradyne, Inc.

Series Foreword First Edition

Modern engineering products, from individual components to large systems, must be designed and manufactured to be reliable. The manufacturing processes must be performed correctly and with the minimum of variation. All of these aspects impact upon the costs of design, development, manufacture, and use, or, as they are often called, the product's life cycle costs. The challenge of modern competitive engineering is to ensure that life cycle costs are minimized whilst achieving requirements for performance and time to market. If the market for the product is competitive, improved quality and reliability can generate very strong competitive advantages. We have seen the results of this in the way that many products, particularly Japanese cars, machine tools, earthmoving equipment, electronic components, and consumer electronic products have won dominant positions in world markets in the last 30–40 years. Their success has been largely the result of the teaching of the late W. E. Deming, who taught the fundamental connections between quality, productivity, and competitiveness. Today this message is well understood by nearly all the engineering companies that face the new competition, and those that do not understand lose position or fail.

The customers for major systems, particularly the US military, drove the quality and reliability methods that were developed in the West. They reacted to a perceived low achievement by imposing standards and procedures, whilst their suppliers saw little motivation to improve, since they were paid for spares and repairs. The methods included formal systems for quality and reliability management (MIL‐Q‐9858 and MIL‐STD‐785) and methods for predicting and measuring reliability (MIL‐STD‐721, MIL‐HDBK‐217, MIL‐STD‐781). MIL‐Q‐9858 was the model for the international standard on quality systems (ISO 9000); the methods for quantifying reliability have been similarly developed and applied to other types of products and have been incorporated into other standards such as IEC 60300. These approaches have not proved to be effective and their application has been controversial.

By contrast, the Japanese quality movement was led by an industry that learned how quality provided the key to greatly increased productivity and competitiveness, principally in commercial and consumer markets. The methods that they applied were based on an understanding of the causes of variation and failures, and continuous improvements through the application of process controls and the motivation and management of people at work. It is one of history's ironies that the foremost teachers of these ideas were Americans, notably P. Drucker, W.A. Shewhart, W.E. Deming, and J.M. Juran.

These two streams of development epitomize the difference between the deductive mentality applied by the Japanese to industry in general, and to engineering in particular, in contrast to the more inductive approach that is typically applied in the West. The deductive approach seeks to generate continuous improvements across a broad front and new ideas are subjected to careful evaluation. The inductive approach leads to inventions and “break‐throughs,” and to greater reliance on “systems” for control of people and processes. The deductive approach allows a clearer view, particularly in discriminating between sense and nonsense. However, it is not as conducive to the development of radical new ideas. Obviously these traits are not exclusive, and most engineering work involves elements of both. However, the overall tendency of Japanese thinking shows in their enthusiasm and success in industrial teamwork and in the way that they have adopted the philosophies of western teachers such as Drucker and Deming, whilst their western competitors have found it more difficult to break away from the mold of “scientific” management, with its reliance on systems and more rigid organizations and procedures.

Unfortunately, the development of quality and reliability engineering has been afflicted with more nonsense than any other branch of engineering. This has been the result of the development of methods and systems for analysis and control that contravene the deductive logic that quality and reliability are achieved by knowledge, attention to detail, and continuous improvement on the part of the people involved. Therefore, it can be difficult for students, teachers, engineers, and managers to discriminate effectively, and many have been led down wrong paths.

In this series we will attempt to provide a balanced and practical source covering all aspects of quality and reliability engineering and management, related to present and future conditions, and to the range of new scientific and engineering developments that will shape future products. The goal of this series is to present practical, cost‐efficient and effective quality and reliability engineering methods and systems.

I hope that the series will make a positive contribution to the teaching and the practice of engineering.

Patrick D.T. O'Connor

February 2003