Anything that people make is going to break. This is inevitable. It's a direct consequence of unavoidable, fundamental laws of the universe.
Part of the art of engineering is to control where, and after how long, things will break. We can't prevent failure entirely, but we can focus our efforts on the most critical items and we can ensure that the complete system – an engine, boat, spacecraft, whatever – is unlikely to fail at all within a certain service life.
I've had this post brewing in my head for a while. Recently, though, the yachting world has been abuzz with discussions of sudden structural failure after the sailing vessel Cheeki Rafiki was lost with all hands in the mid-Atlantic when her keel inexplicably fell off. So this seems like an appropriate time for a discussion of systems criticality.
Mayday versus PanPan
Any complex system is going to have a multitude of possible failure modes. Some are single points of failure, others are backed up by redundancies. Some would spell instant disaster, others can be mitigated by good seamanship, and still others are just a nuisance. Since we do not have infinite resources to throw at all possible points of weakness, we need to prioritize possible failures by their criticality.
There are many popular systems for calculating and classifying the criticality of a potential failure. (Failure Mode, Effects and Criticality Analysis, or FMECA, is an entire professional discipline.) All that analysis is condensed into some form of high-level ranking that helps us decide where our biggest risks, and therefore our biggest priorities, lie. Probably the most widely known ranking system is the one used by NASA, so it's the one I'll use here.
For every possible failure mode, we assign a criticality level. The criticality level is the answer to the question "If this thing fails and shit hits the fan, how bad will it get?" The answer will be one of:
| Criticality | Definition |
|---|---|
| 1 | Single failure which could result in loss of life or vehicle. |
| 1R | Redundant hardware item(s), all of which if failed, could cause loss of life or vehicle. |
| 2 | Single failure which could result in loss of mission. |
| 2R | Redundant hardware item(s), all of which if failed, could cause loss of mission. |
Or, to put it in more nautical terms:
- If this thing breaking would kill someone, instantly sink the ship or otherwise trigger a "Mayday" call, it is a Criticality 1 item.
- If this thing breaking would trigger a "PanPan" call, ending the voyage and leaving you limping back to port (but with your crew alive and well), it is a Criticality 2 item.
- If you could safely continue the voyage after this thing broke, it's a Criticality 3 item.
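The three rules above, combined with the "R" suffix for redundant items, can be sketched as a small decision function. This is only an illustration: the names `Consequence`, `criticality` and `has_redundancy` are made up for this example, and `has_redundancy` should only be true when the backup would genuinely work in practice.

```python
from enum import Enum


class Consequence(Enum):
    """Worst credible outcome if the item (and any backups) fails."""
    LOSS_OF_LIFE_OR_VESSEL = "mayday"  # death, sinking, or a Mayday call
    LOSS_OF_VOYAGE = "panpan"          # voyage ends, crew alive and well
    NUISANCE = "carry on"              # voyage can safely continue


def criticality(consequence: Consequence, has_redundancy: bool) -> str:
    """Return '1', '1R', '2', '2R' or '3' for a single failure mode."""
    if consequence is Consequence.NUISANCE:
        return "3"  # no R variant: the voyage continues either way
    level = "1" if consequence is Consequence.LOSS_OF_LIFE_OR_VESSEL else "2"
    return level + ("R" if has_redundancy else "")


# A twin-engine boat that can get home on one engine:
print(criticality(Consequence.LOSS_OF_VOYAGE, has_redundancy=True))  # prints 2R
```

The function deliberately takes only two inputs, because the rating is meant to depend on consequences and real redundancy, not on how likely the failure seems or how inconvenient the paperwork is.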
It might be tempting to put many, many things in what aviators would call "Safety of Flight", i.e. Criticality 1, but that would be a mistake. Our resources are limited and the entire point of this system is to allocate them appropriately; over-building everything is expensive and leads to a heavy, poor-performing design.
There can also be pressure to move in the opposite direction. More-critical items need much closer scrutiny, and in aerospace work, Criticality 1 items require formal waivers and exemptions, so there is an incentive to rate an item as less critical than it really is. The failure criticality, though, has to be decided based on engineering facts alone. As an example, the motor casing seal that destroyed the Space Shuttle Challenger was listed on paper as being Criticality 1R, even though it was, by that time, well understood that the redundancy in question could only kick in during a specific one-third-of-a-second window in the startup sequence, and then only under a narrow set of weather conditions.
By way of example, then, let's try to categorize just a few of the failure points aboard ship. To do this completely, we'd consider every single thing that could break, and then come up with something like:
System: Main propulsion gearbox.
Failure mode: Loss of gear oil due to seal, gasket or plug failure.
Consequences: Complete loss of main propulsion and destruction of gearbox. Unrepairable while at sea. Probable loss of 2-4 months and $5000.
Mitigation plan: Check gear oil level daily. Check drain plug torque every 100 hours. Check for oil leakage around seals and gaskets after every engine shutdown.
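If you'd rather keep these worksheets somewhere more searchable than a notebook, one entry could be recorded roughly like this. It's a minimal sketch: the `FailureMode` class and its field names are invented for illustration, and the Criticality 2 rating for the gearbox is my assumption for a single-engine boat that relies on its engine.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    """One row of a failure-mode worksheet (field names are illustrative)."""
    system: str
    failure_mode: str
    consequences: str
    criticality: str  # one of "1", "1R", "2", "2R", "3"
    mitigations: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject typos such as "2r" or "4" as soon as the entry is created.
        if self.criticality not in {"1", "1R", "2", "2R", "3"}:
            raise ValueError(f"unknown criticality: {self.criticality!r}")


gearbox = FailureMode(
    system="Main propulsion gearbox",
    failure_mode="Loss of gear oil due to seal, gasket or plug failure",
    consequences=("Complete loss of main propulsion and destruction of "
                  "gearbox; unrepairable while at sea"),
    criticality="2",  # assumption: single-engine boat that relies on its engine
    mitigations=[
        "Check gear oil level daily",
        "Check drain plug torque every 100 hours",
        "Check for oil leakage around seals and gaskets after every engine shutdown",
    ],
)

print(f"{gearbox.system}: Criticality {gearbox.criticality}, "
      f"{len(gearbox.mitigations)} mitigation tasks")
```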
Some other systems to consider:
- Powertrain failures, on boats that rely on engine power, are usually Criticality 2 items. On twin-engine boats that can run on one engine, many powertrain issues might be Criticality 2R.
- Most of the rig hardware on a cruising sailboat should be 2 or 2R. Boats get dismasted all the time and it is folly to believe that we are immune to this risk; it is also folly to design a boat where dismasting would be a "lost at sea with all hands" incident.
- Steering gear should be 2R – sufficient redundancy at all points that the yacht is still steerable with a partial failure, and designed so that even if the steering system is completely destroyed, the boat and her crew can still be kept safe through good seamanship.
- Propane systems are usually 1R: many of their potential problems have "massive explosion" as the end result, so we build in sensors and automatic shutoffs to bring things back under control if there's a leak.
- Keel bolts as individual items are, in a well-designed fin keel yacht, 1R. Failure of any one bolt should not jeopardize the ship, but failure of several could cause the loss of the keel – a sudden, probably unsurvivable event. It is quite possible, through poor design combined with lack of maintenance, to turn the keel bolts into a Criticality 1 system, where any one failure will cause cascading overloads that lead to the loss of the keel and therefore of the ship and crew.
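To see where the attention should go first, the rough ratings from the list above can be collected into a register and sorted. This is a sketch under stated assumptions: the ordering 1, 1R, 2, 2R, 3 is my own choice of sort order, and where the list gives a range ("2 or 2R") I've used the more severe value.

```python
# Most severe first. Assumption: a single point of failure is listed
# ahead of its redundant counterpart at the same consequence level.
SEVERITY_ORDER = ["1", "1R", "2", "2R", "3"]

# Ratings taken from the list above; names are shortened for display.
register = [
    ("Powertrain (single engine)", "2"),
    ("Rig hardware", "2"),  # text says "2 or 2R"; more severe value assumed
    ("Steering gear", "2R"),
    ("Propane system", "1R"),
    ("Keel bolts (well-designed fin keel)", "1R"),
]


def by_priority(items):
    """Return the items sorted from most to least critical."""
    return sorted(items, key=lambda item: SEVERITY_ORDER.index(item[1]))


for name, level in by_priority(register):
    print(f"{level:<3}{name}")
```

Because Python's sort is stable, items with the same rating stay in the order they were entered, so the propane system and keel bolts come out on top, followed by the Criticality 2 items.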
Attention where it's needed
The entire point of this exercise is to figure out where to focus our design, construction, inspection and maintenance resources.
Criticality 1 items absolutely must not fail, ever, during the life of the ship. If they do, someone dies. I am of the opinion that any yacht containing a Criticality 1 system is unsuitable for offshore use. Nevertheless, if a boat does have such a system, it must
- be designed to such high standards of strength that everything else should fall to pieces around it before it fails, and
- be subject to regularly scheduled inspections (by whatever means are necessary) and scheduled preventive maintenance. You do not skip maintenance on Criticality 1 items, ever.
Criticality 1R items don't have the "must never fail" constraint at the design phase, but the redundant system as a whole requires the same level of design integrity, care and maintenance as something with a single point of failure. With the keel bolts, for example, you can only rely on redundancy if the individual pieces are routinely pulled and inspected, and if the design allows for the surviving bolts and their supporting structure to be well within safe limits even if several are fractured.
Criticality 2 and 2R items allow a bit more room to tolerate a failure, and it may be OK to rely more heavily on inspections and scheduled servicing for safety. As an example, most offshore sailboats have their standing rigging inspected in full before every voyage, and the most critical components carefully monitored during the voyage. The entire standing rig is usually replaced after a certain number of years as a precaution. But, if a chainplate or toggle were to fail and bring down part of the rig, there should be enough redundancy and enough mitigation options available to keep the crew safe and get them home.
No, you don't need to do a formal FMECA for your boat. It is important, though – particularly when designing, building or buying a vessel – to carefully consider what the consequences of each possible failure would be, and to plan accordingly.