Failed Promises
n5321 | June 19, 2025, 07:03
For some time now, many of the most prominent and colorful pages in Mechanical Engineering magazine have been filled by advertisements for computer software. However, there is a difference between the most recent ads and those of just a few years earlier. In 1990, for example, many software developers emphasized the reliability and ease of use of their packages, with one declaring itself the “most reliable way to take the heat, handle the pressure, and cope with the stress” while another promised to provide “trusted solutions to your design challenges.”
More recent advertising copy is a bit more subdued, with fewer implied promises that the software is going to do the work of the engineer—or take the heat or responsibility. The newer message is that the buck stops with the engineer. Software packages might provide “the right tool for the job,” but the engineer works the tool. A sophisticated system might be “the ultimate testing ground for your ideas,” but the ideas are no longer the machine’s, they are the engineer’s. Options may abound in software packages, but the engineer makes a responsible choice. This is as it should be, of course, but things are not always as they should be, and that is no doubt why there have been subtle and sometimes not-so-subtle changes in technical software marketing and its implied promises.
Civil Engineering has also run software advertisements, albeit less prominent and colorful ones. Their messages, explicit or implicit, are more descriptive than promising. Nevertheless, the advertisements also contain few caveats about limitations, pitfalls, or downright errors that might be encountered in using prepackaged, often general-purpose software for a specific engineering design or analysis.
The implied optimism of the software advertisements stands in sharp contrast to the concerns about the use of software that have been expressed with growing frequency in the pages of the same engineering magazines. The American Society of Civil Engineers, publisher of Civil Engineering and a host of technical journals and publications full of theoretical and applied discussions of computers and their uses, has among its many committees one on “guidelines for avoiding failures caused by misuse of civil engineering software.” The committee’s parent organization, the Technical Council on Forensic Engineering, was the sponsor of a cautionary session on computer use at the society’s 1992 annual meeting, and one presenter titled his paper, “Computers in Civil Engineering: A Time Bomb!” In simultaneous sessions at the same meeting, other equally fervid engineers were presenting computer-aided designs and analyses of structures of the future.
There is no doubt that computer-aided design, manufacturing, and engineering have provided benefits to the profession and to humankind. Engineers are attempting and completing more complex and time-consuming analyses that involve many steps (and therefore opportunities for error) and that might not have been considered practicable in slide-rule days. New hardware and software have enabled more ambitious and extensive designs to be realized, including some of the dramatic structures and ingenious machines that characterize the late twentieth century. Today’s automobiles, for example, possess better crashworthiness and passenger protection because of advanced finite-element modeling, in which a complex structure such as a stylish car body is subdivided into more manageable elements, much as we might construct a gracefully curving walkway out of a large number of rectilinear bricks.
For all the achievements made possible by computers, there is growing concern in the engineering-design community that there are numerous pitfalls that can be encountered in using software packages. All software begins with some fundamental assumptions that translate to fundamental limitations, but these are not always displayed prominently in advertisements. Indeed, some of the limitations of software might be equally unknown to the vendor and to the customer.
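The finite-element analogy above, building a curved form out of many small and simple pieces, can be made concrete with a brief sketch. The example below is purely illustrative, drawn neither from the essay nor from any particular package: it approximates a curved walkway with straight segments and shows the approximation error shrinking as the segments multiply, a reminder that how finely a model is subdivided is itself one of those built-in assumptions a user may never see.

    # Illustrative sketch only: the "walkway of bricks" idea behind
    # finite-element modeling -- replace a smooth curve with many small
    # straight pieces, and the approximation improves as the pieces shrink.
    import math

    def polyline_length(n_elements):
        """Approximate a quarter circle of radius 1 with n straight segments."""
        step = (math.pi / 2) / n_elements
        points = [(math.cos(i * step), math.sin(i * step)) for i in range(n_elements + 1)]
        return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

    exact = math.pi / 2  # true length of the quarter circle
    for n in (2, 8, 32, 128):
        approx = polyline_length(n)
        print(f"{n:4d} elements: length = {approx:.6f}, error = {exact - approx:.6f}")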
Perhaps the most damaging limitation of any software package is that it can be misused or used inappropriately by an inexperienced or overconfident engineer. The surest way to drive home the potential dangers of misplaced reliance on computer software is to recite the incontrovertible evidence of failures of structures, machines, and systems that are attributable to use or misuse of software.
One such incident occurred in the North Sea in August 1991, when the concrete base of a massive Norwegian oil platform, designated Sleipner A, was being tested for leaks and mechanical operation prior to being mated with its deck. The base of the structure consisted of two dozen circular cylindrical reinforced-concrete cells. Some of the cells were to serve as drill shafts, others as storage tanks for oil, and the remainder as ballast tanks to place and hold the platform on the sea bottom. Some of the tanks were being filled with water when the operators heard a loud bang, followed by significant vibrations and the sound of a great amount of running water. After eight minutes of trying to control the water intake, the crew abandoned the structure. About eighteen minutes after the first bang was heard, Sleipner A disappeared into the sea, and forty-five seconds later a seismic event that registered a 3 on the Richter scale was recorded in southern Norway. The event was the massive concrete base striking the sea floor.
An investigation of the structural design of Sleipner A’s base found that the differential pressure on the concrete walls was too great where three cylindrical shells met and left a triangular void open to the full pressure of the sea. It is precisely in the vicinity of such complex geometry that computer-aided analysis can be so helpful, but the geometry must be modeled properly. Investigators found that “unfavorable geometrical shaping of some finite elements in the global analysis … in conjunction with the subsequent post-processing of the analysis results … led to underestimation of the shear forces at the wall supports by some 45%.” (Whether or not due to the underestimation of stresses, inadequate steel reinforcement also contributed to the weakness of the design.) In short, no matter how sound and reliable the software may have been, its improper and incomplete use led to a structure that was inadequate for the loads to which it was subjected.
In its November 1991 issue, the trade journal Offshore Engineer reported that the errors in analysis of Sleipner A “should have been picked up by internal control procedures before construction started.” The investigators also found that “not enough attention was given to the transfer of experience from previous projects.” In particular, trouble with an earlier platform, Statfjord A, which suffered cracking in the same critical area, should have drawn attention to the flawed detail. (A similar neglect of prior experience occurred, of course, just before the fatal Challenger accident, when the importance of previous O-ring problems was minimized.)
Prior experience with complex engineering systems is not easily built into general software packages used to design advanced structures and machines. Such experience often does not exist before the software is applied, and it can be gained only by testing the products designed with the software.
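The Sleipner investigation involved a far more elaborate three-dimensional analysis than anything that can be shown here, but the general lesson, that a coarse or badly shaped discretization combined with careless post-processing can understate the forces a wall must carry, can be suggested with a toy calculation. The cantilever, the numbers, and the centroid-sampling scheme below are my own illustrative assumptions and have nothing to do with the platform’s actual design.

    # A deliberately simplified stand-in (not the Sleipner analysis): when a
    # smoothly varying internal force is recovered only at the centroids of
    # coarse elements, the peak value at the support is underestimated.
    w, L = 10.0, 20.0  # assumed uniform load (kN/m) and span (m)

    def shear(x):
        """Exact shear force in a cantilever fixed at x = 0 under uniform load."""
        return w * (L - x)

    true_peak = shear(0.0)  # maximum shear, at the support

    for n_elements in (2, 4, 16, 64):
        h = L / n_elements
        centroids = [(i + 0.5) * h for i in range(n_elements)]
        reported_peak = max(shear(x) for x in centroids)
        shortfall = 100.0 * (true_peak - reported_peak) / true_peak
        print(f"{n_elements:3d} elements: reported peak = {reported_peak:6.1f} kN, "
              f"{shortfall:4.1f}% below the true {true_peak:.1f} kN")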
A consortium headed by the Netherlands Foundation for the Coordination of Maritime Research once scheduled a series of full-scale collisions between a single- and a double-hulled ship “to test the [predictive] validity of computer modelling analysis and software.” Such drastic measures are necessary because makers and users of software and computer models cannot ignore the sine qua non of sound engineering—broad experience with what happens in and what can go wrong in the real world.
Computer software is being used more and more to design and control large and complex systems, and in these cases it may not be the user who is to blame for accidents. Advanced aircraft such as the F-22 fighter jet employ on-board computers to keep the plane from becoming aerodynamically unstable during maneuvers. When an F-22 crashed during a test flight in 1993, according to a New York Times report, “a senior Air Force official suggested that the F-22’s computer might not have been programmed to deal with the precise circumstances that the plane faced just before it crash-landed.” What the jet was doing, however, was not unusual for a test flight. During an approach about a hundred feet above the runway, the afterburners were turned on to begin an ascent—an expected maneuver for a test pilot—when “the plane’s nose began to bob up and down violently.” The Times reported the Air Force official as saying, “It could have been a computer glitch, but we just don’t know.”
Those closest to questions of software safety and reliability worry a good deal about such “fly by wire” aircraft. They also worry about the growing use of computers to control everything from elevators to medical devices. The concern is not that computers should not control such things, but rather that the design and development of the software must be done with the proper checks and balances and tests to ensure reliability as much as is humanly possible.
A case study that has become increasingly familiar to software designers unfolded during the mid-1980s, when a series of accidents plagued a high-powered medical device, the Therac-25. The Therac-25 was designed by Atomic Energy of Canada Limited (AECL) to accelerate and deliver a beam of electrons at up to 25 mega-electron-volts to destroy tumors embedded in living tissue. By varying the energy level of the electrons, tumors at different depths in the body could be targeted without significantly affecting surrounding healthy tissue, because beams of higher energy delivered the maximum radiation dose deeper in the body and so could pass through the healthy parts. Predecessors of the Therac-25 had lower peak energies and were less compact and versatile. When they were designed in the early 1970s, various protective circuits and mechanical interlocks to monitor radiation prevented patients from receiving an overdose. These earlier machines were later retrofitted with computer control, but the electrical and mechanical safety devices remained in place. Computer control was incorporated into the Therac-25 from the outset. Some safety features that had depended on hardware were replaced with software monitoring.
“This approach,” according to Nancy Leveson, a leading software safety and reliability expert, and a student of hers, Clark Turner, “is becoming more common as companies decide that hardware interlocks and backups are not worth the expense, or they put more faith (perhaps misplaced) on software than on hardware reliability.” Furthermore, when hardware is still employed, it is often controlled by software.
In their extensive investigation of the Therac-25 case, Leveson and Turner recount the device’s accident history, which began in Marietta, Georgia. On June 3, 1985, at the Kennestone Regional Oncology Center, the Therac-25 was being used to provide follow-up radiation treatment for a woman who had undergone a lumpectomy. When she reported being burned, the technician told her it was impossible for the machine to do that, and she was sent home. It was only after a couple of weeks that it became evident the patient had indeed suffered a severe radiation burn. It was later estimated she received perhaps two orders of magnitude more radiation than that normally prescribed. The woman lost her breast and the use of her shoulder and arm, and she suffered great pain.
About three weeks after the incident in Georgia, another woman was undergoing Therac-25 treatment at the Ontario Cancer Foundation for a carcinoma of the cervix when she complained of a burning sensation. Within four months she died of a massive radiation overdose. Four additional cases of overdose occurred, three resulting in death. Two of these were at the Yakima Valley Memorial Hospital in Washington, in 1985 and 1987, and two at the East Texas Cancer Center, in Tyler, in March and April 1986. These latter cases are the subject of the title tale of a collection of horror stories on design, technology, and human error, Set Phasers on Stun, by Steven Casey.
Leveson and Turner relate the details of each of the six Therac-25 cases, including the slow and sometimes less-than-forthright process whereby the most likely cause of the overdoses was uncovered. They point out that “concluding that an accident was the result of human error is not very helpful and meaningful,” and they provide an extensive analysis of the problems with the software controlling the machine. According to Leveson and Turner, “Virtually all complex software can be made to behave in an unexpected fashion under certain conditions,” and this is what appears to have happened with the Therac-25.
Although they admit that to the day of their writing “some unanswered questions” remained, Leveson and Turner report in considerable detail what appears to have been a common feature in the Therac-25 accidents. The parameters for each patient’s prescribed treatment were entered at the computer keyboard and displayed on the screen before the operator. There were two fundamental modes of treatment, X ray (employing the machine’s full 25 mega-electron-volts) and the relatively low-power electron beam. The first was designated by typing in an “x” and the latter by an “e.” Occasionally, and evidently in at least some if not all of the accident cases, the Therac operator mistyped an “x” for an “e,” but noticed the error before triggering the beam. An “edit” of the input data was performed by using the “arrow up” key to move the cursor to the incorrect entry, changing it, and then returning to the bottom of the screen, where a “beam ready” message was the operator’s signal to enter an instruction to proceed, administering the radiation dose.
Unfortunately, in some cases the editing was done so quickly by the fast-typing operators that not all of the machine’s functions were properly reset before the treatment was triggered. Exactly how much overdose was administered, and thus whether it was fatal, depended upon the installation, since “the number of pulses delivered in the 0.3 second that elapsed before interlock shutoff varied because the software adjusted the start-up pulse-repetition frequency to very different values on different machines.”
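The mechanism Leveson and Turner describe is, at bottom, a contest between a fast-typing operator and a slow setup routine that trusts a stale copy of the prescription. The toy state machine below is my own construction under that reading, not the machine’s actual code; it shows only how a screen and the hardware behind it can come to disagree when an edit goes unnoticed by a setup step already under way.

    # A toy model (my construction, not the Therac-25 software) of the general
    # hazard: a setup routine snapshots the prescription when it starts and
    # never notices a quick edit made while it is still running.

    class ToyConsole:
        def __init__(self):
            self.prescription = None    # what the operator sees on the screen
            self.hardware_mode = None   # what the beam hardware is actually set up for
            self._setup_running = False
            self._snapshot = None

        def operator_types(self, mode):
            self.prescription = mode
            if not self._setup_running:
                self._begin_setup()
            # BUG (deliberate, for illustration): if setup is already running,
            # the edit changes only the screen; nothing restarts the setup.

        def _begin_setup(self):
            # Snapshot the prescription and start the (simulated) slow setup.
            self._setup_running = True
            self._snapshot = self.prescription

        def finish_setup(self):
            # The hardware ends up configured for the snapshot, not for the
            # value now showing on the screen.
            self.hardware_mode = self._snapshot
            self._setup_running = False
            print("beam ready")

    console = ToyConsole()
    console.operator_types("x")   # mistyped full-power X-ray mode
    console.operator_types("e")   # quick correction while setup is still running
    console.finish_setup()        # "beam ready" appears on the screen
    print("screen shows:", console.prescription)       # 'e'
    print("hardware set for:", console.hardware_mode)  # 'x', the mismatch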
Anomalous, eccentric, sometimes downright bizarre, and always unexpected behavior of computers and their software is what ties together the horror stories that appear in each issue of Software Engineering Notes, an “informal newsletter” published quarterly by the Association for Computing Machinery. Peter G. Neumann, chairman of the ACM Committee on Computers and Public Policy, is the moderator of the newsletter’s regular department, “Risks to the Public in Computers and Related Systems,” in which contributors pass on reports of computer errors and glitches in applications ranging from health care systems to automatic teller machines. Neumann also writes a regular column, “Inside Risks,” for the magazine Communications of the ACM, in which he discusses some of the more generic problems with computers and software that prompt the many horror tales that get reported in newspapers, magazines, and professional journals and on electronic bulletin boards.
Unfortunately, a considerable amount of the software involved in computer-related failures and malfunctions reported in such forums is produced anonymously, packaged in a black box, and poorly documented. The Therac-25 software, for example, was designed by a programmer or programmers about whom no information was forthcoming, even during a lawsuit brought against AECL. Engineers and others who use such software might reflect upon how contrary to normal scientific and engineering practice its use can be. Responsible engineers and scientists approach new software, like a new theory, with healthy skepticism. Increasingly often, however, there is no such skepticism when the most complicated of software is employed to solve the most complex problems.
No software can ever be proven with absolute certainty to be totally error-free, and thus its design, construction, and use should be approached as cautiously as that of any major structure, machine, or system upon which human lives depend. Although the reputation and track record of software producers and their packages can be relied upon to a reasonable extent, good engineering involves checking them out. If the black box cannot be opened, a good deal of confidence in it and understanding of its operation can be inferred by testing.
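That closing advice, gaining confidence in a black box by testing it, lends itself to one last small illustration. The “vendor routine” below is a stand-in written for the purpose (a plain numerical integrator), not any real commercial package; the point is simply the habit of exercising opaque software on problems whose answers are known independently before trusting it with problems whose answers are not.

    # Treat the routine as a black box and check it against closed-form answers.
    # The "black box" here is a stand-in of my own, not any vendor's product.
    import math

    def black_box_integrate(f, a, b, n=10_000):
        """Stand-in for an opaque routine: midpoint-rule quadrature."""
        h = (b - a) / n
        return h * sum(f(a + (i + 0.5) * h) for i in range(n))

    # Problems with known answers serve as the test bench.
    checks = [
        ("x^2 on [0, 1]",  lambda x: x * x, 0.0, 1.0,     1.0 / 3.0),
        ("sin on [0, pi]", math.sin,        0.0, math.pi, 2.0),
        ("e^x on [0, 1]",  math.exp,        0.0, 1.0,     math.e - 1.0),
    ]

    for name, f, a, b, expected in checks:
        got = black_box_integrate(f, a, b)
        verdict = "ok" if math.isclose(got, expected, rel_tol=1e-6) else "SUSPECT"
        print(f"{verdict:7s} integral of {name}: got {got:.8f}, expected {expected:.8f}")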