Digital Averaging
Bogus Testing

The Smoking Gun behind the Crash of Flight 587,
Aging Wiring, No Fault Found, and Other
Anomalous Aerospace Mishaps and Accidents

Brent A. Sorensen – Paul W. Sorensen – Gary Kelly
Universal Synaptics Corporation
http://www.usynaptics.com
801-731-8508


ABSTRACT

In the areas of avionics maintenance and compliance testing, two radically different age-related failure mechanisms, out-of-tolerance drift and random intermittency, have for decades shared a common testing treatment. Historically, the idea that one test could cover both was only marginally successful, and as digital measurement processes replaced analog, the inability to detect and repair intermittent defects has become critically alarming.

Cases in point: 1. The decades-old and worsening No Fault Found (NFF) phenomenon, where presently fifty percent of all pilot-reported defects go undetected and therefore unrepaired. 2. The current concern and unprecedented involvement by the White House Office of Science and Technology Policy (OSTP), which recently initiated a government-wide investigation into the problem of Aging Wiring and its effect on our entire defensive, nuclear, communication and transportation infrastructures.

A past rationale for tolerating this poor testing performance may safely be assumed to have been the lack of a proper testing technology devoted to intermittency. However, the required technology now exists, yet it has sparked little interest or implementation at those government agencies tasked to implement strict safety measures and ensure the reliable sustainment of our country’s defensive treasures and transportation infrastructures. They refuse to consider a need for intermittency testing.

Comparatively speaking, a new rationale has emerged in the testing, avionics, and safety communities, one that collides with the OSTP goals and continues to justify, and often even promote, the continued use of outdated legacy testing methods. This derelict "do nothing at all costs" rationale is guaranteed to make aircraft and avionics systems less reliable, less safe, and much more expensive than they should be.

One recent direct example of this error in testing is showcased in the NTSB’s investigation into the crash of Flight 587 in New York City on November 12, 2001, which killed all 260 people on board and five more on the ground, 265 in all. The American Airlines A300 may have actually crashed due to a series of apparently known but unrepaired intermittent glitches in the rudder control circuitry.

Another example is NASA, which, though outwardly the world’s most technologically advanced scientific research organization, has suffered from a long history of intermittent, wire-related malfunctions and disasters. Possibly its most recent wire-related testing and reliability oversight is the "statistically significant", near back-to-back misfires of critical explosive bolt charges. Explosive bolts have failed to fire during critical Shuttle launch procedures, yet NASA managers steadfastly refuse to update their technologically obsolete testing procedures and equipment in spite of the Columbia Accident Investigation Board’s recommendations to do so (see CAIB chapter 10.9).

Incremental "technology-creep", buried deep in the technology and firmware of digital electronic test instruments, has made them more accurate but has at the same time created a massive testing blind spot, not only in what we can measure but also apparently in what we believe is important to measure. This test engineering error is seen in the form of massive No Fault Found (NFF) rates of 50-85% that can consume 90% of repair budgets and waste billions in fruitless repair efforts, while at the same time compromising safety and exposing us to a wide variety of consequential risks. Still, amazingly, no one is overtly concerned with the consequences or the momentum of this unfounded addiction to higher-accuracy instruments and old, outdated philosophies of testing.

In this paper we examine some of the technical and political root causes of this age-related, intermittent "NFF" testing void, as well as introduce a new testing technology that delivers not another "Band-Aid" patch to a broken and outdated testing system, but a whole new avionics maintenance health-care package via direct reliability testing.
 

INTRODUCTION

In the avionics test and evaluation realm, "higher" accuracy is too often assumed to be synonymous with higher quality, especially in the selection of test measurement devices, where digital instruments now completely dominate their so-called poorer cousins, analog devices.

Quality in a measurement, however, has much more to do with delivering accurate and useful information about the true condition of the device or system under test rather than simply delivering more and more digits to the right of the decimal place.

David Evans, safety editor for Avionics Magazine*1, reports that in their investigation of the crash of Flight 587, the NTSB examined the contents of the Digital Flight Data Recorder (DFDR) and discovered that expected information on known glitches, which likely caused the aircraft’s rudder to oscillate and tear off, was missing on some of the channels. This data was missing due to a process they described as "digital averaging", which was employed in this case by the manufacturer of the DFDR in an effort to conserve internal memory resources.

Evidently, the signals of interest were digitally sampled less often than was needed to fully see the violent swings in the rudder control system. To further compound the problem, it was also discovered that the data that was available on the recorder had also been averaged before being stored, thereby rendering it highly inaccurate. The NTSB says that due to the manner in which the data was processed, the critical information about these glitches was lost.
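
The effect is easy to reproduce with made-up numbers. The short Python sketch below is only an illustration, not a model of the actual recorder: the signal values, the five-sample glitch, the block size of 100, and the helper names block_average and decimate are all assumptions chosen for the example. It stores a signal two ways, by keeping one sample in every hundred and by keeping the mean of every hundred, and compares the largest value that survives in each record.

```python
def block_average(samples, block_size):
    """Return the mean of each consecutive block of `block_size` samples."""
    return [
        sum(samples[i:i + block_size]) / block_size
        for i in range(0, len(samples) - block_size + 1, block_size)
    ]


def decimate(samples, step):
    """Keep only every `step`-th sample (a lower effective sampling rate)."""
    return samples[::step]


# Hypothetical raw signal: 1,000 samples at 0.0 with one brief swing to 10.0.
raw = [0.0] * 1000
for i in range(510, 515):  # assumed 5-sample glitch
    raw[i] = 10.0

decimated = decimate(raw, 100)    # one sample kept per 100
stored = block_average(raw, 100)  # one averaged value kept per 100

print("peak in raw signal:      ", max(raw))
print("peak in decimated record:", max(decimated))
print("peak in averaged record: ", max(stored))
```

Under these assumptions the raw signal peaks at 10.0, the decimated record never sees the glitch at all, and the averaged record shows a peak of only 0.5, a value that would pass almost any tolerance check.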

Tragically, in spite of a huge body of evidence to the contrary*2, the FAA as well as many others responsible for improving aircraft safety, have argued for years that "glitching" or age-related intermittency, or what is better known in the industry as No Fault Found (NFF), does not cause accidents, that it does not hazard the aircraft in any way, and that it is simply a maintenance nuisance only. Ironically, the aircraft’s own DFDR, by failing to measure and record certain data on these intermittent glitches, demonstrated similarly and conclusively why these dangerous anomalies are not detected and repaired during maintenance and preflight safety testing as well. The digital measurement instruments used during testing also employ data averaging internally, in their embedded algorithms, to help achieve their high levels of accuracy. Intermittent glitches are simply not seen by these kinds of devices.

The NTSB says it is concerned about the missing data on the DFDR because first, it makes their investigative work more difficult, and secondly, they thought FAA rules were already in place to stop the practice of data averaging on these digital recording devices. The scope of the problem however goes far beyond worrying about what is or what is not recorded on any aircraft’s DFDR!

The unstated but critically important inference is that, with the expectation of such a safety rule by the NTSB and the efforts by the FAA to stop the practice of digital averaging, both agencies acknowledge that averaging or blindly discarding safety-critical data is a practice that must stop. Their stated after-the-fact objectives, however, do not diminish the need to stop such practices before the fact as well. The most important goal is to keep aircraft safely in the air; investigating those unfortunate failures is secondary.

After examining the previous repair history of Flight 587’s rudder control system, Mr. Evans reported*3 that it is highly probable these "glitches" had been present, and their related failure effects seen, at least a dozen times over the 12-month period prior to the crash. In fact, the same or a similar, but progressive, failure required a reset of the rudder system computer once again just minutes before the fatal take-off.

The relative benefits of discovering these latent intermittent anomalies through testing and fixing them before flight, as opposed to theorizing about them after a tragic and preventable crash, make this testing-void oversight a matter of prime importance and a critical safety consideration that cannot be overemphasized. What value is there, really, in spending millions on investigative efforts to once again "discover" that yet another wire caused yet another accident, when all 5,000 wires in an aircraft can put it at risk, and they are all aging?

The DFDR, acting as a digital measurement and logging device, in fact reported no problems at the same time the rudder was ripping itself off the aircraft!

What the NTSB and the FAA are failing to grasp in their crash analysis is that digital averaging, employed heavily during the testing processes, also masks these age-related intermittent defects and is itself then the root cause of a very large portion of these accidents!

The point with which everyone who flies should be concerned is why these safety agencies don't seem to be too interested or concerned about looking a little deeper into the root of the problem, the intermittency testing-void problem, and then fixing the part which is broken.

The DFDR of course, is not being blamed as the cause of the crash, but it demonstrates conclusively how the process of digital capturing and subsequent mishandling of critical time-dependent data can completely skew the information on which modern avionics control and computer systems rely for reliable and safe operation.

Missing Data, An oversimplification of the problem:

While the DFDR was indeed missing some critical data due to averaging, in reflecting about the aircraft and thinking in terms of computerized flight control systems and digital-based logic, one quickly realizes there is no such thing as "missing" data. Everything is either a one or a zero, a true or a false. Any "data" in a digital flight system that is "missing" due to an intermittent "glitch" will likely be interpreted as either an "on" or an "off" signal or, in the case of Flight 587, a rudder-left or a rudder-right command. A false "up" or false "down" in a flight computer’s logical abstraction system can be much more than just an annoyance to some unfortunate, over-worked accident investigator. It can also have a considerable impact on whether a couple of hundred passengers and crew have a "good flight" or not. "Missing data", whether occurring in flight or at test time, is serious business and as such, steps should be taken to prevent it.

A WAKE UP CALL

Logically then, it would seem to follow that if digital averaging is not a good practice to employ inside DFDRs, it ought not to be employed in other critical areas such as preflight and maintenance testing as well. Otherwise, "averaged-out" intermittent defects in the wiring, flight control boxes, or Line Replaceable Units (LRUs) at test time remain unknown and on-board to repeatedly hazard the aircraft.

Averaging any testing data obliterates the true facts as effectively as if someone shreds documents critical to the outcome of a high-level investigation. What transpired with Flight 587 should serve as a major wake-up call for the NTSB, FAA, NASA, DOD, Airlines, Aircraft Manufacturers and others to finally look into the testing-void problem in a more concerned, determined and all-inclusive fashion.

While recently "Aging Wiring" has been the popular catch-phrase for many of these recent crashes and more numerous "safety incidents", the industry has balked at coming to grips with the closely related and more serious No Fault Found (NFF) issue. The excuse for this lack of performance is claimed in large measure to be due to a lack of hard data on the problem.

The reality is that there is plenty of data available; one just needs to recognize it for what it truly is and tie together the various pieces to solve the so-called No Fault Found (NFF) puzzle.

In the keynote speech at the 1995 Airline Maintenance Conference, James L. Pierce, Chairman and CEO of ARINC, said in part, "Despite significant improvements in mean time between failures of current avionics and the introduction of more comprehensive and accurate built-in-test, the No Fault Found (NFF) ratio stubbornly hovers in the 50% area. This in effect halves the potential benefit of the technical advances made in avionics." (And obviously a whole lot more.)

Air Force Aging Aircraft researchers collected the data on the left side of the following pie chart as their view of what kind of connectivity elements they "find" breaking down at test time. Their chart failed to acknowledge, however, that they also find many No Fault Found results in the "testing process", so we have added the missing NFF half to give a more comprehensive picture of just where this so-called "missing" test data is.

Chart 1: "Who ATE the other half of the Pie?"

Obviously, when a test fails to find an intermittent component, there is no "component" to log into the historical record and with no "component" to log-in, the original failure mode encountered is not logged in either. The "data" that everyone is looking for is there, it is just not understood, reported, tracked or correlated correctly. Most good technicians of course know about these issues, and deal with them as best they can with the limited testing tools and authority they are given. One of the reasons they do not get the right tools is because the money managers will not spend money on a problem that does not generate reflective NFF data and as they circularly reason, the problem therefore must not exist.

SETTING PRIORITIES

Those individuals whose responsibility is to make flying on older aircraft safer need to be consistent and stay focused on the facts. The Bogus Parts issue, a problem geared mostly towards the potential risks associated with hardware components and missing inspection paper trails, gets rather serious attention with hefty fines and even jail time for those not concerned with safety and reliability. The companion safety effort looking at aging wiring, a concern again directed towards the potential risks associated with electrical and avionics safety, also gets a lot of organizational attention and funding. Incredibly, however, the more serious NFF problem, where 50% of all pilot-reported flight malfunctions are NOT resolved during ground-based testing, receives little or no concern!

With NFF you have not just a risk, but an actual, demonstrated malfunction of a component in the air, and a demonstrated inability to detect or fix that failure on the ground, and an officially sanctioned permission to reuse that part on the aircraft, and a paper trail documenting the whole process, and all the while, technical solutions exist to stop it!

Maybe if we called the NFF problem a "Bogus Testing" problem, which it certainly is, and instigated the same penalties, results would soon be forthcoming.

In reality and as a bonus, the root cause of both the "aging wiring" safety concern and the neglected, mysterious NFF "maintenance issue", as well as a large percentage of the so-called I²R "wire fires", is the same. It is simply a continuum of aging processes, misapplied testing equipment and legacy testing practices that fail to detect the underlying problem. Which is better then, to fix the immediate problem of inadequate testing or the problem of how to make wiring insulation that lasts longer? The proper choice should be clear. Without first eliminating the bottleneck of improper testing, how can one gauge the benefits or success of any other possible efforts?

To continue to ignore or deny the fact that NFF intermittency exists or to do business-as-usual, performing precision bench or depot testing on safety-critical systems exclusively with instruments that, like Flight 587’s DFDR, average these age-related intermittent glitches out of existence at test time, is a serious safety and technical oversight and clearly is just plain wrong!

It bears repeating: Digital averaging, of the exact variety found on Flight 587’s DFDR, about which the NTSB and the FAA are concerned, is also employed heavily (almost exclusively) in the testing of virtually every aircraft avionics system and component. This includes on-board and pre-flight testing, on-the-ground depot testing with manual and automatic test equipment, and perhaps more critically, laboratory specification and validation testing as well. One might want to take a closer look at how a supposed 100,000-hour MTBF component really got its ratings.

As an example, when avionics design engineers select the components that will be used to build their system, they make informed decisions based on those components’ published specifications. Those specifications may have been developed inside a laboratory-type testing facility and as such have generally been considered to be above reproach.

However, in the case of connectors, switches and other intermittent-by-design connectivity components, one of the specifications an engineer looks at is the insertion or use rating. The insertion ratings are often based on the number of times two pins from the connector can be "rubbed" together until some ohmic value of fretting resistance builds up*4. To measure this "one-ohm" standard, the rubbing motion may be stopped every 1,000 cycles or so, and a high-precision digital measurement (averaging) device will be used to take the measurement, but it is generally taken in a static, non-stress mode. The point in the rubbing life process at which the pins actually became intermittent, and therefore unreliable, is unknown and unpublished using the present testing methods. Few if any are concerned with, or are testing for, the real or dynamic failure-mode aspect of safety and reliability.

Some enlightened companies, performing Highly Accelerated Life Testing (HALT) and using special analog test equipment, are reporting that tests scheduled to run for several days, based on component specifications, MTBF and other historical reliability estimates, are often failing relatively early due to rampant and unexpected intermittencies in certain connector assemblies. (See the Evaluation Engineering article "The Achilles Heel of Modern Electronics":
http://evaluationengineering.com/archive/articles/0604/0604modern_electronics.asp)

The connectivity systems found in aircraft such as wiring, connectors, crimps, relays, circuit breakers, solder joints, wire wraps, etc., are in fact electro-mechanical devices. Therefore, like machinery, they will naturally age or wear out over time as a result of mechanical wear such as "rubbing" or fretting motions at the molecular level due to ambient vibration and thermal expansion, as well as a variety of environmental corrosive factors.

When they first begin to age, they gradually become more and more unstable, noisy, or intermittent due to the rather happenstance mating of the micro hills and valleys found on the contacting surface areas as environmental dithering causes these mating surfaces to slip, or expand/contract past each other. It is this slipping action, from a sufficient number of microscopically good contacting points, across or landing on an area of insufficient, ohmic, or contaminated contacting points, that causes the intermittency. The end result is that at test time, a connectivity element can measure OK ohmically one moment, yet easily fail intermittently the next under vibration, stress, or thermal expansion and contraction.

Single or even repetitive digital testing just cannot test adequately for both hard and intermittent failure states. The end result is that engineered service life and MTBF-type ratings based on incomplete testing should be severely reduced, and any mandated periodic maintenance intervals based on overstated specifications should be reevaluated.

It Gets Worse

Because of the necessity of a multilevel maintenance model associated with the aircraft industry, the diagnostic choice for all on-board failures both hard and intermittent is to first just reset any tripped circuit breakers or replace computerized Line Replaceable Units (LRUs). Secondly, depending on the results from the first effort, the technician may next replace any further suspect avionics "boxes" with "believed to be good" units from the spares pool. The problem here, and with any multilevel maintenance environment, is that the root cause defect may "reset" or it may be separated from the rest of the system before any conclusive diagnosis is achieved and verifiable repairs made.

If the pilot-reported problem turns out to be intermittent or NFF, both the LRU and the aircraft wiring should be comprehensively tested separately, to actually isolate the source of the problem at least to one system or the other. However, if the only "test equipment" employed to test either system is the replacement LRU from the spares pool, it had better be perfect.

If this replacement LRU is part of an older aircraft system and that system is experiencing 50% NFF rates, the replacement LRU obviously has a 50% chance of being an unrepaired previous NFF swap-out from another aircraft, and there goes any perceived so-called "quality" in the testing standards.

This scenario helps to illustrate how the rudder control computer on Flight 587 happened to be replaced 12 times in the 12 months before the crash with apparently no repairs to the underlying root cause. Intermittent anomalies simply stay in the aircraft or LRU until they eventually turn into semi-permanent hard failures, repeatedly hazarding the aircraft as effectively as the more media-popular insulation-related shorts or arcing problems.

Obviously, at no place in the multilevel maintenance cycle is there a specific, qualified test for intermittency, even when the problem is known to be of an intermittent and therefore digitally untestable nature!

The Imbalance in Intermittent Fault - Test Equipment

Chart 2: The Imbalance in Diagnostic Testing Equipment

Of course, with maintenance testing systems so full of systemic testing and maintenance voids like this, there are going to be a lot of NFFs and an aging problem! This is, in fact, why avionics systems wear out: with a long history of never repairing these intermittent problems, they eventually become too unreliable for sustained use.

The "problem" behind the "aging problem" then is not one of aging per se; everything ages. Rather, it is a problem of NOT testing directly for the signs and effects of aging. We test a connectivity component's parametric values (ohms), which may include some small degree of random intermittency, but we test it with "hard-failure" equipment only. We test exclusively with equipment that averages these intermittent fluctuations right out of the testing record. Like an ostrich sticking its head in the sand so it can't see the dangers lurking about, not detecting these defects does not mean they do not exist!

An Historical Perspective

The issue with present testing methods concerning high NFF rates is that over the last 30-40 years, the trusty old analog meter has evolved to become a highly sophisticated, 8-digit, digital measurement device, with phenomenal accuracy, at least when used in a stable environment. Possibly you need this high accuracy to detect out-of-tolerance resistors, capacitors, and other passive devices whose internal characteristics tend to drift with age, but the electro-mechanical devices need to be tested completely differently.

As with most good things in life, high accuracy comes with a penalty. The penalty for aging aircraft electrical and avionics systems is that as accuracy in these measurement devices has increased, mostly through various techniques of digital averaging, these devices have lost their ability to see age-related failure modes such as random intermittency or glitches, the primary failure mode of electro-mechanical components. To exacerbate the situation, during this same period of transition, aircraft designs have incorporated more and more electronics with more connectivity elements to break down, and often these digital-based and computerized avionics technologies are more susceptible to failure due to glitching than were their analog counterparts.

An important question to consider is: With aging of the connectivity elements creating unique testing, reliability and safety risks, doesn't it make sense to adopt a little testing flexibility? Instead of testing exclusively for engineering parameters for the entire life of the system, start testing directly for the aging related failure modes when the NFF rates show an increase due to the age of the system.

The testing of engineering parameters (ohms, volts, capacitance, etc.) is simply a functional test of the circuit's operation at that exact moment in time; it tells you nothing of its ability to maintain these parameters over an extended period (even during the next flight), or in an unstable environment. In contrast, testing for age-related failure modes (random noise, glitches, and intermittent discontinuity) allows you to discover and fix any NFF latent defects not otherwise detectable, as well as any developing defects that may not yet have caused a system to fail. This new "deep-look" or prognostic capability has the end effect of actually "renewing" these Aircraft and other systems to a previous level of reliability and provides a measure of confidence that it will continue to perform like new for some extended period of time.

Intermittency connotes two states of being: One finite, or the good state, and the other infinite, or everything else — and that can cover a whole lot of presently untested territory.

Measurement Tradeoffs

In the selection of testing equipment to be applied to the testing of airborne devices and systems, there are measurement tradeoffs that have not generally been properly considered in light of age-related failure mechanisms. In nearly all cases, the more accurately a meter can split a fraction of an ohm or other quantity, the slower it is likely to operate and therefore, the less likely it will be able to respond to intermittency or glitches in any meaningful way.

The two desired measurement goals, high speed for glitches and high accuracy for the parameters, occupy opposite ends of the measurement spectrum. Hitting either end exclusively has its own opposing penalty.

The quick answer by testing pundits to the urgent and universal lack of "glitch" detection, "just raise the digital sampling rate", illustrates in some measure the technical ignorance of the problem as well as the cultural prejudice*14 against even the possibility that a much better analog solution might exist. The unlikely logistics of being able to raise the sampling rates on $50 Billion in DOD-installed testing inventory are further compounded technically in the ohmic measurement of long runs of aircraft wiring, where the natural capacitance between wires or between wire and aircraft frame, or any explicit analog filtering, limits the actual testing rates to rather low values. This ambient capacitance must be charged or recharged each time before a measurement is made. Further compounding the problem, if multiple points in a wire bundle need to be measured over a period of time, certain switching delays will also apply, again severely limiting the testing rate. Even if you could double or triple the measurement device’s sample rate, the actual benefit when testing multiple wire systems would be so small as to generally not be worth the effort, and may actually introduce other difficulties.
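
A rough calculation shows why the sample rate is not the bottleneck. The sketch below treats each wire as a simple first-order RC circuit, which is itself an assumption, and every number in it is hypothetical: a 1 kilohm source resistance, 10 nF of wire capacitance, settling to one part per million, a 1 ms switching delay per test point, and a 10 microsecond conversion time. The 5,000-wire count is the figure used earlier in this paper.

```python
import math


def settling_time(r_ohms, c_farads, tolerance):
    """Seconds for a first-order RC circuit to settle to within
    `tolerance` (as a fraction) of its final value: t = R*C*ln(1/tol)."""
    return r_ohms * c_farads * math.log(1.0 / tolerance)


# All values below are assumptions chosen only for illustration.
R_SOURCE = 1_000.0    # ohms, source resistance of the meter
C_WIRE = 10e-9        # farads, capacitance of a long wire run
TOLERANCE = 1e-6      # settle to one part per million
SWITCH_DELAY = 1e-3   # seconds to switch to the next test point
ADC_TIME = 10e-6      # seconds for one conversion
N_WIRES = 5_000       # wires scanned one point at a time


def scan_period(adc_time):
    """Seconds needed to visit every wire once."""
    per_wire = SWITCH_DELAY + settling_time(R_SOURCE, C_WIRE, TOLERANCE) + adc_time
    return N_WIRES * per_wire


base = scan_period(ADC_TIME)
faster = scan_period(ADC_TIME / 2)  # "double the sample rate"
improvement = (base - faster) / base

print(f"scan period:              {base:.3f} s")
print(f"with conversion 2x faster: {faster:.3f} s")
print(f"improvement:              {improvement:.2%}")
```

With these assumed values one pass over the bundle takes about 5.7 seconds, and halving the conversion time shortens it by less than half a percent, because settling and switching dominate.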

More importantly perhaps than accuracy and high sample rates, intermittents happen randomly in time, or when they "feel" like it, and also randomly in degree of severity even on the same wire. A one-microsecond, 10-ohm, one-shot glitch on the ground might easily become a one-second wide open or short in the air, or a fast-fluctuating, "good-bad" state due to the additional environmental stresses and extremes that are naturally encountered there.

A DEGREE OF CONFIDENCE

Every good Test Development or Reliability Engineer is aware of the term "Degree of Confidence", where as an example, they can look at a circuit topology of digital logic gates, consider the number of possible ways that the circuit could fail, such as "stuck at 1" or "stuck at 0", and then use that information to develop digital test patterns or vectors to test the card for each particular possibility. When all possibilities of failure are covered by a test, there is a high degree of confidence that no matter how the card fails digitally, the test program will identify it as a failure as well as identifying the failing component.
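
For readers unfamiliar with the idea, the toy Python sketch below shows how such coverage can be counted for stuck-at faults. The three-input circuit y = (a AND b) OR c, the node names, and the functions simulate and coverage are invented for illustration; real ATE fault simulators are far more elaborate. Coverage here is simply the fraction of single stuck-at faults for which at least one test vector produces an output different from the fault-free circuit, which is one common way to put a number on the confidence described above.

```python
NODES = ["a", "b", "c", "n1", "y"]  # n1 is the internal AND-gate output

# Every single stuck-at fault: each node stuck at 0 or stuck at 1.
ALL_FAULTS = [(node, stuck) for node in NODES for stuck in (0, 1)]


def simulate(a, b, c, fault=None):
    """Evaluate y = (a AND b) OR c, optionally forcing one node to a stuck value."""
    def apply(name, value):
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    a = apply("a", a)
    b = apply("b", b)
    c = apply("c", c)
    n1 = apply("n1", a & b)
    return apply("y", n1 | c)


def coverage(vectors):
    """Fraction of stuck-at faults exposed by at least one test vector."""
    detected = [
        f for f in ALL_FAULTS
        if any(simulate(*v) != simulate(*v, fault=f) for v in vectors)
    ]
    return len(detected) / len(ALL_FAULTS)


one_vector = [(1, 1, 0)]
four_vectors = [(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]

print("coverage with one vector:  ", coverage(one_vector))
print("coverage with four vectors:", coverage(four_vectors))
```

In this toy case a single vector exposes 4 of the 10 possible faults, while four well-chosen vectors expose all 10.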

In the case of highly complex digital circuit cards, massive software algorithms running on large main-frame computers are employed to model the card and help develop the digital test vectors or patterns necessary to ensure that every digital gate is exercised and that all failures are propagated to an external pin where they are visible to the ATE employed to test the card. Typically, Degrees of Confidence of 95% and higher are sought after and most often achieved on these active elements. A complete test may consist of thousands of test vectors or patterns to be applied as stimuli. Achieving these high levels of confidence often takes hundreds or thousands of engineering hours and is the main driver responsible for the high costs of ATE programming. These high levels of confidence are of course important and necessary to maintain high levels of operational readiness and confidence in the larger system.

Scientists and Reliability experts, using statistical or probability science have developed the means to compute the confidence level of an analog test or system as well. Usually it is a pretty straightforward assignment.

However, for wiring and other connectivity elements the "Degree of Confidence" determination becomes somewhat problematic. Rather than a measurable, predictable or linear degradation over time, aging generally produces degradation in performance that is not linear, but intermittently chaotic and random, and often a function of the micro environment in which each component is located. If this were not complicated enough, you also have to factor in the probability of detecting these random analog "events" with the particular brand and model of digital test equipment being used.

When designing a testing program for an avionics LRU, it is deemed necessary to have at least a 95% Degree of Confidence in the testing of all active logic circuitry. However, for the connectivity components that "glue" the logic system or LRUs together, the true but apparently "acceptable" Degree of Confidence is seldom as high as even a fraction of a percent.

The result is that for hard failures, which account for about 50% of all failures, the testing, diagnostics and fault modeling tools and equipment are developed and in place, and when used properly deliver high levels of confidence to the testing and maintenance process. In contrast, for the other half of the total failures, the intermittent NFF failures that may in fact consume 90% of sustaining costs, one will never come close to the same performance, even theoretically, using any type of digital measuring equipment. The numbers and the science simply do not support it!

And here is the crux of the reliability or NFF problem: When it comes to the testing of the passive connectivity elements, mathematics and science get tossed out, in favor of erroneous belief systems based on what kind of testing has been "acceptable" in the past. Even though a testing-void problem may be recognized, once a process is firmly entrenched in an organization, no one wants to be the messenger that "agitates" for a change.*14

The FAA’s recommendation for dealing with intermittency is, after any repairs, to just perform repetitive testing, but the missing factor is: to what degree of confidence? How long do you have to repetitively test a wire or circuit digitally, one-point-at-a-time, to assume with some degree of confidence that you will have seen any or all anomalies that might still be lurking about? To reach the more-or-less de facto 95%, you may have to repetitively test for days, months, or even years, depending on exactly how and with what you do the testing.

Even in measurement systems that employ fairly high sample rates, about 1/3 of the time might be spent measuring while the other 2/3rds will be spent analyzing, storing or displaying the data. While performing these data processing tasks, digital measurement devices are of course totally blind as to what is going on glitch-wise.

To make matters worse, some laboratories doing specification testing may test a single line digitally, millions of times, over a period of several days, and if any failures, intermittent by definition, are detected they then apply statistical bell curves to the failure data to eliminate a percentage of these. Their rationale for discarding what is most likely actual failure data is that they assume, "through experience", that some may be false failures generated by the measurement device's own internal operations. Perhaps this is because the failures they do see are random and not repeatable; which is exactly the way they appear when they are real.

With glitch durations as short as a microsecond being considered risky or even unacceptable, and some properly equipped connector testing labs reporting electro-mechanical-based intermittencies down to a few nanoseconds*5, what testing standards should be applied to obtain the necessary degree of confidence for safety and reliability? When the FAA says all that is required in the way of testing, after repairing an intermittent problem, is to "repetitively test the wire", what exactly are the expected performance criteria? What digital sampling rate should we apply, and for how long: a second, an hour, a week? Was the failure in a system or a specific wire? Do we need to "repetitively" test one wire or maybe hundreds of possible suspect wires? The standards are obviously a little loose and vague in this regard.

Digital Averaging can be looked at as occurring by one of three methods: inherent, explicit and spatial.

The first type, "inherent", is incorporated in the digital technology itself. In an analog system, all noise accumulates over time on the signal being measured, and the result is that the accompanying display or other output device will respond in lock-step to that noise. Digital equipment, however, samples the signal of interest and only accumulates the noise present just at that brief instant of sampling. We don't have to do anything, really; just take a reading or sample and convert it to a number or other signal representation for "averaging" to occur. Once the real-world signal has been captured and converted to its digital representation, we have absolutely no idea as to what is occurring with the real signal between measurements.

At the heart of most sampling schemes is a "sample and hold" capacitance that on the first part of the two-part procedure acquires a sample of the signal to be measured over some finite period. Any variation of the measured signal during this charge period is averaged out through the filtering effect of the "hold" capacitance. On the second part of the procedure, the "averaged information" that is being held in the capacitance is of course stable and can now be accurately processed and eventually delivered to the user. What goes on with the real signal during processing of the original sample is of course lost, or "averaged" out.
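The averaging effect of the acquisition window can be illustrated with a minimal sketch. It idealizes the "hold" capacitance as a plain time-average over the window (a real sample-and-hold follows an RC charging curve, so this is an assumption), and the function name and parameters are hypothetical.

```python
def held_reading(v_nominal: float, v_glitch: float, window_s: float,
                 glitch_start_s: float, glitch_width_s: float) -> float:
    """Idealized held value: the time-average of the input over the
    acquisition window [0, window_s]. The input sits at v_nominal except
    during the glitch, when it sits at v_glitch."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    glitch_end_s = glitch_start_s + glitch_width_s
    # Portion of the glitch that falls inside the acquisition window.
    overlap_s = max(0.0, min(window_s, glitch_end_s) - max(0.0, glitch_start_s))
    fraction = overlap_s / window_s
    return v_nominal * (1.0 - fraction) + v_glitch * fraction

# A 4.0 V line dropping to 0 V, acquisition window of 100 ms (assumed value):
print(held_reading(4.0, 0.0, 0.100, 0.050, 0.100))  # glitch covers half the window
print(held_reading(4.0, 0.0, 0.100, 0.020, 0.001))  # 1 ms glitch inside the window
print(held_reading(4.0, 0.0, 0.100, 0.200, 0.100))  # glitch after the window closes
```

In this idealization a glitch that only partly overlaps the window produces an intermediate value that matches neither the normal level nor the glitch level, and a glitch that falls entirely outside the window leaves no trace at all.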

Note: A special type of sampling is often employed, loosely called "Nyquist" sampling here, where the sampling rate is varied to prevent certain AC signals, in lock-step with the sampling frequency, from being sampled at the same point on the waveform each cycle and returning the erroneous conclusion that the meter is looking at a DC signal (an aliasing effect). In this discussion, this problem does not apply since most age-related intermittent signals are of a one-shot, chaotic variety, rather than being repetitive.

To further refine the accuracy of the digital instrument, "explicit" averaging is employed. Here the measurement device may take 10, 20 or a hundred "inherent" samples and, under an internal "program" directive, divide the accumulated total by the number of samples taken. This removes or filters out any supposedly undesired but natural instability that the basic measurement process itself generates, as well as the variations or glitches that may happen to occur in the signal being measured.

Both methods work well to deliver what they call "high accuracy" readings, but only when the signal being measured is itself stable. *6
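The diluting effect of explicit averaging on an unstable signal can be sketched in a few lines. The sample counts follow the 10-to-100 range mentioned above; the function name and the assumption that exactly one sample lands on the glitch are ours, for illustration only.

```python
def averaged_reading(v_nominal: float, v_glitch: float,
                     n_samples: int, n_hit: int) -> float:
    """Value reported after explicit averaging when n_hit of the n_samples
    happen to land on the glitch and the rest land on the nominal level."""
    if not 0 <= n_hit <= n_samples or n_samples == 0:
        raise ValueError("need 0 <= n_hit <= n_samples and n_samples > 0")
    samples = [v_glitch] * n_hit + [v_nominal] * (n_samples - n_hit)
    return sum(samples) / len(samples)

# One sample out of N catches a full 4.0 V -> 0 V dropout:
for n in (10, 20, 100):
    print(n, "samples ->", round(averaged_reading(4.0, 0.0, n, 1), 3), "V")
```

With one glitched sample in ten the reported value is 3.6 V, and with one in a hundred it is 3.96 V, so the more samples that are averaged for "accuracy", the more completely a one-shot dropout is hidden inside the tolerance band.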

The problem with both the "inherent" averaging and the "programmed" or explicit averaging is that both are ultimately based on a fixed sampling window or rate, while the intermittency is occurring randomly. The one-shot "glitches" that are generated when the electromechanical connectivity elements shift, or break down momentarily, simply cannot be guaranteed to occur in synchronization with the sampling pattern or measurement window of the digital instrument.

In addition to the randomness-in-time aspect, "spatial" averaging also needs to be considered when testing wiring harnesses, motherboards or complete aircraft systems. Here, the already glitch-limited capability of the single channel digital instrument must now be averaged out over a number of possible or "random" failure sites all requiring equal and substantial testing for any suspected intermittence.

The end result of all this averaging baggage is that you might catch a glitch if you are extremely lucky, or you might catch a part of it, or more likely, you might not catch any of it. Accuracy then, as well as repeatability, in the presence of age-related intermittency is a complete myth, and the information delivered by the digital instrument is at best a half-truth and generally it should more properly be considered a lie.
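The three outcomes (catch all of it, catch part of it, miss it) can be given rough odds with a simple geometric model. This is a sketch under our own assumptions: the meter measures for one fixed window in every cycle, and a single glitch starts at a uniformly random moment. The names are hypothetical, and the one-third measurement fraction is the figure quoted earlier in this paper.

```python
def glitch_outcomes(period_s: float, window_s: float, glitch_s: float) -> dict:
    """Odds for one randomly timed glitch against a meter that measures for
    window_s out of every period_s.
      full:    the glitch spans an entire measurement window
      partial: the glitch overlaps only part of a window (averaged reading)
      missed:  the glitch falls entirely between windows"""
    if not 0.0 < window_s <= period_s:
        raise ValueError("need 0 < window_s <= period_s")
    if glitch_s < 0:
        raise ValueError("glitch_s must be non-negative")
    p_any = min(1.0, (window_s + glitch_s) / period_s)
    p_full = min(1.0, max(0.0, (glitch_s - window_s) / period_s))
    return {"full": p_full, "partial": p_any - p_full, "missed": 1.0 - p_any}

# One-second cycle with one third of it spent measuring (assumed values).
for width in (0.100, 0.010, 0.001):
    odds = glitch_outcomes(period_s=1.0, window_s=1.0 / 3.0, glitch_s=width)
    print(f"{width * 1000:>5.0f} ms glitch:",
          {k: round(v, 3) for k, v in odds.items()})
```

In this model any glitch shorter than the measurement window can never be seen in full; the best case is a partial overlap, which the meter then reports as an averaged, intermediate value.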

The Missing Data: A "Glitch" Test Comparison of Instruments

In the following two screen-captures, we demonstrate how random intermittency is seen by two different test instruments. One is based exclusively on the latest digital measurement technology while the other, the IFD-2000/3000, a new unique hybrid design, consists of an analog front end designed to operate as a neural-sensor (parallel-based) which in turn employs digital processing on the back-end to enable traditional computer-based data handling and user interface chores. With this design, one gets the best of both analog and digital worlds, and in the testing of critical avionics circuits, this combination of technologies can make a huge, important, life or death difference in the results.

The instruments chosen represent the leading, top-of-their-class, most accurate devices available in a particular realm of the avionics-testing domain. If any conclusions are to be drawn from the results, it is not that one meter or device is better than another, but that no single meter does all tasks equally well. The choice of instrument for finding age-related intermittencies makes all the difference as to whether an aircraft is truly operationally safe and reliable or should be kept on the ground.

The problem in not being able to detect glitches accurately is not with the digital instrument's maker; it is with the test design engineer who chooses a testing platform without understanding the measurement requirements when aging failure mechanisms such as random intermittencies are involved. Along this same line of reasoning, we readily acknowledge that some digital devices have special operational modes that can aid in detecting glitches. However, since these special modes are not generally used in normal ATE-like parametric testing scenarios, and because their effectiveness is severely limited in multi-test-point situations, we did not test their effectiveness. Again, this test exercise is more a comparison of basic technologies presently in use than of any particular vendor's equipment. The reader is invited to perform the following rather simple tests and to insert his own favorite or selected device’s capabilities.

TEST SETUP:

In all examples the test method is the same. To simulate real world failure scenarios such as a wire either intermittently opening up or shorting out, we applied a steady 4.0 volt signal from an HP 811A programmable pulse generator, which was set up to drop the steady dc output signal from the applied 4.0 volts to ground potential as a one-shot phenomenon triggered by a random event. The "random event" was when we randomly chose to press the generator’s "one-shot" button. We tested each meter’s ability to capture the programmed "glitch" at various glitch-width settings appropriate to the measurement device’s range of capability.

FLUKE 189, Digital

We chose the Fluke 189 Digital Multi-Meter (DMM) as our baseline meter for its mid to high accuracy capability and its rather robust sampling speed. In order to see more than just digits flash when the intermittency occurs, we connected it to our PC via the RS-232 serial port. In this way we could capture and log the results over a longer time period, and we could accurately display in a graphical manner what is typically going on with Automatic Test Equipment as well as most point-to-point wire (continuity) testers used in the avionics testing industry today.

The proprietary computer display screen, shown below, is arranged in two sections and is designed to show not only the present reading but also the deviation from normal over an extended time period. Without such a display, all the operator will likely see is the meter’s digits flash momentarily, and unless total and constant attention is paid to the display, that small bit of testing information will be lost.

Bottom Screen Description:

The bottom section, consisting of the horizontal bar with a "fail" circle at either end, is a typical ATE measurement or adjustment bar. The bar’s length represents the scaled tolerance limits of the reading being taken, and the centerline ^ represents, in this particular instance, the expected or centerline value of the measurement. In operation, the short vertical bar above the limit bar represents the present reading. As the present reading fluctuates, it will leave a vertical ‘tic’ inside the limit bar at the various measurement points. As can be seen, there are several lines in the left portion of the measurement bar caused by the applied negative-going glitches. Notice that in this bottom screen graphic, there is a lot of non-repeatability in the measurements as the glitching causes the meter’s readings to vary from 2.0339 minimum to 4.0418, the applied maximum. A perfect example would have shown a single vertical mark in the limit bar in response to the approximately 65 repetitive glitches, each of which ranged from 4.0 volts down to 0 volts with pulse widths of 100ms, 10ms and 1ms, as annotated after testing.

Screen one. Glitch test with Digital Meter

Top Screen Description:

The top screen is a logarithmic virtual scale, plotting two readings over time. The ragged line (on the left side) just under the 10-volt scale marker tracks the meter’s normal output over time, while the "glitch-like" pulses underneath show the logarithmic deviations from the nominal value when the test glitches are applied.

The purpose of this top computer display is to provide the test operator with not only a scale of normal meter readings over time, but also the ability to track any small, usually unnoticed variations. These smaller deviations, when amplified by logarithmic scaling and tracked over time, help the technician or engineer to see and understand various small or micro disturbances that are often missed by normal measurement methods. These micro disturbances can often be used to predict catastrophic failures as well as demonstrate any small measurement drift over time as might be caused by ambient temperature and humidity variations.

Digital Test Results:

As can be observed on the Digital test screen, Screen one, the digital meter struggles to report accurately the 4-volt applied test glitches, even as wide in duration as 1/10th of a second. Notice that on the bottom of the screen under the 100ms bar, the minimum reading delivered was 2.0339 volts when the actual applied pulse was zero volts. The accuracy delivered is approximately 50% in this case, and as can be clearly seen, approximately 25% of the applied glitches would likely not have even triggered a failure with the testing program.

Under the 10ms bar, not a single 4-volt pulse triggered a response reflecting a change greater than approximately 0.2 volts, and as far as performance at the 1ms range is concerned, the digital meter almost completely missed any deviation whatsoever. This is as far down the intermittency scale as this meter can take us, even with the special logging and display software.

Advantages to Computer Control:

If we are just looking at the meter’s own display, these glitches show up as a very brief blip in the display numbers. It changes so rapidly, however, that we can’t really resolve what the true value of the glitch is, only that possibly something occurred or we may have blinked. If we are not looking constantly at the meter’s display digits, the glitch is not seen at all.

Typically, if an ATE’s operating program commands the meter to take a reading at the same time that the glitch occurs, it may sense a failure or it may not, depending on some rather precise timing. Even if it does see the glitch, the ATE may then go into retest mode, and if it subsequently finds the next reading to be OK, it will likely continue testing as though no fault had previously occurred. By testing and taking a series of readings over a longer time period, and by posting the true results to the screen, the test program or the test operator can easily see some of these mysterious glitches, at least of this magnitude and duration.
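The retest behavior described above can be sketched as follows. This is a hypothetical illustration of the logic, not the program of any particular ATE; the function name, the limits and the readings are all assumed values.

```python
def ate_test_step(readings, low: float, high: float, retests: int = 1) -> str:
    """Hypothetical ATE step: take one reading; if it is out of limits,
    retest up to `retests` times and pass on the first in-limit reading.
    `readings` supplies successive meter readings (at least 1 + retests)."""
    source = iter(readings)
    if low <= next(source) <= high:
        return "PASS"
    for _ in range(retests):
        if low <= next(source) <= high:
            # The earlier out-of-limit reading is discarded, not logged.
            return "PASS"
    return "FAIL"

# A glitch corrupts the first reading, the retest sees a healthy line:
print(ate_test_step([2.03, 4.00], low=3.8, high=4.2))
# A hard failure stays out of limits on the retest as well:
print(ate_test_step([0.00, 0.00], low=3.8, high=4.2))
```

With this kind of logic a hard failure is reported, but a one-shot intermittent that happened to be caught is overwritten by the retest and leaves no record.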

Every digital meter is different in its underlying specifications and ability to detect these random glitches and it should never be assumed that just because the meter returned a certain value this is actually what the signal was doing. When glitching or intermittency is present, the returned value may not even be in the ballpark as Screen One above clearly illustrates.

IFD Intermittent Fault Detector, Analog-Neural

Screen and Test Description:

Next, we applied the same battery of glitch tests to the IFD Intermittent Fault Detector, a parallel "neural-analog" device that does not scan, sample or perform averaging, whether testing 1 or 40,000 test points. Its main design purpose is to detect these random-in-place-and-time, age-related glitches rather than to deliver discrete parametric values to the test operator or computer program.

At the top of the screen, Screen two, the 6 columns represent the accumulated number of times these 6 test points detected the applied glitches, whose duration or event time ranged from 100ms to 1us respectively. The various test results are posted on the left side of the upper portion of the screen, and if you are viewing a colored printout, you can see the results are ranked and mapped to the colored caps on the tops of the columns. The oscilloscope-like trace at the bottom of the screen shows the signal that actually triggered the event as it appears to the IFD’s neural network. The color of the trace here is mapped to the color of that test-point’s columnar display. As was recorded, the 6 ranges of glitch durations were applied 25 times each, so the ranked accumulative score shown at the upper left portion of the screen is reporting 25 "hits" for each of the six glitch duration tests.

Screen two. Glitch test with neural-analog device

Neural-Analog Testing Results:

As can be seen, the IFD faithfully identified all 25 glitches at each of the 6 applied pulse durations. As is typical, there were no missed events and no false events. The IFD series will perform glitch detection correctly, right down to 320 nanoseconds, and custom models can go much lower. Also notice the layered trace response at the bottom of the screen. For all 6 test-glitch durations, the IFD logged all 25 traces directly over each other for perfect accuracy and perfect repeatability, something not even closely seen using the digital measurement device.

In addition to the IFD’s superior single-channel performance over the single-channel digital device, consider that the IFD-2000/3000’s neural network technology delivers this same level of detection simultaneously on all channels, regardless of how many channels are actually under test. Other single-channel testing devices will have to scan across a number of channels to apply their already limited capability and accuracy to the number of points that need to be tested in a large and complex system.

The Neural-Analog advantage

To begin to comprehend what this neural technology brings to the "aging" testing table, consider that on a system of multiple wires or units to be tested, digital technology designed and programmed for highest accuracy will require about 1 second per test point measurement. For the sake of simplicity, let’s include 0.1 second for switching times and other delays.

This one-second time includes the time to switch the measurement device from one point to the next, apply the measurement stimuli if ohmic readings are involved, followed by a settling delay while the test line and aircraft lines charge to the level of the stimuli. After the settling delay, the meter then very briefly samples the line in some fashion, followed by a period of averaging and computation and finally the delivery of a measurement value to the user. During this 1-second period, the line to be tested is under actual measurement for probably not more than 1/3rd of the total testing time. If any brief age-related glitches were to occur during the other 2/3rds, or the major portion of the measurement cycle, they would be completely missed, and if they occur during the brief sampling period, they would be integrated or averaged out to a degree reflecting the unsynchronized measurement window.

If the signal of interest was highly stable during the sampling period, the digital results will be highly accurate, but if unstable, the result will be highly inaccurate and may not reflect closely the actual circumstances. Glitches shorter than 100ms may be missed totally and those that are detected are usually highly inaccurate. Remember, in this discussion, it’s the age-related glitches that we are concerned with detecting. These glitches are the signs that a failure is imminent and the item being tested is unreliable.

Contrast the above with the IFD’s neural-analog technology. Since it does not have to switch from test-point to test-point or wait on delay times, we can assume a finite testing step equal to the device’s lowest resolvable intermittency capability, or 320 nanoseconds, with custom extension capabilities reaching below 50 nanoseconds. The net effect is that the Neural-Analog technology, on any single test point, effectively performs 3.125 million tests for each one-second test that our typical digital tester performs in a switching-type testing scenario.

Let’s assume that the time selected to parametric or continuity test a 100-wire harness with a typical ATE-type tester is 100 seconds: a one-second test for each wire. One single test per wire is thought by some to be adequate because it is high-accuracy digital equipment that they are using. We don’t agree, however, that any given wire will have been adequately reliability-tested in that short a time period. (See note below)

You could say then that, effectively, each individual neural test-point in operation is testing at 1/320ns, or 3.125 million times better each second than the digital tester. When this result is multiplied by the 100 test points in operation, all at the same time, you get a gain of 312.5 million for each second that you test. Because the digital tester is going to take the full 100 seconds to test all 100 points, we need to multiply the previous product again by the 100 seconds it will be testing. You now find the net gain in test coverage for 100 wires to be 31.25 billion, using IFD neural technology.
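The arithmetic in this paragraph can be reproduced with a short script. The inputs are the figures quoted in the text (320 ns resolution, 100 wires, one second per wire for the digital tester); the function and variable names are ours and purely illustrative.

```python
def coverage_gain(resolution_s: float, n_points: int, seconds_per_point: float):
    """Return (per-point gain per second, all-points gain per second,
    net gain over the time the scanning tester needs for one full pass)."""
    per_point = 1.0 / resolution_s                 # equivalent tests per second on one wire
    all_points = per_point * n_points              # every wire monitored at the same time
    scan_time_s = n_points * seconds_per_point     # digital tester visits wires one by one
    return per_point, all_points, all_points * scan_time_s

per_point, all_points, net = coverage_gain(320e-9, 100, 1.0)
print(f"per test point : {per_point:,.0f} per second")
print(f"100 test points: {all_points:,.0f} per second")
print(f"over 100 s     : {net:,.0f}")
```

The three values correspond to the 3.125 million, 312.5 million and 31.25 billion figures given above.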

Another way to look at this is that the probability of the digital tester finding any glitches as long as 320ns on a 100 wire harness is only 1 part in 31.25 billion. If these kinds of long odds were quoted on a gambling venture, the phraseology would be quite descriptive, and essentially true.

Note: On a real-world analog basis we may want to test over a much longer time period while a variety of stresses are applied that include vibration and temperature cycling. This type of stress testing is obviously economically impractical for scanning digital testers because of their low level of confidence problem and corresponding extremely long test times to compensate for it. You could literally wear the test item out waiting to be able to detect something. On the other hand, the idea of stress testing is highly practical for a neural-analog device that can usually pick up random low-level failures in just seconds.

Chart 3: Test Coverage / Probability of failure detection.

Should you need to increase test size from 100 wires to 1000 wires, an order of magnitude, the advantage in test coverage actually increases by two orders of magnitude. The advantage of neural-analog technology over digital, single-point-at-a-time technology grows with the square of the number of test points: the digital method’s total test time grows linearly as test points are added, while the neural method’s test time does not change at all even though it is monitoring every added point. With 10,000 test points to be tested, such as might be expected to be found in a modern fighter aircraft, a parallel array of analog neural networks will provide test coverage that is 312 trillion times more effective than can be obtained with high accuracy digital equipment.
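The scaling can be checked with the same style of calculation used for the 100-wire case. As before, the 320 ns resolution and the one second per test point are the figures quoted in this paper, while the function name is hypothetical.

```python
def gain(n_points: int, resolution_s: float = 320e-9,
         seconds_per_point: float = 1.0) -> float:
    """Net coverage gain: (parallel tests per second across all points)
    multiplied by (seconds the scanning tester needs for one full pass)."""
    return (n_points / resolution_s) * (n_points * seconds_per_point)

for n in (100, 1_000, 10_000):
    print(f"{n:>6} test points: gain of about {gain(n):.4g}")

# Ten times the test points gives one hundred times the gain.
print(gain(1_000) / gain(100))
```

Because the number of test points appears twice in the product, the gain is proportional to its square, which gives about 3.125e14 (roughly 312 trillion) at 10,000 test points.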

With odds of detection this low with digital technology, how long do you need to repetitively test a given element until you feel confident that you would have seen any defects that may be present?

Since the typical ATE digital tester can only take one "accurate" measurement per second and the Neural-Analog IFD is testing at an equivalent rate of 3.3 million times per second, a selected test time of 30 minutes gives the following calculation: 3.3 million tests per second * 60 seconds * 30 minutes = 5.94 billion tests, which at one measurement per second is 5.94 billion seconds, or 68,750 days of testing with the digital tester to get the same degree of confidence in the results as you would with the Neural-Analog. And this is just on a single test point.
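The day count can be verified with a few lines. The function name is ours; the rate of 3.3 million per second and the 30-minute test time are the figures used in the paragraph above, and the second call uses the 3.125 million per second (1/320 ns) figure from earlier in this section for comparison.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def equivalent_digital_days(tests_per_second: float, minutes: float) -> float:
    """Days a one-measurement-per-second tester would need to take as many
    readings as a device running at `tests_per_second` takes in `minutes`."""
    total_tests = tests_per_second * minutes * 60
    return total_tests / SECONDS_PER_DAY   # one digital measurement per second

print(equivalent_digital_days(3.3e6, 30))    # rate used in the text
print(equivalent_digital_days(3.125e6, 30))  # rate implied by 320 ns resolution
```

The first call returns 68,750 days (about 188 years); with the 3.125 million figure the result is a little over 65,000 days, so the conclusion is the same either way.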

Comparing other commonly used instruments for intermittent faults

To add some historical perspective to the digital vs. analog testing debate, we tested a variety of devices that are now being used and have been used by technicians and others for finding intermittent problems in avionics wiring. The results are quite interesting, especially where some of the older and less-expensive items actually outperformed their newer, more expensive and more complex digital counterparts.

1.) The Keithley Instrument Company graciously loaned us one of their highest precision meters, the Model 2001, for this "glitch" testing evaluation (shown below as the 7.5 digit DMM). The 2001’s accuracy is about three digits higher than the Fluke-189 DMM which was tested and featured in Chart 1 and shown below as the 4.5 digit meter, and rivals anything being used to test avionics equipment in the field or depot today, at least as far as stable parametric accuracy is concerned. 2.) The Analog MM is fairly generic for a wide variety of such instruments such as the PSM-6, Simpson 260, etc. 3.) The test light used is a typical testing tool such as may be purchased at any Radio Shack or automotive parts counter. 4.) The Neural-Analog device is Universal Synaptics’ standard IFD-2000/3000 and the comparison is made using just 1 of its 256 test lines.

While we were not able at this time to interface the Keithley 2001 DMM, or obviously the analog meter or test light, to the computer due to different data transmission protocols, we did find the test results to be quite informative by simply running the same series of tests and observing each meter’s front panel display for any signs of change during glitch application.

1) As the Keithley 2001 is a higher accuracy digital instrument, we naturally expected lower performance in the glitch detection tests, and we were not disappointed. 2) The analog multimeter, with its typical 20,000-ohms/volt sensitivity and a rather slow responding meter movement due to its weight and inertia limitations, also performed to expectations. 3) While the test light required higher levels of voltage and current, it actually performed comparatively well as long as it wasn’t being used outdoors, in direct sunlight. Keep in mind too, that with all non-computerized instruments, you had better not blink when the glitch occurs or you will miss it, and of course these simple devices do not leave a paper trail or proof that the item was indeed tested. It is important to note here that because testing is usually being done off-board in a less harsh environment, glitches are more likely to occur at a much lower level and are more likely to be of a one-shot variety. While even simple analog technology, such as the test light, has some clear advantages over digital in some regards, without high sensitivity, event latching, and parallel (multiple circuit) capabilities to support it, it also has its own built-in limiting drawbacks.

Chart 4. Commonly used instruments for intermittent faults.

Present Practices - Lessons Learned

NASA says: "Reliability analysis such as reliability block diagrams analysis is used to verify the fulfillment of quantitative requirements. The attribute of reliability, by definition, lies in the probabilistic realm while most performance attributes or parameters such as temperature, speed, thrust voltage, or material strength contain more deterministic characteristics. Within the accuracy of the measuring device one can directly measure performance attributes in the deterministic realm to verify compliance with requirements. No such measuring device exists for probabilistic parameters like reliability. It is usually estimated through comparison with similar components or systems through inference, analysis and the use of statistics." *7

With the IFD series of reliability testers now developed and on the market, the above statement about there being "no measuring device available for probability parameters" is no longer true. By developing the required level of accuracy through probabilistic means, finally making intermittency a deterministic measurable attribute of connectivity, the reliability of a unit can finally be measured directly, repaired, and actually improved upon as part of a direct testing and maintenance effort.

In a practical sense, the information and the metrics presented here are not new. While wire testers are employed to test wiring, the tests performed should generally be considered to be continuity or routing tests only and, as we have illustrated through comparison and glitch testing, have little or nothing to do with tests for aging or reliability.

Dennis Williams, wire testing lead at Hill AFB, in charge of running the massive DIT-MCO tester on the F-16 overhaul line, told the 2001 Aging Aircraft Wiring Working Group (AAWWG) attendees that as a matter of Air Force Policy, they never continuity test the entire aircraft, because doing so creates more problems than it resolves. He didn’t elaborate as to whether the wiring systems are too "brittle" to touch or if testing simply never finds anything. He says the only time they test the wiring on any part of the F-16 is when certain modifications are made to the wiring of specific systems, then only those systems are verified with a continuity test, to ensure it is safe to apply power. In a sense you could say that the wiring is tested much better operationally each time the aircraft is flown than it receives with present continuity testing. With no system failures to report after a flight, it could be assumed to some limited degree that the wiring and the signals must be going where they are supposed to go.

In similar fashion, LRU shops have complete continuity testing capability on all LRU chassis. They seldom if ever use it however, because they say they never find any suspected problems with it either. And as our equipment comparisons show, they are probably right!

NASA’s Space Shuttles, now with considerable age on them, are tested about as vigorously as any system in existence, but like everyone else, they are not testing for or finding many age-related problems with the digital-based testers they are using. The consequence is that twice in less than two years, electronically controlled explosive bolts have failed to fire, even after continuity testing just milliseconds before detonation time, and only grace, luck and double redundant charges have literally saved the day for them.

Just after the first occasion, even after exhaustive continuity testing of the harness and other wires leading to the bolts, an x-ray examination of the harness revealed that a broken, but still touching, intermittent wire was the cause of the problem.

By way of note: after the first redundant explosive bolt mishap, we informed NASA/United Space Alliance of our new neural-analog IFD technology. They subsequently asked us to demonstrate its capabilities at their facility by testing the failing cable. Once the IFD was properly adapted to this cable, and within seconds, the IFD had detected their intermittent wire. In demonstrating the IFD with additional testing over the next few minutes, the IFD subsequently caught two other low-level developing intermittents that both x-ray and their hastily kludged-up, daisy chained, analog multimeter had missed. While they were technically savvy enough to realize that an analog meter would be more effective than their digital continuity testers for finding a random intermittency, they also knew that their arrangement had shortcomings: the meter lacked basic sensitivity and, because the wires were daisy chained, they would not be able to identify the exact failing wire in the chain. In spite of this they assured us that they now had a handle on the testing problem and thanked us for the visit. As you know, just a little over a year later, it was again discovered that other explosive bolt charges had failed to detonate during launch procedures.

NASA’s real-world experiences follow exactly the results of our instrument comparisons published herein. Obviously, because of the second explosive bolt failure within just a few launches of the first failure, their digital continuity testers and possibly their analog-kludge intermittency testing equipment had again failed to detect the latent defect prior to launch. Except for redundant charges, both instances would have otherwise resulted in disasters. Concerned about the safety of future missions and just before the fatal Columbia launch, we again contacted NASA asking them why they had not implemented IFD testing and again apprised them of the possible risks and consequences. Obviously, by their lack of a credible response, they still do not understand the measurement and testing problem we describe.

NASA and others, testing critical, high priority equipment need to realize that digital based testing equipment has a massive blind spot when testing for these aging-related, randomly occurring, intermittent glitches. Only analog-based technology, when coupled with parallel operation, has the speed and bandwidth capability necessary to achieve the required level of confidence to tackle this testing job. Anything less is just going through the testing motions, wasting time and effort and putting the enterprise at risk.

Costs and Benefits of adding Neural-Analog testing:

The comprehensive costs of the NFF and aging wiring problems are enormous, and collectively, easily run into the tens of billions of dollars every year. The potential savings if you cut these costs are huge by anyone’s standards, but to effectively profit by it requires a modest investment at the aircraft or field level as well as at the LRU repair depots.

The equipment costs to fix the aging NFF problem at the aircraft level run about $300 per aircraft per year, based on a five-year amortization of approximately $80,000 in test equipment and options, applied across a fleet of 50 aircraft per IFD testing unit. At the LRU depot, costs can similarly be spread out over time. Compared with present ATE costs, such as might be found at an LRU repair depot, neural-analog testing is only about 10 percent of the cost of ATE.
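
The per-aircraft figure can be checked with a short calculation. This sketch assumes simple straight-line amortization with no financing, training or upkeep costs, which the text does not specify; the function name is our own.

```python
def annual_cost_per_aircraft(equipment_cost, amortization_years, aircraft_per_unit):
    """Straight-line amortized equipment cost per aircraft per year."""
    return equipment_cost / amortization_years / aircraft_per_unit


# Figures quoted in the text: $80,000 of equipment, five years, 50 aircraft.
cost = annual_cost_per_aircraft(80_000, 5, 50)
print(f"${cost:.0f} per aircraft per year")
```

The result, $320, is consistent with the rounded figure of about $300 quoted above.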

With some groups reporting that they presently spend 90% of their repair costs on intermittent problems and only 10% on the hard failures, the bonus here is that if this equipment is installed to repair legacy LRUs, or LRUs with high NFF rates, the cost-benefit ratios will be much greater and the payback period much shorter due to the increased ratio of intermittent failures to hard failures.

The difficulty maintenance organizations have in capitalizing on this neural testing solution revolves around the artificial constraints of organizational structuring and budget allocations. Field maintenance can certainly find huge savings by testing their aircraft wiring and other components before sending LRUs back to depot, especially when dealing with intermittent failures or inconclusive diagnostics. However, the LRU depots, as presently organized and funded, have a near-zero incentive to invest in equipment that would allow them to take the aging, intermittent problems out of their work product. Remember that these NFF units probably cycle through the depot system periodically with charges to the using squadrons, often in the tens of thousands of dollars for each "repair". When the system being tested is a few years old, and NFF problems are rampant, fully half of the depot workload may require only a simple functional retest and reshipping. What depot manager is going to invest any of his operational budget in testing equipment that would effectively kill this NFF "golden goose"?

In like fashion, with so much effort being expended on the NFF and Aging Wiring problems, what real incentive does the NTSB, FAA or any organization or associated individual really have in resolving this problem, when that solution risks bringing an end to their jobs?

If leaders at the top of the chain-of-command would somehow take the profit and politics out of the NFF problem, some huge benefits for all would follow. Perhaps the biggest benefit of all is that finally one ends up with a testing and maintenance system that works as you would expect it to work. As such, using the accumulated savings from presently wasted testing and maintenance efforts, organizations can start to replace these older legacy systems with new systems and new technology. For the space launch industry and their underwriters, perhaps finally they can start to plan new space launches in this country without worrying about some anomalous, untested glitch destroying the whole effort.

The key to making all of these improvements a reality is to first recognize that a serious testing void exists and that the reasons are not all technical. A definition of insanity, they say, is doing the same thing over and over and each time expecting different results. The testing and management philosophies of the past have put us where we are today, and trying to resolve this problem once again with more digital instrument upgrades to higher accuracy is, of course, simply not going to work!

SUMMARY

Continuity testing, used as a substitute for age-related reliability testing, may have at one time been quasi-acceptable due to the simple fact that no other testing technology existed, and possibly older analog avionics systems may have been somewhat forgiving of the effects of intermittent wiring and other continuity type elements. With the progression of avionics systems from analog technology to computerized digital technology, these systems have in some regards become more sensitive to intermittent failure modes or glitches. During this same period of technology transition and proliferation, as we have demonstrated in our research, digital measurement instruments, in their quest for higher accuracy, have become almost entirely blind to intermittent failure modes. Today, with the advent of neural-analog probabilistic testing equipment, to continue to use continuity testing or other digital measurement as any projection of reliability is a mistake and simply will not work to the degree required.

Making mistakes, however, is part of the human condition and no one is immune to it. Science and technology attempt to reduce the frequency and the consequences of these mistakes, but still they occur in large measure as a function of limited background experiences, peer pressure, social needs and unrealistic expectations. We take unnecessary risks for a wide variety of subjective reasons rather than taking the specific and objective actions needed to reduce those risks, even when making the necessary changes brings with it huge cost benefits.

Renowned physicist Richard P. Feynman*8, commenting on the reliability issues and calculations surrounding the 1986 shuttle Challenger accident, found a host of simple technological mistakes made by NASA engineers and managers who couldn’t agree on the risks. Management had the probability of disaster at 1 chance in 100,000, the engineers calculated it to be 1 in 1,000, while independent assessments put it at 1 chance in 100. His investigation showed how the O-rings, ultimately blamed as the root cause, had an extensive history of intermittent-type partial failures that had gone unheeded. His notable comments, which are also directly applicable to the industry’s approach to intermittent/NFF avionics/wiring failures, were:

"The acceptance and success of these (previous) flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next. For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."

In similar fashion, the bulk of the "Aging Wiring" problem, as well as the infamous "NFF" problem and several crashes and other expensive space launch failures, may also have been due to just such a parade of simple technological mistakes*9. Mistakes such as needing to catch age-related glitches, but using test equipment that averages the glitches out before you have a chance to see them. Mistakes of not doing the comparative confidence mathematics, and mistakes such as denial and ignorance in lieu of keeping abreast of new developments in the testing marketplace.
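
The averaging mistake can be illustrated with a few lines of arithmetic. The sketch below is a toy example under stated assumptions, not a model of any real meter or of the IFD itself: the resistance values, the sample count, the 1-ohm pass limit and the use of a simple peak check as the alternative are all hypothetical.

```python
from statistics import mean

# Hypothetical values for illustration only.
NOMINAL_OHMS = 0.1       # healthy contact resistance
GLITCH_OHMS = 100.0      # resistance during one brief intermittent event
SAMPLES = 10_000         # raw samples taken during one measurement window
PASS_LIMIT_OHMS = 1.0    # continuity test passes below this reading

# One window of raw samples containing a single glitch.
readings = [NOMINAL_OHMS] * SAMPLES
readings[4321] = GLITCH_OHMS

# An averaging instrument reports one number for the whole window.
averaged = mean(readings)
# A peak check (a stand-in for any method that keeps every event) does not.
peak = max(readings)

averaging_passes = averaged < PASS_LIMIT_OHMS
peak_passes = peak < PASS_LIMIT_OHMS

print(f"averaged reading: {averaged:.5f} ohms, passes: {averaging_passes}")
print(f"peak reading:     {peak:.1f} ohms, passes: {peak_passes}")
```

With these numbers the averaged reading stays near 0.11 ohms and the wire passes, even though one sample was a thousand times the nominal value.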

James Pierce, Chairman of ARINC Incorporated and a long-time supporter of doing something about the 50% NFF testing situation, spoke in an address to a recent RTCA Symposium of the industry’s plans for the Air Traffic System (ATS) of the future. The title of his presentation was "The Way Forward, Changing the Fundamentals". In it, he described recent technological developments as incremental in nature, allowing moderate advances yet providing no long-range goals or vision for the ATS of the future. His admonishment: "If we are to make substantial gains… we must make fundamental changes in how we view and operate the system. Before we begin debating the system architecture, hardware and software, let us first debate and then hopefully understand and agree upon the type of system we are willing to build for our future."

This sage advice applies even more so to the Aging Wiring Research Program*10 and the direction it is heading. They continue to build wiring test beds and develop testing plans to evaluate commercial vendors’ continuity testing equipment, but fail to incorporate built-in intermittent faults of varying degree to test against.

By not putting real-world intermittent defects into their test beds, their Catch-22-like circular reasoning ensures that they will never develop an understanding of the digital averaging problem and its effect on reliability, and ultimately safety. After insulating themselves from the facts, how can these people, responsible for finding solutions, make the fundamental changes or even claim to have had the required debate on what test systems for the future are going to look like? Their conceptual model of the aging and testing problem is several years behind presently available technology and unless some drastic changes are made, their programmatic efforts will likely remain in the past indefinitely. They simply need to repeat the same, simple, intermittency testing steps found herein, and then act on the results.

The question then is this: Do we use our new tools and technology to fix the aging unreliability problem or do we continue to spend huge sums that accomplish absolutely nothing and keep us technologically, scientifically, economically, socially and even defensively running in place? The research has been done on the NFF and unreliable testing problems, the data has been collected and the technology has been invented to fix it. The difficult part now seems to be in implementing the necessary changes.*13 It is called putting progress into action and is normally done at the programmatic and management levels where it is presently stalled.

In the ideal and even the pragmatic way of doing things, the innovators have a moral and often a fiduciary duty to invent solutions to problems while the program managers and maintainers of these systems also have an implied social and moral obligation to test and maintain these electronic and avionic systems in the best manner possible. Any short-term or self-serving benefits, falsely perceived to be gained by maintaining the present testing-void status quo, in the long term ultimately benefit no one and can put the entire enterprise at risk.

References:

  1. David Evans, "When the Average is not good enough", Avionics Magazine, October 2002 http://www.aviationtoday.com/cgi/av/show_mag.cgi?pub=av&mon=1002&file=1002safety.htm
  2. Brent Sorensen, Paul Sorensen, Gary Kelly, Artur Sajecki, "An analyzer for Detecting Aging Faults in Electronic Devices", IEEE, AutoTestCon94 http://www.usynaptics.com/ieeenew.doc
  3. David Evans, "Tracking System Glitches", Avionics Magazine, January 2003 http://www.aviationtoday.com/cgi/av/show_mag.cgi?pub=av&mon=0103&file=0103safety.htm
  4. Neil Aukland, Delphi Automotive System, Manuel E. Joaquim, Santovac Fluids Inc., "Lubricants extend the life of Sensor Connectors", Sensors Magazine, May 2000 http://www.sensorsmag.com/articles/0500/78/index.htm
  5. Stephen Dunwoody, Edward Bock, AMP Inc., John Sofia, Anatech Inc., "A Practical and Reliable Method for Detection of Nanosecond Intermittency" http://amp.com/products/technology/5jot_8.pdf
  6. Walter Shawlee II, "The analog digital coin toss", Avionics Magazine, Sept. 1999 http://www.aviationtoday.com/cgi/av/show_mag.cgi?pub=av&mon=0999&file=09avsystem.htm
  7. NASA-STD-8729.1 December 1998 : http://www.hq.nasa.gov/office/codeq/87291.pdf
  8. NASA Office of Logic Design, Report of the Presidential Commission on the Space Shuttle Challenger Accident, Volume 2: Appendix F-Richard P. Feynman, "Personal observations on Reliability of Shuttle".   http://klabs.org/home_page/hold_homepage_images.htm
  9. Jacquelyn Cochran Bokow, "Hydrogen Exonerated in Hindenburg Disaster", NHA, News 1997 (Interesting historical text on how a simple technological mistake, like painting the blimp with rocket fuel changed the course of aviation history. Worth a look.) http://www.hydrogenus.com/advocate/ad22zepp.htm
  10. White House Office of Science and Technology Policy, "Review of Federal Programs for Wire System Safety", OSTP, Nov. 2000   http://www.ostp.gov/html/wire_rpt.pdf

  11. Walter Shawlee II, "How Parts and Systems Age", Avionics Magazine http://www.aviationtoday.com/cgi/av/show_mag.cgi?pub=av&mon=1100&file=column2.htm
  12. Ed Mayer, Aviation Maintenance Magazine, "No Fault Finder" http://www.aviationtoday.com/cgi/am/show_mag.cgi?pub=am&mon=0500&file=05nff.htm
  13. Jim Saltigerald, 2005 AMC Presentations, Reliability Data Analysis Document. http://www.arinc.com/amc/reports/2005/presentations/nff_saltigerald.pdf
  14. Youssef Bahsoun, 2005 AMC Presentations, NFF Cultural Aspects http://www.arinc.com/amc/reports/2005/presentations/nff_bahsoun.pdf

IFD PORTABLE

IFD-2000/3000 Transportable NFF Analyzer for field and/or depot use.

The "Electron Microscope" of Aging Circuit/Wiring Analyzer