AN ANALYZER FOR DETECTING AGING FAULTS
IN ELECTRONIC DEVICES
Brent A. Sorensen - Gary Kelly - Artur Sajecki - Paul W. Sorensen
Neural Engineering Research and Development
Universal Synaptics Corporation
1801 West 21st Street Ogden, Utah 84401
In theory, electronic systems should not wear out with use, and yet obviously they do. While the amount of waste caused by this dichotomy is staggering in terms of safety, poor reliability, and early system replacements, little is being done to prevent or even understand it. A case in point is the long-standing fifty-percent No Fault Found (NFF) rate in the testing of aging avionics systems, which has for decades defied all attempts to reduce it.
In this paper, we apply Occam’s Razor to examine the technical and some of the political issues associated with testing and eliminating latent intermittents. More importantly, we challenge industry mind-sets regarding the viability of functional testing as an indicator of reliability, and we question the wisdom of applying fixed “cradle-to-grave” testing philosophies when evidence shows that the predominant failure mode changes over the life of a system.
Because testing requirements do not change to keep pace with the changing failure modes, latent tester-void defects accumulate, resulting in decreasing reliability and eventually in early system replacement when the system can no longer meet its operational requirements and is erroneously assumed to be worn out. To validate this concept of tester-void defects, we examine the internal measurement mechanics presently in use and show why these methods are not able to detect age-induced intermittent failures. As an enabling solution, we introduce a new testing paradigm developed by Universal Synaptics that fills this long-standing void and provides a direct testing means for long-term reliability.
It is fairly easy to accept that vibrational stress and corrosion in the connectivity elements (connectors, wires, splices, switches, solder joints, etc.) can cause the functionality of these often intermittent-by-design elements, and of other elements, to get progressively worse over time [1]. It should also be considered that this aging degradation is a normal process, and that little can be done to prevent it unless engineers take rather extreme measures to alter the environment in which these electronic elements are used, the way they modularize various components, and the materials they use to connect these complex systems.
Judging from current testing practices, however, it is perhaps very difficult for test engineers to understand that this aging or “degradation over time” phenomenon does not produce a steady ohmic degradation leading directly to hard failures; something quite different actually occurs. The failure mode most likely to be seen in aging systems is randomly occurring intermittent discontinuity, which, when explicitly tested for with continuity, functional, or other single-point-in-time test equipment, generally shows no sign of this age-related ohmic degradation.
As an example of how this phenomenon occurs, consider that at the micro-surface of a connector, wear is caused by vibration and use. Given enough time, the peaks, which are actually making the electrical contact, gradually wear off, and the valleys fill up with this and other oxidized and insulating debris. Only part of the surface area is actually making electrical contact at any given time. This well-documented aging phenomenon is called fretting corrosion [2].
At any given time, then, there are a number of random micro-matings: peaks mating with peaks, which make good connections, and matings involving some measure of insulating debris, which make imperfect connections. Due to ambient stress factors such as vibration and thermal expansion, these mating surfaces can actually slip, momentarily forming a new set of micro-connection sites. Should the sum of these new random matings become predominantly ohmic at any instant, and remain at a sufficiently high level for a sufficient period of time, it can cause random continuity glitches, which in turn can cause a momentary loss of signal or data. These intermittent defects are seen during operational use as anomalous system failures and during subsequent testing as NFFs.
Over time, some of these constantly degrading connectivity elements may eventually degrade severely enough to be considered hard failures, which may then be detected during normal testing and eliminated with conventional testing processes. However, until that time, each individual failure site may exhibit randomly occurring intermittent failures with an ohmic level and duration that reflect both its own particular aging profile and the ambient, stress-induced random positioning of its contacting surfaces. If the failure is intermittent, its exact performance before, during, or after any anomalous events cannot be predicted, nor is it necessarily repeatable.
While the above describes the molecular action of a connection of some type producing an “open” type of intermittence, the same degradation-over-time phenomenon applies equally in cases where wire insulation may be slowly and progressively chafing. This chafing will lead first to insulation quality degradation, micro-shorting, and arcing, and eventually, with enough additional wear, to a hard short with possible catastrophic consequences. The same functional and testing problems as outlined above apply, as ambient stress plays a part in triggering the early-on random intermittent failures.
PLAYING A FOOL’S GAME
Because of the unpredictable randomness of where, when, and to what degree an intermittent fault will occur, coupled with the complexity of modern avionics systems being operated in safety-critical environments, it would be a mistake to believe that any intermittency, whether of an opening or shorting variety, could be segregated into classes of “good” and “bad”, or that judgements could be made as to the safety-criticality of any failure events with humans involved, by targeting some for repair immediately, some for routine maintenance later, or as is usually the case with NFF, no maintenance whatsoever.
Similarly, it should not be assumed that testing with any traditional or experimental tester can determine a circuit’s real physical fitness or health if the item in question is in a “good” state during the exact moment of testing. If enough of the connectivity element’s micro-surface is meeting tester-dependent acceptable limits during the test, there is absolutely no way that its other probable intermittent failure states or its inherent reliability can be determined. The problem must be present, at a critical level, at the exact time of testing to be seen. Claims to the contrary by opportunistic equipment vendors anxious to cash in on the “mysteriousness” of the NFF/CND (Can Not Duplicate)/Aging phenomenon should be reviewed carefully.
In 1993 the Air Force, to its credit, recognized its testing deficiencies and asked test equipment vendors to submit bids for a CND/NFF tester, to which the amused test equipment industry responded by explaining the relationship between “Can Not Duplicate” and “testing”. At AutoTestCon94, after fixing thousands of these otherwise CND problems, we explained to a capacity audience exactly what CND was, with a presentation of an earlier version of this paper and a demonstration of the testing technology. Later, at AutoTestCon98, after decades of complacency toward the needs of their customers, nearly every vendor’s booth was suddenly advertising a cure for the CND problem, but with the same old technology they had been hyping decades before. The very reason for the CND/NFF problem was now mysteriously its cure. The past president of the Airline Maintenance Conference, Peter Fussinger, remarked recently that he now suffers at least one “snake-oil” proposal a week from a vendor claiming to have “all the answers” to his testing problems. He and others in the maintenance industry are consequently very cautious about wasting their time looking at any new equipment and finding only unsubstantiated claims [3]. Unfortunately, this environment of cynicism makes it difficult for innovative and workable testing technologies to be considered and accepted.
Of course, some tester claims concerning CND/NFF are not totally invalid. A continuity, ATE, TDR, SWR, or any other present or experimental tester can see an intermittent problem if it is bad enough. What matters is the degree of efficiency in detecting and resolving these developing age-related defects. With so many exaggerated claims in the developing, age-related testing marketplace, some foundation for advertised claims is imperative if safety-conscious maintenance professionals are to make wise and appropriate testing choices. All single-point-in-place-and-time serial testers advertising NFF/CND capability are more properly classified as hard failure testing instruments only; their performance suffers when the fault is intermittent.
By definition alone, a connector, wire or other element is good if it meets its prescribed engineering specifications for ohmic or other quality at the instant of testing. However, no determination as to reliability can be made without actually testing for the presence of expected intermittency with equipment that can actually detect it to some very high degree or standard of efficiency.
Replacing legacy test equipment with new but similar measurement technology may be required when the equipment is no longer supported by the vendor, but it is not an action that should be taken when a system reliability upgrade is the primary goal. A new ohm or other meter is going to provide the same data as the old one, so of course, the reliability, or the lack thereof, of the tested units will remain the same.
Some testing engineers actually believe that having no intermittency or aging standards equates to having no need to test for it. Maybe they should consider instead that the proper interpretation of a lack of standards is that no level of intermittency is acceptable.
WHAT ARE THE STANDARDS?
At issue then is: What standards or specifications should be used as criteria for determining the continued suitability for acceptance in the face of ongoing deterioration? Generally, the design engineer selects and derives his testing requirements from the component vendor’s published specifications. One must then question how the component vendors have been developing those specifications without having intermittency testing equipment, and how practical are their methods outside the testing lab?
Typical connector life testing may involve the repeated “rubbing together” of a set of pins thousands of times until wear and fretting corrosion build up to a level exceeding some speculative limit such as one ohm. It would be a mistake to believe that you could get this number of cycles in real use, or that there is no need to test because of the assumption that a given connector will never be cycled that many times.
A “one ohm” specification is highly arbitrary and is circuit dependent [4]. It is simply a “rule of thumb”. A one-ohm resistance on a 100-amp power bus in an aircraft means an insulation fire: I²R heating melts the insulation of the wires attached to the super-heated bus and causes them to arc and short together. At the other end of the spectrum, a one-ohm change in resistance on the input pin of a high-impedance circuit will never be seen as a problem during functional testing or while in operation. Only by specifically testing for these age-related deviations can you detect developing failures.
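As a rough check on the two extremes described above, the arithmetic can be sketched in a few lines of Python. This is an illustration only: the function names are ours, the one-megohm input impedance is an assumed value, and the power-bus figure assumes the full 100 amps continues to flow through the one-ohm connection.

```python
def joule_heating_watts(current_a: float, resistance_ohm: float) -> float:
    """Power dissipated in a series resistance, P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


def signal_fraction_lost(series_ohm: float, input_impedance_ohm: float) -> float:
    """Fraction of a signal dropped across a series resistance feeding a resistive input."""
    return series_ohm / (series_ohm + input_impedance_ohm)


# Power-bus case from the text: 100 A through one ohm.
bus_watts = joule_heating_watts(100.0, 1.0)

# High-impedance case: the 1 megohm input impedance is an assumed, illustrative value.
lost = signal_fraction_lost(1.0, 1_000_000.0)

print(f"Power bus: {bus_watts:.0f} W dissipated in the connection")
print(f"High-impedance input: {lost:.6%} of the signal lost")
```

Under these assumptions the same one ohm dissipates ten kilowatts in the first circuit and costs about one part in a million of the signal in the second, which is why a single fixed limit cannot serve both.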
Many well-equipped repair depots, with testing stations set up to continuity test large LRU chassis and motherboards, sit in disuse. Because of their inability to find known or suspected developing intermittent problems with hard failure type equipment, technicians skip this part of testing, relying on the full-up functional test alone to verify proper connectivity. This lack of confidence in their equipment, in turn, leaves many undetected hard failures to cause borderline failures during operation.
Probably no engineer at the present time is going to spec every connector pin in his designs and testing requirements; however, because of all the age-related wiring problems, some CAD electronic design packages are seeing the merit in doing so. This one-ohm “rule” is at best only a way to compare various connector designs and materials for relative resistance to wear and reliability.
Testing labs perform specification testing often using testers that cannot see intermittencies efficiently, so consequently, they may take millions of samples over a period of weeks falsely assuming that if any defects occur they probably will detect them. They then apply statistical means on these millions of samples to derive estimates of durability. Even if actual anomalies are “measured”, statistics are applied to this failure data to relegate many of these failures to the “false failures” heap. In addition, a specification “cycle” does not necessarily mean the “taking apart” of a connector. It may more correctly refer to the aforementioned “rubbing” or micro-motion wear cycles, which could occur any number of times due to vibration and thermal expansion during use and even shipping. We have seen new connectors with excellent cycle ratings become intermittent and fail in multiple circuits in as little as one day of HALT testing.
Due to the randomness of these intermittent events in time, place, and severity, traditional testing technologies, which measure one point at a time using test point scanning and digital sampling methods, are not effective in detecting intermittent connection problems. Their main “benefit” may be a false sense of security. In multi-interconnect systems, it is simply a matter of sensitivity and probabilities. Adding vibrational or other environmental stress during the testing process in order to increase the odds of detecting latent intermittencies may still be of little or no value without the use of proper detection equipment. Applying excessive amounts of stress to compensate for certain measurement deficiencies with traditional testing can damage sensitive circuit boards and components, and actually compromise the reliability of the tested unit.
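The limitation of one-point-at-a-time scanning can be illustrated with a toy simulation. The sketch below is not a model of any particular tester: the number of lines, the time steps, and the position and length of the glitch are all hypothetical values chosen for illustration. It compares a scanner that visits one line per time step with a monitor that watches every line at every step.

```python
N_LINES = 8        # hypothetical number of test points
N_STEPS = 80       # hypothetical number of time steps observed
GLITCH_LINE = 5    # hypothetical line carrying the intermittent
GLITCH_START = 42  # hypothetical step at which the line opens
GLITCH_LENGTH = 2  # hypothetical duration of the event, in steps


def line_is_open(line: int, step: int) -> bool:
    """True while the simulated intermittent is present on the given line."""
    return line == GLITCH_LINE and GLITCH_START <= step < GLITCH_START + GLITCH_LENGTH


def scanning_tester_hits() -> list:
    """Scanner that looks at exactly one line per time step, in rotation."""
    return [(step, step % N_LINES) for step in range(N_STEPS)
            if line_is_open(step % N_LINES, step)]


def simultaneous_monitor_hits() -> list:
    """Monitor that watches every line at every time step."""
    return [(step, line) for step in range(N_STEPS) for line in range(N_LINES)
            if line_is_open(line, step)]


print("scanner saw:", scanning_tester_hits())
print("monitor saw:", simultaneous_monitor_hits())
```

With these particular values the scanner is looking at lines 2 and 3 while line 5 is open, so it records nothing, while the monitor records both steps of the event. A different alignment could let the scanner catch the event; the point is only that detection then depends on chance.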
OTHER PROBLEMS WITH THE PROCESS
In older avionics systems, 50% of all pilot-reported operational discrepancies go unrepaired. These undetected intermittent defects are currently being labeled and disguised as Can Not Duplicate (CND), No Fault Found (NFF), False Removal (FR), No Evidence of Failure (NEOF), No Problem Found (NPF), Cannot Verify (CNV), Retest OK (RETOK), and other repair diagnostic descriptions. These descriptions and interpretations imply that a problem never existed. They suggest that the Unit Under Test (UUT) was erroneously removed and the pilot or initial technician may have made a mistake, or that the problem amazingly disappeared. What must be clearly understood is that somewhere along the line a problem was encountered, whether during regular operation, diagnostics, or maintenance. A UUT would not exist if a problem had not been encountered. These interpretations and misnamed descriptions, along with well-intentioned studies and fragmented solutions, have caused the root of the problem, intermittency, to be overlooked. These labels are, in fact, a large part of the problem.
As an example, when a pilot experiences a problem during flight, he relates the symptoms to the ground crew at the debriefing. Upon examining the Built-In Test (BIT) and troubleshooting the aircraft, the ground crew either duplicates the reported symptom or does not.
If they can duplicate the problem, the suspect Line Replaceable Unit (LRU) will be removed and sent to depot repair. However, if they are unable to duplicate the problem, they have encountered a "flight-line" CND/NFF/intermittent. To clear the malfunction write-up, they may pull a suspect LRU and send it in for depot repair, not truly knowing whether or not the LRU is defective.
Additionally, the maintenance crews have no intermittent fault detecting equipment capable of isolating a problem that exists between the aircraft's wiring and the LRU. They could manually ring out 50-100 wires, possibly stretching from one end of the aircraft to the other, but experience with inadequate equipment has taught them that they are probably going to waste a lot of time and still not find anything. Therefore, they have no expedient choice but to perform "best guess, shotgun maintenance," which often results in the replacement of one or more perfectly good, yet suspect LRUs.
Suspect LRUs that are removed from the aircraft are marked with a malfunction code indicating a hard failure and sent to the depot repair shop (the correct code should indicate an intermittent condition or CND/NFF). The depot LRU shop will fare no better than the ground crew at locating an intermittent condition with their Automatic Test Equipment (ATE) [5]. Even if their ATE could detect an intermittent problem during functional testing, isolation to the exact circuit card or chassis wiring will be difficult if not impossible because of the requirement for repeatability during diagnostic testing. Because of this lack of failure evidence or repeatability, they may send one or more good, yet suspect, circuit cards to the Shop Repairable Unit (SRU) shop for more “in-depth” testing.
The depot SRU shop will face the same problems in locating intermittent defects as the LRU shop. The suspect circuit cards will be “tested” and returned to the LRU shop unrepaired. The LRU shop will then install them in different LRUs. These LRUs will, in turn, be sent to other flightlines where they will be installed in different aircraft. Diagnostic chaos ensues when an intermittent LRU from supply is installed as part of a diagnostic procedure and contains the same problem as, or a different problem from, the LRU that was replaced.
At this point, it is imperative to understand that no functional, BIT, BITE, border, continuity, diagnostic, SWR, FDR, TDR, exotic, or any other test will find a stress-induced intermittent problem until the intermittent event actually occurs. Conversely, when the event does occur it may be just a short, one-shot event and at an extremely low level; it might not even register as a problem. When it occurs, you need to have sensitive, simultaneous, and continuous testing in place to catch it.
Many complaints about the inadequacy of BIT tests and other on-board testing schemes are often misdirected. These on-board tests typically run numerous times during a flight, and often pick up these intermittent problems and accurately report them.
With multi-level testing, the philosophy is that the further away from the aircraft you get in testing, the more involved and accurate are the tests. However, with intermittents, the exact opposite applies. On-board, an intermittent circuit may get tested hundreds or thousands of times, but while on the ground, possibly only once briefly.
Therefore, even though defective LRUs may get pulled from the aircraft on the advice of BIT, the problem, being intermittent, is most often not duplicated with any brief, one-time testing on the ground, so BIT gets the blame that ATE deserves in this case, and yet another NFF is logged.
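The difference between many on-board test runs and one brief ground test can be put in rough numbers. The following sketch assumes that each test run has the same small, independent chance of coinciding with the intermittent event; that independence, the per-test probability, and the run counts are illustrative assumptions, not measured values.

```python
def prob_detected_at_least_once(p_per_test: float, n_tests: int) -> float:
    """Chance that at least one of n independent tests coincides with the fault."""
    return 1.0 - (1.0 - p_per_test) ** n_tests


P_PER_TEST = 0.001  # hypothetical chance that a single test run catches the event

for label, n in (("single ground test", 1), ("1000 on-board BIT runs", 1000)):
    print(f"{label}: {prob_detected_at_least_once(P_PER_TEST, n):.1%}")
```

Under these assumptions a thousand on-board runs have roughly a 63 percent chance of catching the event at least once, while the single ground test keeps its one-in-a-thousand chance.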
Programs that track aircraft and LRUs for repeat failures (intermittents) or use previous repair information (enhanced diagnostics) to improve the process have fundamental flaws that should be recognized. Since erroneous (hard) malfunction codes are being reported as a result of "shotgun maintenance", the basic information on which these "improvements" rely is faulty. Another problem with tracking programs is that at least two, and likely dozens, of repeat aircraft failures are required to develop any useful patterns of intermittency. Even if an onboard or flightline program could identify an intermittent unit, there is still little chance that the LRU or SRU shops, equipped with only “hard failure” detection equipment, could detect and repair the exact cause of the intermittent fault.
Without a direct testing solution to get intermittents out of the operating systems and spares pools, all of these other well-intentioned testing and tracking programs only give a false sense of security, and may be as big a problem as the one they are trying to eliminate.
WHAT EXACTLY IS AN INTERMITTENT?
An intermittent is any temporary deviation from nominal operating conditions of a circuit or device. In addition, any circuit or device exhibiting such a deviation can also be called an intermittent. This definition encompasses both the media-popular “short circuits” as well as the more numerous yet less understood “open circuits”.
Shorts are generally problems of abuse and neglect that never should have been allowed to happen. We know how to prevent them, but we do not. At its root level, this is an issue of management, engineering, maintenance, and economics. In the second case, “opens”, we are talking about problems resulting from normal wear and use. In both cases, we think we are testing for the problem, but with such high NFF rates and other aging/safety problems, obviously we are not.
WHAT EXACTLY ARE WE TESTING?
There are only two types of electronic failures: hard failures and intermittent failures. Hard failures are detectable every time the unit is used or tested; everything else may be considered to be an intermittent.
While this concept is easy to understand, the confusion surrounding CND/NFF/AGING arises because there are three basic types or causes of intermittents: Engineering, Test Void, and Connection.
Engineering or Design intermittents are encountered when a normal operating event causes a circuit to temporarily deliver a wrong output. This occurs as a result of complex interactions between various system components and is often related to specific timing events. These defects include switching transients, induced EMF, load changes, ground loops, cross-talk, leakage through circuit boards and conformal coatings, software, or poor initial design.
Engineering intermittents are generally evidenced as a syndrome where all like units in a system are occasionally failing under some similar operational or testing circumstances. These intermittents are often difficult to isolate and correct. However, because they are usually fixed during the beginning of operational use, they are usually not a major factor in the overall problem of intermittence, especially in older systems.
When an individual UUT continuously fails to perform its function in the operating system or fails a high-level test, and the malfunction is not detectable at a lower level of testing, it is often referred to as a Test Void intermittent. This type of failure should be thought of more as a “hard failure” with an “intermittent” overall testing program and should not be confused with true age-related intermittent failures, where the testing programs are “good” but simply cannot detect failures unless they actually occur during the testing process.
Because individual units exhibiting “Test Void” intermittence, whose failures can always be repeated at some level of testing, do not get reinstalled on the aircraft until they are fixed, they do not hazard the aircraft. This “hard” class of failure, Test Void intermittence, is easy to fix by correcting the lower level test programs to detect the actual failing components. This category of intermittents, however, often receives the most attention by engineering because the problems are isolatable, fixable, and the results are quantifiable; it is easy.
Some engineers, especially on older aircraft systems, incorrectly include Engineering and Connection intermittents in the Test Void category. They assume that because their test equipment or test program cannot find all or as many pilot or aircraft reported problems as when the aircraft was new, it must now be deficient in some manner, i.e., “worn out”. To resolve this assumed tester obsolescence, they buy new, but same-technology, testers or invent other work-arounds such as tracking and database “solutions”. They fail to realize that the types of failures have changed over time and that no test or diagnostic procedure can detect and isolate a random intermittent problem that does not occur at the exact instant it is being measured or tested.
In most cases, the occurrence of Engineering and Test Void intermittents should decrease over the life of a system as their root causes are discovered and resolved.
Connection intermittents are caused by a temporary change in a circuit's continuity path. The root causes of these changes, or breaks, range from contact fretting to the more familiar types seen as loose (cold) solder joints, oversized or worn connector pins, corroded or oxidized connections, noisy components, loose terminal screws, and a host of other similar causes. These types of defects can occur at any stage in a product's life, based on the accumulated amount of wear or stress encountered in the system's operating environment. As an example, fretting corrosion can be caused by small slippages of a connector’s micro-contacting surfaces, triggered by as little as a few degrees of thermal expansion [2].
It would be incorrect to assume, however, that these intermittents are constant or repeatable. A micro-break type intermittent one instant may easily become a wide-open break the next, due primarily to the instantaneous random alignment of the hills and valleys at the micro-surface of the contact material and the ambient stress at any given moment.
Because Connection intermittents in a general sense grow worse over time and are rarely repaired, they eventually become a major cause of failure in older electronic systems. When they begin (stage-1), they are seen as small, short duration fluctuations, voltage drops, or electrical noise. They generally remain undetectable or unnoticed in their early stages. As the amplitude and duration of the fluctuations increase (stage-2), random system failures begin to occur. Because of inherent measurement technology limitations, traditional testing methods such as ATE and continuity testers cannot catch these random intermittent events until they reach stage-3, or until they progressively become hard failures.
Connection intermittents have a pronounced negative effect on reliability. Unrepaired intermittent units continually cycle through the system, causing even good units to be sent in for repair. This is evidenced by a decrease in a system's Mean Time Between Failures (MTBF) and an increase in Can Not Duplicate (CND/NFF) failures. If they are left unresolved, entire systems might have to be replaced as maintenance costs skyrocket. This lack of proper testing for connection intermittents is often the main reason that electronic systems are considered “worn out”.
PROBLEMS WITH TEST EQUIPMENT
The test equipment used in avionics testing, such as Automatic Test Equipment (ATE), In-Circuit Testers (ICT), TDR, SWR, and continuity testers, generally does an excellent job of verifying operational specifications and diagnosing hard failures. However, these testers fail to deliver anything close to the same performance when the problem is intermittent. Because of their inherent “Apply Stimulus / Measure / Compare” technology, most of the test time is spent setting up stimuli, and very little time is spent actually measuring. Also, they can only measure one circuit output function at a time.
The probability that these types of testing devices will be reading the intermittently failing test point at the precise time a developing failure actually occurs is quite small and can be easily calculated using the tester's operational specifications. It is not unusual to find ratios of a million to one when comparing what a conventional tester is actually doing with what it should be doing.
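One simple way to make such a calculation is to ask what fraction of the total test time is spent actually measuring any one test point. The sketch below is a back-of-the-envelope model, not a description of any specific tester: the function name, the point count, the measurement aperture, and the per-point cycle time are all hypothetical.

```python
def observed_fraction(n_points: int, aperture_s: float, cycle_s: float) -> float:
    """Fraction of wall-clock time that one given test point is actually being measured.

    The tester visits n_points in turn; each visit costs cycle_s seconds
    (stimulus setup, settling, measurement, comparison), of which only
    aperture_s seconds are spent measuring.
    """
    return aperture_s / (n_points * cycle_s)


# All three figures are hypothetical, not taken from any tester's data sheet.
fraction = observed_fraction(n_points=1000, aperture_s=10e-6, cycle_s=10e-3)
print(f"Each point is watched {fraction:.1e} of the time, "
      f"about {1 / fraction:,.0f} to 1 against seeing a brief event")
```

With a thousand test points, a ten-microsecond aperture, and ten milliseconds per point, each point is observed for about one millionth of the test time, which is the order of magnitude quoted above.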
Another problem with ATE is that testing is usually performed as quickly as possible to reduce cycle time and shop costs and is usually done with no applied stress. This is in direct contrast to how random, stress-induced intermittent defects actually occur while the system is in use. Under normal testing procedures and conditions, it may take hours or days of system testing for the failure to repeat, if it repeats at all.
To increase diagnostic granularity, ATE test programs often partition the UUT model into small units, which test only a small portion of the UUT's circuitry at a given time. If an intermittent occurs in a circuit that is not under test at a particular time, the intermittent event obviously will not be detected. Even if ATE should happen to be testing the right output at the right time and an intermittent failure is detected, the exact cause of the problem may still not be found or repaired. For ATE to be effective, the failure must be repeatable and in sync with the ATE test program when the technician back-probes the circuitry to isolate the defect. If it is not repeatable, as it would be with a hard failure, the original detection may have limited or no value, and after a large investment in additional testing and diagnostic time, the circuit may still go unrepaired.
In addition to the above deficiencies, most ATE in use today is not sensitive enough, by design and programming, to catch small (stage-1 and stage-2), random events riding on output signals. Signal filtering and measurement averaging compromise the measurement function, allowing many small developing intermittents to escape undetected. If a connection is physically loose but still making electrical contact, as with loose solder joints, dirty relay contacts, frayed wires, or loose connector pins, it is rarely discovered during functional or component testing.
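The masking effect of measurement averaging can be shown with a small numerical sketch. The signal level, the pass band, and the size and position of the dropout below are hypothetical values chosen only to illustrate the mechanism.

```python
from statistics import mean

NOMINAL_V = 5.0   # hypothetical nominal output level
TOLERANCE = 0.05  # hypothetical pass band of +/- 5 percent


def within_limits(volts: float) -> bool:
    """True if a reading falls inside the pass band around the nominal level."""
    return abs(volts - NOMINAL_V) <= NOMINAL_V * TOLERANCE


# 1000 samples at the nominal level with a 3-sample dropout to zero volts.
samples = [NOMINAL_V] * 1000
samples[400:403] = [0.0, 0.0, 0.0]

averaged_pass = within_limits(mean(samples))               # what an averaging tester reports
per_sample_pass = all(within_limits(v) for v in samples)   # what a sample-by-sample check reports

print("averaged reading passes:", averaged_pass)
print("every sample passes:", per_sample_pass)
```

The averaged reading is 4.985 V and passes, even though the signal dropped completely to zero for three samples.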
To compound the situation further, in some ATE functional testing programs, intermittent failures are actually assumed to be false failures. Rather than report the intermittent defect, the testing software will actually go into a programmed retest loop and actually wait for a period of time for the failing signal to finally correct itself.
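The effect of such a retest loop can be sketched as follows. This is a hypothetical illustration of the general pattern, not code from any ATE test program: the function names, the retry count, the pass band, and the readings are all assumptions.

```python
def passes_with_retest(readings, in_limits, max_retries=3):
    """Retest-loop behavior: pass the step if any of the first few readings is good."""
    for value in readings[: max_retries + 1]:
        if in_limits(value):
            return True
    return False


def passes_strict(readings, in_limits):
    """Alternative: treat any out-of-limit reading as a reportable event."""
    return all(in_limits(value) for value in readings)


def in_limits(volts):
    return 4.75 <= volts <= 5.25  # hypothetical pass band


readings = [0.2, 5.0, 5.0]  # hypothetical: the first reading lands on a dropout

print("retest loop passes:", passes_with_retest(readings, in_limits))
print("strict check passes:", passes_strict(readings, in_limits))
```

The retest loop reports a pass because the second reading is good, so the one reading that actually caught the intermittent is discarded.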
Another phenomenon that often occurs while testing LRUs and SRUs is that technicians will temporarily "fix" connection problems by reseating connectors and circuit cards in the LRU or UUT fixture. As a response to production pressure, they may also try to pass bad units by attaching weights to the test cables to ensure that a skew bias in the UUT-to-tester pins will likely exist, thereby helping to make contact. These “repair actions”, which may temporarily force connectivity or remove such insulating films as fretting oxidation and corrosion, often only establish continuity long enough for the unit to pass testing requirements. The result is a unit of unknown or compromised reliability that is left to fail again when placed back into use.
In some circuits, excessive connector contact resistance, caused by the accumulation of these insulating films, is critical when conducting low-level signals. To ensure proper testing, a dry-circuit test stimulus of approximately 20-100 mV and currents less than 100 mA should be used. If these limits are exceeded, the insulating film could break down, so that continuity is established during testing but fails when in use. Some continuity testers in use now are applying higher testing currents, often up to two amps and more, to verify other continuity parameters while ignoring tests for dry-circuit electrical intermittency.
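The dry-circuit limits quoted above can be written down as a simple check. The sketch below only encodes the figures given in this paper (roughly 100 mV and 100 mA as upper bounds); the function and constant names are ours, and the example stimulus values are illustrative.

```python
DRY_CIRCUIT_MAX_VOLTS = 0.100  # upper end of the 20-100 mV range given in the text
DRY_CIRCUIT_MAX_AMPS = 0.100   # "less than 100 mA" per the text


def is_dry_circuit_stimulus(volts: float, amps: float) -> bool:
    """True if a test stimulus stays within the dry-circuit limits quoted in the text."""
    return volts <= DRY_CIRCUIT_MAX_VOLTS and amps < DRY_CIRCUIT_MAX_AMPS


# Illustrative stimuli: (description, open-circuit volts, test current in amps).
stimuli = [
    ("50 mV at 10 mA", 0.050, 0.010),
    ("50 mV at 2 A", 0.050, 2.0),
    ("5 V at 10 mA", 5.0, 0.010),
]

for name, volts, amps in stimuli:
    verdict = "within" if is_dry_circuit_stimulus(volts, amps) else "exceeds"
    print(f"{name}: {verdict} dry-circuit limits")
```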
Some cable testers that claim to be able to catch intermittents use high-speed, five-volt digital pulses for stimuli, which are totally inappropriate for dry-circuit testing. These high-speed pulses may also be easily attenuated when applied to long aircraft harnesses with their attendant inter-electrode capacitance and mutual inductances. Lowering the testing stimuli pulse rate does not help much either, as more of the high-speed CND/NFF type intermittent events will be missed. Another problem with digital cable testers is that they will only see an intermittent event when the circuit completely opens. They may totally miss the more numerous ohmic events [6].
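Why a digital tester tends to see only complete opens can be illustrated with a simple voltage-divider model. This is a sketch under stated assumptions: the receiver is treated as a purely resistive load with a fixed logic threshold, and the threshold and load values are hypothetical.

```python
SOURCE_VOLTS = 5.0             # five-volt digital stimulus, as described in the text
LOGIC_THRESHOLD_VOLTS = 2.0    # hypothetical receiver threshold
LOAD_OHMS = 10_000.0           # hypothetical receiver input resistance


def received_volts(fault_ohms: float) -> float:
    """Voltage at the receiver when a series fault resistance appears in the line."""
    return SOURCE_VOLTS * LOAD_OHMS / (LOAD_OHMS + fault_ohms)


def digital_tester_flags(fault_ohms: float) -> bool:
    """True only if the pulse drops below the logic threshold."""
    return received_volts(fault_ohms) < LOGIC_THRESHOLD_VOLTS


for fault in (1.0, 100.0, 1_000.0, 1e9):
    print(f"{fault:>14,.0f} ohm fault: {received_volts(fault):.3f} V at receiver, "
          f"flagged={digital_tester_flags(fault)}")
```

With these assumed values, a momentary series resistance of one, one hundred, or even one thousand ohms still reads as a valid logic high; only something close to a complete open is flagged.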
On some critical systems, environmental stress chambers are used to vary the operating temperature and vibrate the UUT. This applied stress is used to increase the probability of triggering a latent intermittent event during the testing window. However, if the proper test equipment is not used to monitor the UUT during stress testing, the probability of detecting an intermittent failure may still be too low to warrant the extra time and costs involved in duplicating operational conditions.
Applying environmental stress has other problems. Generally, connecting the UUT to the ATE is more difficult, diagnostic probing is often impossible, and the UUT may be damaged or weakened during the process. Sometimes the test cables themselves may introduce new problems in some circuits due to increased resistance and other impedance changes.
When you consider that a lot of circuit continuity components such as connectors, relays, and switches are intermittent by function and design, that all are naturally getting worse over time, and that there are many more of them than all the other circuit components combined, coupled with the difficulty of testing their electro-mechanical properties, it should be no mystery why a CND/NFF/Aging problem exists.
The worst of the constantly developing intermittent problems are caught when they reach a detectable point in their natural life cycle, while the intermittents responsible for CND/NFF/Reliability problems are not. We may believe we are seeing all of the intermittents while testing, but in reality it is often only a small percentage of those present.
If the detection and repair of intermittent defects is important to your operation, or is critical, as in the case of multi-level aircraft maintenance or high-cost space hardware, then a special type of tester must be used.
A conventional tester measures statically for continuity via digital measurement techniques and then converts this data to engineering units so the measured values can be compared against previously stored limits of acceptability. An NFF tester, by contrast, must be designed from the ground up to look for random discontinuities that may appear as anomalous events in both time and place. The measurement technology involved in these two types of “connectivity” testers is radically different.
The detection of age-related intermittents in chassis boxes, card cages, connectors, motherboards, wire harnesses, etc., requires this special type of tester. This tester must monitor all the connections simultaneously and continuously while looking for any discontinuities. It must use low DC voltages and currents for stimuli, yet be extremely sensitive to small ohmic changes in both magnitude and duration. It must be able to perform in areas with high background noise, such as a repair shop or an aircraft avionics bay, and with no special shielding, no filtering, and no false events.
Walter Shawlee II, in his column for Avionics Magazine, “How Parts and Systems Age” said: “A key aspect of these interconnect and wire related failures is that they often defy detection by the traditional one-path-at-a-time sequential mode of analysis. This not only fails to spot the problem under vibration (a time-dependent failure), but also ignores many combinatorial faults that occur between wires and other surfaces on an erratic basis. Only massively parallel and true analog analysis can even hope to detect and correctly identify this problem”7.
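The limitation of one-path-at-a-time testing can be estimated with a simple geometric model. The sketch below assumes a scanner that dwells for a fixed time on each line in turn and an intermittent event that starts at a uniformly random moment; the model, the function name, and the example numbers are assumptions made for illustration and do not describe any specific tester.

```python
def scan_detection_probability(event_s: float, dwell_s: float, n_lines: int) -> float:
    """Chance that a sequential scanner is on the right line during a random event.

    The scanner revisits a given line every n_lines * dwell_s seconds. An event
    of length event_s overlaps that line's dwell window with probability
    (dwell_s + event_s) / period, capped at 1.
    """
    period_s = n_lines * dwell_s
    return min(1.0, (dwell_s + event_s) / period_s)


# Assumed example: a 100 ns event, 1 ms dwell per line.
for lines in (1, 16, 256):
    p = scan_detection_probability(100e-9, 1e-3, lines)
    print(f"{lines:>4} lines: {p:.4%}")
```

With a single line the model reduces to continuous monitoring and the probability is 1. As the number of scanned lines grows, the chance of observing a short event on any one line falls roughly as one over the line count, which is the motivation for monitoring all lines in parallel.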
THE DIRECT TESTING SOLUTION
Universal Synaptics has developed and recently patented the very tester Mr. Shawlee describes. It was designed from the ground up to accurately detect and isolate those age-related intermittent problems that would normally result in another NFF statistic. It was built using a unique mix of analog and digital technology, modeled to emulate a proprietary software Neural Network but implemented in hardware so that it operates in real time. It meets all the special requirements for detecting randomly occurring anomalies, something virtually no other tester is able to do.
The IFD-2000 Intermittent Fault Detector has the unique ability to simultaneously and continuously monitor hundreds or even thousands of test lines, with high sensitivity for intermittent defects, with absolutely no scanning of the test points, and no digital sampling. Because there is no scanning and no digital measurement with which to contend, it has no limiting test rate, which of course results in no missed intermittent events. Its low-level, programmable 0-3.5 Volt stimulus allows for safe in-situ testing of critical circuits, and its very high sensitivity reduces the need for high levels of damaging vibrational stress.
The IFD-2000 is a computer-operated tester/analyzer. It employs a proprietary, “massively parallel” analog hardware neural network to perform real-time data reduction via sensor-fusion techniques, together with digital circuitry for high-speed latching of events and for computer interfacing. The basic unit monitors up to 256 single-ended lines or 512 double-ended connections simultaneously. It can be expanded to include multiple 256-input sensor modules with absolutely no reduction in basic measurement capability.
The IFD-2000 has two main operational functions: Intermittent Fault Detection and Signature Analysis. The most commonly used is the Intermittent Fault Detection function. This function tests a unit for randomly occurring, real-time intermittence in any of the connected input lines. Two user-selectable sub-modes, "Automatic" and "Program", determine basic operation.
The Automatic mode uses a heuristic reasoner to monitor the test progress and adjust stimulus and sensitivity levels as required for optimum performance. In the Program mode those values are set by the operator and remain fixed throughout the test.
Both the Automatic and Program modes perform their tasks by providing a programmed stimulus source to all connected and programmed lines. Through sensor-fusion techniques, data from all lines feed into monitoring and decoding circuitry. When an intermittent event occurs, the address of the active sensor is passed to output-latching circuitry. The active address is then imported into the computer, and a score is incremented for each active address.
The accumulated scores for each address or test point are used to drive a real-time graphics representation of the test progress and results. At each detected event, the screen display is updated and a computer-synthesized voice describes the location of the intermittence in UUT terminology. At the end of testing, a tabular display or printout is available.
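The host-side bookkeeping described above, in which a score is incremented for each latched address and a table is produced at the end of the test, might look something like the following sketch. It is not the IFD-2000 software; the function names and the example event stream are hypothetical.

```python
from collections import Counter


def accumulate_scores(event_addresses):
    """Tally latched events per sensor address, one increment per event."""
    scores = Counter()
    for address in event_addresses:
        scores[address] += 1
    return scores


def tabular_report(scores):
    """Return report lines, most active address first."""
    lines = ["address  events"]
    for address, count in scores.most_common():
        lines.append(f"{address:>7}  {count:>6}")
    return lines


# Hypothetical stream of latched sensor addresses from one test run.
events = [17, 3, 17, 200, 17, 3]
scores = accumulate_scores(events)
print("\n".join(tabular_report(scores)))
```

Ranking addresses by score points the operator at the most active connection first; in this made-up run that is address 17 with three events.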
The second main operational function performs Signature Analysis on all or selected lines of the UUT. Two sub-modes, SCOPE and SWEEP, are employed in a more traditional measurement function to collect data that can be used for real-time monitoring and for comparison with previously stored signature data. In the SCOPE sub-mode, a series of fixed pulses is applied to the UUT in parallel, up to 16 lines at a time. In the SWEEP sub-mode, a proprietary wide-band stimulus is applied so that its RF component triggers a response from any reactive components. In both sub-modes, the response or interaction of the UUT with the neural network is monitored and collected. This data is then passed to the computer's display device for scope-like presentation and manipulation. Alternatively, the data can be used with the IFD-2000’s software neural network for training, continuity verification, and signature-based diagnostics.
In operation, the IFD-2000 delivers positive evidence of an intermittent. It provides a continuous and simultaneous analog testing method, which more closely matches the operation of the UUT in the real world where the original intermittent event occurred. Not only does it detect an intermittent problem, it also inherently diagnoses the problem in the process. Small intermittents are not "swallowed" by ATE filtering; they are reported and can be fixed immediately. Today's undetected intermittents will not become tomorrow's system failures.
Because of the IFD-2000's high immunity to externally induced background noise and ground loop problems, it is extremely sensitive to changes in static conditions and can detect anomalies at a much lower level than other test devices in a shop or system environment.
The internal operation is completely programmable, allowing the operator to set stimulus and sensitivity levels, as well as the number of lines to be tested. The IFD-2000 provides two digital outputs, one representing neural network activity, the other representing the address of the intermittent connection.
Portable versions of the IFD-2000 are small in size and lightweight enough to be used in aircraft cockpits or other space limited environments where cables and connectors are used. This capability allows for intermittent testing where the failures are occurring. Because of its high input impedance, excellent in-situ capabilities and parallel testing characteristics similar to in-flight operations, this tester could also be a good candidate for permanent on-board wire testing in legacy and future aircraft, with little additional development effort.
INSTALLATION AND OPERATION
The IFD-2000 is extremely easy to set up and operate. To set up, simply connect lines from the IFD-2000's sensors to the circuits to be tested. Multiple UUT connections may be tested using a single sensor by daisy-chaining the UUT circuits serially. This gives an unexpanded, 256-input model the capability to test hundreds or thousands of circuits simultaneously with, of course, diminished addressing capability.
After connecting the UUT to the IFD-2000, a connection list or software Map file is then developed by describing which UUT circuit was connected to which sensor. This Map file drives the computer and voice display to show and tell the operator which circuits are failing. Once the IFD-2000 has been connected to a UUT, the rest of the operation is totally menu-driven. Either automatic or manual testing can be performed.
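A connection list of this kind is essentially a lookup from sensor address to UUT circuit names. The sketch below shows one possible in-memory form; the actual Map file format is not described here, so the structure, the circuit names, and the message wording are all hypothetical. It also shows why daisy-chaining several circuits onto one sensor diminishes addressing: the event can only be narrowed to the chained group.

```python
# Hypothetical map: sensor address -> UUT circuits wired to that sensor.
SENSOR_MAP = {
    0: ["J1-1 to P3-1"],
    1: ["J1-2 to P3-2"],
    2: ["J2-1 to P4-1", "J2-2 to P4-2", "J2-3 to P4-3"],  # daisy-chained
}


def describe_event(address, sensor_map):
    """Translate a latched sensor address into UUT terminology."""
    circuits = sensor_map.get(address)
    if circuits is None:
        return f"Intermittent on unmapped sensor {address}"
    if len(circuits) == 1:
        return f"Intermittent on {circuits[0]}"
    joined = ", ".join(circuits)
    return f"Intermittent on one of {len(circuits)} chained circuits: {joined}"


print(describe_event(1, SENSOR_MAP))
print(describe_event(2, SENSOR_MAP))
```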
In INTERMITTENT mode, the IFD-2000 applies a steady, programmed, low level excitation voltage to all of the connected lines while the tester monitors these lines for any changes in connectivity. As the test proceeds, the operator typically applies a mild vibrational and/or thermal stimulus to a section of the UUT. If the slightest connectivity change occurs, the IFD-2000 notifies the operator about the intermittent and its circuit location or other description by synthesized voice and graphics display.
As one way to effectively tell exactly where the wire or circuit topology is intermittent, the operator may wish to apply the same mild vibration/thermal stimulus to a different physical location on the UUT and observe the response. By selectively moving the physical stimulus, the defect is isolated to the approximate failure location. If wire type and length are known before testing, future software upgrades will measure and calculate the distance to a failure site, similar to other pulse based technologies. At the end of testing, a summary report that details the results of the test is generated.
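For reference, pulse-based techniques generally estimate distance from the round-trip time of a reflected pulse and the propagation velocity of the wire type. The sketch below shows that generic calculation only. It is not a description of the IFD-2000 or of its planned upgrade, and the function name and example values are assumptions.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_to_fault_m(round_trip_s: float, velocity_factor: float) -> float:
    """Generic pulse-reflection estimate: distance = VF * c * t / 2.

    velocity_factor is the fraction of the speed of light at which a signal
    travels in the given wire type (a property of its insulation and geometry).
    """
    if not 0.0 < velocity_factor <= 1.0:
        raise ValueError("velocity_factor must be in the range (0, 1]")
    if round_trip_s < 0.0:
        raise ValueError("round_trip_s must be non-negative")
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


# Assumed example: 100 ns round trip on wire with a velocity factor of 0.7.
print(f"{distance_to_fault_m(100e-9, 0.7):.2f} m")
```

The division by two accounts for the pulse travelling to the fault and back, which is why the wire type must be known before a distance can be calculated.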
ADDITIONAL TESTING CAPABILITY
In addition to the main testing modes described, the IFD-2000 has a Dry-Circuit test mode where the DC stimulus is swept through its programmed ranges while monitoring the UUT for any connectivity dropouts. These dropouts may occur due to the inability of the stimulus to “punch-through” any insulating films that may be on the tested component’s contacting surfaces.
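The idea behind such a sweep can be illustrated with a deliberately simple contact model. In the sketch below, a contact with an insulating film is assumed to conduct only once the stimulus reaches a film breakdown voltage; this model, the function name, and the voltages used are illustrative assumptions rather than a description of the instrument's internals.

```python
def sweep_for_dropouts(stimulus_volts, film_breakdown_volts):
    """Return the stimulus levels at which the modeled contact shows no continuity.

    A clean contact is modeled with a breakdown voltage of 0.0, so it conducts
    at every stimulus level and produces no dropouts.
    """
    return [v for v in stimulus_volts if v < film_breakdown_volts]


# Assumed sweep: 20 mV to 100 mV in 20 mV steps (values in volts).
sweep = [0.02, 0.04, 0.06, 0.08, 0.10]

print(sweep_for_dropouts(sweep, 0.0))   # clean contact  -> []
print(sweep_for_dropouts(sweep, 0.07))  # filmed contact -> [0.02, 0.04, 0.06]
```

A tester that applied only the highest stimulus level would see both contacts as good; the dropouts appear only at the low end of the sweep.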
Because of the IFD-2000’s high sensitivity, the tester may also be used very effectively as a detector for insulation breaks and weak areas by using a containment method where ionic solutions are painted or flooded on to areas of wire bundles that may be inaccessible or difficult to inspect. Similarly, EMI and cross-talk problems may be analyzed with the appropriate application of an external stimulus. As the IFD-2000 matures, undoubtedly, additional testing modes and capabilities will be discovered and formalized.
A simple, yet significant, study was performed to demonstrate the scope of the intermittent problem in a given unit, and to determine the relative effectiveness of various other types of test equipment in detecting and isolating known intermittent defects on fighter avionics.
The unit selected was an A22 Oven module that is part of an inertial navigation computer. This SRU was selected because of its aging profile, low MTBF, and high defect rates.
Intermittency testing was first performed on a total of 26 A22 Oven modules using the IFD-2000. We found 14 of these 26 modules to have one or more intermittent connections. These 14 intermittent modules were then tagged as “INTERMITTENT” to alert shop technicians. However, the exact cause of the intermittent failure was not listed.
Normal required testing was then performed at both the LRU and SRU diagnostic levels and repeated several times if no defects were found. SRU (ATE) testing passed 12 of the 14 defective units as OK, an inherent NFF rate of 86 percent (12/14) for ATE type testing. The 12 passing units were then sent to the LRU (simulator type testing) shop as part of the normal repair process, where diagnostic testing identified problems with 8 of them. The remaining 4 passed, an inherent NFF rate of 33 percent (4/12).
Note 1: Because the technicians at both levels knew the modules had intermittent defects, they spent additional time testing the units, in some cases hours longer than normal.
Note 2: The SRU and LRU testers are analog test stands/simulators that employ strip chart recorders to catch any anomalies, and each unit was tested for at least 90 minutes.
We believe that had routine testing occurred without alerting the technicians, the number of intermittents discovered would have been considerably lower.
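The percentages reported for this study follow directly from the unit counts given above. The short calculation below reproduces them. The last figure, the fraction of known-intermittent modules that passed both test levels, is derived here from the same counts and is not stated in the study itself.

```python
modules_tested = 26       # A22 Oven modules screened for intermittency
intermittent = 14         # modules found to have one or more intermittents
passed_sru = 12           # intermittent modules that SRU (ATE) testing passed
found_at_lru = 8          # of those 12, modules where LRU testing found a problem

passed_both = passed_sru - found_at_lru       # 4 modules escaped both levels

intermittent_rate = intermittent / modules_tested
sru_nff_rate = passed_sru / intermittent
lru_nff_rate = passed_both / passed_sru
combined_escape_rate = passed_both / intermittent  # derived, not in the study

print(f"Modules with intermittents: {intermittent_rate:.0%}")   # 54%
print(f"SRU (ATE) NFF rate:         {sru_nff_rate:.0%}")        # 86%
print(f"LRU NFF rate:               {lru_nff_rate:.0%}")        # 33%
print(f"Passed both test levels:    {combined_escape_rate:.0%}")  # 29%
```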
Present testing methods such as functional and continuity testing, which use test point scanning and measurement sampling to collect certain engineering parameters, are useful for diagnosing and fixing the “hard” or permanent failures, and for verifying that a “good” unit meets certain prescribed engineering standards of performance. However, these technologies do very little for the often more numerous and more serious age-related, randomly occurring intermittent failures.
Even though test equipment now exists that can easily find and eliminate these latent age-related defects, decades of testing without the benefit of this new technology have created an erroneous mindset among testing engineers and others that “system wear out” and “long term reliability” are problems for design engineers only. They seemingly reject the notion that testing directly for intermittent age-induced failures could play a major role in reversing the “aging” process.
This obvious testing omission concerning expected and known intermittent defects creates an interesting situation that requires immediate attention. With no standards, some would argue there is no requirement to test, yet it is equally arguable that with no testing, there is no proof of the exercise of responsibility towards testing professionalism, consumer expectations, and passenger or product safety.
We believe that intermittency testing is a necessary and extremely cost effective way to make major improvements in systems reliability, to greatly reduce life cycle costs and to protect the enterprise from unnecessary risk.
Intermittency testing does not require a “melt-down” of established testing programs or philosophies. It only requires the addition of yet another testing tool to augment the huge arsenal of testing devices already in place. We freely acknowledge that on-board testing of aircraft wiring presents some challenges due to the scope of the problem and the aircraft’s complex geometry; however, challenges have accompanied every worthwhile endeavor. These challenges, of course, would diminish greatly if the IFD-2000 were used extensively in the LRU testing process and/or permanently wired into the aircraft.
While the IFD-2000 is only a small addition to the arsenal of testing tools, it is huge in scope and potential in that it attacks and resolves fully half or more of all reported defects in aging systems, and it is the half that all the other testing has been missing.
Portions of this paper have been updated since its original inception in 1994 to include new developments and to answer questions not fully explored in the original version. 8-14-2001
1. Jochen Horn, Fritz Kourimsky, Kurt Baderschneider, Harald Lutsch, AMP Deutschland GmbH, “Avoiding Fretting Corrosion by Design”, AMP Journal of Technology, Vol. 4, June 1995.
2. Piet van Dijk, Frank van Meijl, AMP Automotive Development Centre, “Contact Problems Due to Fretting and Their Solutions”, AMP Journal of Technology, Vol. 5, June 1996.
3. Correspondence with Peter Fussinger, Chairman, AMC. On file.
4. Jochen Horn, Kurt Baderschneider, Bernd Lippmann, AMP Deutschland GmbH, “A New Criterion for Dynamic Reliability of Contacts”, AMP Journal of Technology, Vol. 2, November 1992.
5. Gary L. Gemas, “Aircraft Avionic System Maintenance Cannot Duplicate and Retest OK Analytical Source Analysis”, Master's Thesis, Air Force Institute of Technology (DTIC), WPAFB, OH, September 1983.
6. Steven Dunwoody, Edward Bock, John Sofia, “A Practical and Reliable Method for Detection of Nanosecond Intermittency”, AMP Journal of Technology, Vol. 5, June 1996.
7. Walter Shawlee II, “How Parts and Systems Age”, Avionics System Design, Avionics Magazine, November 2000.