Memory Testing and Self-Repair

Mária Fischerová, Elena Gramatová

Institute of Informatics of the Slovak Academy of Sciences, Slovakia

Abstract

Memories are very dense structures, and therefore the probability of defects is higher than in logic and analogue blocks, which are not so densely laid out. Thus, embedded memories, as the largest components of a typical SoC (up to 90 % of the chip area), dominate the yield of the chip. As fabrication process technology progresses, the total capacity of memory bits increases, which will increase the area invested in built-in self-test, built-in repair and diagnostic circuitry. Many test and repair techniques are used in industry, but research results continue to offer new methods and algorithms for improving the quality of digital systems testing. The purpose of this chapter is to give a summary view of static and dynamic fault models, effective test algorithms for memory fault (defect) detection and localization, built-in self-test, and a classification of advanced built-in self-repair techniques supported by different types of repair allocation algorithms.

INTRODUCTION

Large and embedded memories are designed with more aggressive rules than those for the logic on the chip, so memory defect densities are typically twice that of logic. Memories are the dominant blocks of current and future SoCs (80-90 % of a standard chip), and therefore testing and repairing them is the key to achieving acceptable SoC yield. A post-manufacturing repair of faulty memory blocks using redundant rows and/or columns contributes to higher SoC manufacturing quality and can substantially increase the memory yield. Spare elements (redundant rows and columns) have become an integral part of most embedded and commodity memories. The faulty cells can be repaired immediately after manufacturing, while spares that remain unused can serve as replacements during the lifetime of the system on chip, making it more fault-tolerant. The repair process, based on the test outcomes and on the repair analysis, is carried out by external devices or by internal blocks integrated directly on the chip.

In memory testing, march-type algorithms are mostly employed due to their linear complexity and high fault coverage. The march algorithms are scalable and flexible for different types and sizes of memories, as well as for various fault types and for diagnosis and test application time requirements. A march algorithm is composed of test elements (an element consists of write and read operations over memory cells) and is applied using external automatic test equipment (ATE) or a built-in self-test architecture linked to the memory. The test results are the input to a repair allocation algorithm and are processed either after the whole test has finished or immediately, as soon as a fault is localized. Established repair allocation algorithms are based on the analysis of a global (or local) failure bitmap created during the testing process, so the spare allocation has to be done after the whole memory test has finished. The main feature of the advanced repair allocation algorithms is that they aim to find a replacement solution without any failure bitmap, which would have to store too much information as memory capacities continue to grow.

The result of any repair allocation algorithm is a relationship between addresses of the faulty memory elements and the spare elements available for repairing the diagnosed faults. Using ATE is becoming inefficient for large memories; therefore it is very important to integrate the built-in repair analysis together with a repair technique into the chip.

The built-in self-repair architectures can be designed to operate in an interleaved mode with any built-in self-test architecture. Finding an optimal solution of an ordered sequence of rows and columns by the repair analysis for large memories is an NP-complete problem (Kuo & Fuchs, 1987); therefore the repair analysis is the critical challenge for built-in self-repair and, in turn, for the design of fault-tolerant systems on chip. The crucial parameters in the search for an optimal built-in self-repair technique are the hardware overhead on the chip and the test time.
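To make the repair allocation problem concrete, the following minimal Python sketch searches exhaustively for a set of spare rows and columns that covers every faulty cell of a small failure bitmap. It is an illustration only, not one of the algorithms surveyed in this chapter: the function name `find_repair`, the sparse bitmap representation and the example fault list are assumptions made here. The search is exponential in the number of faulty rows, which hints at why exact repair analysis does not scale to large memories.

```python
from itertools import combinations


def find_repair(faults, spare_rows, spare_cols):
    """Return (rows, cols) to replace so that every faulty cell is covered,
    or None if the spares are insufficient.

    faults: iterable of (row, col) addresses of faulty cells, i.e. a failure
    bitmap stored in sparse form (hypothetical representation).
    """
    faults = set(faults)
    faulty_rows = sorted({r for r, _ in faults})
    # Try every choice of at most `spare_rows` faulty rows, fewest rows first.
    for k in range(min(spare_rows, len(faulty_rows)) + 1):
        for rows in combinations(faulty_rows, k):
            # Faults not covered by the chosen rows must be repaired by columns.
            cols = sorted({c for r, c in faults if r not in rows})
            if len(cols) <= spare_cols:
                return list(rows), cols
    return None


# Hypothetical failure bitmap: three faults in row 0, two more in column 1.
example_faults = [(0, 1), (0, 2), (0, 3), (2, 1), (4, 1)]
print(find_repair(example_faults, spare_rows=1, spare_cols=1))  # ([0], [1])
print(find_repair(example_faults, spare_rows=0, spare_cols=1))  # None
```

The sketch returns the first feasible assignment it finds (the one using the fewest spare rows); it makes no attempt to minimise the total number of spares used.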

According to (ITRS, 2009), the embedded memory test, repair and diagnostic logic size was up to 35K gates per million memory bits in 2009. This figure covers built-in self-test, built-in redundancy analysis and built-in self-repair logic, but does not include the repair programming devices such as optical or electrical fuses.

The chapter gives a short overview of the state of the art in fault models, test algorithms and current built-in self-test architectures used in memory testing. Read-write memories are the main type used in SoCs; therefore the chapter concentrates on these memories. The main aim is to present advanced built-in self-repair techniques based on two-dimensional redundancy and built-in repair-analysis algorithms suitable for the design of fault-tolerant SoCs, both those that use local failure bitmaps and those working without stored failure bitmaps.

Memory fault models, test and BIST

Fault Models

The spot defects introduced in the manufacturing process cause opens (representing excessive resistances within a connection that is supposed to conduct perfectly), shorts (representing undesired resistive paths between a node and a power supply or ground) and bridges (representing unwanted resistive paths between two signal lines) in a memory product. Defect analysis is essential for establishing realistic memory fault models, since the faulty behaviour differs between memory locations and depends on the memory layout as well as on environment parameters (e.g. temperature, voltage and speed). The defects injected into a memory circuit are simulated and modelled at the electrical level to identify the faulty behaviour exhibited for each defect and to derive the corresponding fault models (Hamdioui & van de Goor, 2000, Al-Ars & van de Goor, 2003, Huang, Chou & Wu, 2003).

The memory faults are divided into three types:

  • faults occurring in the memory cell array that can be either single cell faults (state faults, transition faults, read disturb faults, etc.) or coupling faults (state coupling faults, transition coupling faults, disturb coupling faults, etc.),
  • faults occurring in the address decoder (no access faults, multiple-address faults, activation delay faults, etc.),
  • faults occurring in the rest of the memory circuit (sense amplifier faults, pre-charge circuit faults, bit slow write driver fault, line imbalance fault, data retention fault, etc.), which may be modelled as stuck-at and bridging faults and are considered to be covered by any test for memory cell array faults (Ney, Girard, Landrault, Pravossoudovitch, Virazel & Bastian, 2007, Al-Ars, Hamdioui & Gaydadjiev, 2007).

The basis of any memory is a single cell in which data (logical 0 or 1) is stored. A difference between the expected and the observed value stored in the cell is considered faulty memory behaviour. In (van de Goor & Al-Ars, 2000, Benso, Bosio, Di Carlo, Di Natale & Prinetto, 2008) a fault primitive (FP) notation is defined to describe the resulting faulty memory behaviour. An FP describes a certain fault by specifying the fault-sensitising operation sequence (S), the corresponding value or behaviour of the faulty cell (F) and the output level of a read operation (R) of S (if the sequence involves a read operation):

FP <S / F / R>.

The concept of FPs allows all possible types of faulty behaviour to be derived for operation sequences consisting of write logical 0/1 (w0/w1) and read logical 0/1 (r0/r1) operations in the memory. FPs can be further grouped into different functional fault model types according to the number of operations performed sequentially in S and to the number of different memory cells that are initialised or accessed by the operation(s). The static faults (Table 1) are sensitised by at most one operation (van de Goor & Al-Ars, 2000, Al-Ars & van de Goor, 2003), while the dynamic faults (Table 2) require more than one operation to be performed sequentially (on the victim cell and/or aggressor cell) in order to sensitise the fault in the victim cell. The faulty behaviour is observed in the victim cell, while the aggressor cell contributes to the fault (Hamdioui, Al-Ars, van de Goor & Rodgers, 2003, Al-Ars & van de Goor, 2003, Dilillo, Girard, Pravossoudovitch, Virazel, Borri, & Hage-Hassan, 2004).
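The grouping rule described above (number of operations in S, number of cells involved) can be sketched in a few lines of Python. This is only an illustrative reading of the FP notation; the class `FaultPrimitive`, the helper `parse_fp` and the textual input format are hypothetical names introduced here and are not part of the cited works.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultPrimitive:
    s: str  # sensitising sequence, e.g. "0w1", or "Sa; Sv" for two cells
    f: str  # value or behaviour of the faulty cell
    r: str  # read output, or "-" if S contains no read operation

    def num_operations(self) -> int:
        # Count the write/read operations (w0, w1, r0, r1) in S.
        return len(re.findall(r"[wr][01]", self.s))

    def num_cells(self) -> int:
        # A ';' separates the aggressor part Sa from the victim part Sv.
        return 2 if ";" in self.s else 1

    def kind(self) -> str:
        # Static: at most one operation; dynamic: more than one.
        return "static" if self.num_operations() <= 1 else "dynamic"


def parse_fp(text: str) -> FaultPrimitive:
    """Parse a string such as '< 0w1 / 0 / - >' into its S, F and R parts."""
    s, f, r = (part.strip() for part in text.strip("<> ").split("/"))
    return FaultPrimitive(s, f, r)


for text in ["< 0w1 / 0 / - >", "< 0; 0 / 1 / - >", "< 0w1r1 / 0 / 0 >"]:
    fp = parse_fp(text)
    print(text, "->", fp.kind(), f"{fp.num_cells()}-cell")
```

For the three inputs shown, the loop reports a static single-cell fault, a static two-cell fault and a dynamic single-cell fault, respectively.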

Table 1. Static fault models

Static fault models (no operation or one operation S is performed on a faulty cell)

Single-cell faults
Fault primitive FP <S / F / R> | Functional fault model
< 0 / 1 / - > ; < 1 / 0 / - > | state fault (SF0; SF1)
< 0w0 / 1 / - > ; < 1w1 / 0 / - > | write disturb fault (WDF0; WDF1)
< 0w1 / 0 / - > ; < 1w0 / 1 / - > | transition fault (TF↑; TF↓)
< 0r0 / 0 / 1 > ; < 1r1 / 1 / 0 > | incorrect read fault (IRF0; IRF1)
< 0r0 / 1 / 1 > ; < 1r1 / 0 / 0 > | read disturb fault (RDF0; RDF1)
< 0r0 / 1 / 0 > ; < 1r1 / 0 / 1 > | deceptive read disturb fault (DRDF0; DRDF1)
SF1, TF↑, WDF1 | stuck-at-0 fault (SAF1)
TF, IRF | stuck open fault (SOF)

Two-cell faults
Fault primitive FP <Sa; Sv / F / R>, with Sa, Sv ∈ {0, 1, 0w0, 0w1, 1w0, 1w1, 0r0, 1r1} (operation sequence on the aggressor (a) or victim (v) cell) | Functional fault model
e.g. < 0; 0 / 1 / - > (4 possible types) | state coupling fault CFst
e.g. < 0w1; 0 / 1 / - > or < 1r1; 1 / 0 / - > (12 possible types) | disturb coupling fault CFds
e.g. < 0; 1w0 / 1 / - > (4 possible types) | transition coupling fault CFtr
e.g. < 1; 1w1 / 0 / - > (4 possible types) | write disturb coupling fault CFwd
e.g. < 1; 0r0 / 1 / 1 > (4 possible types) | read disturb coupling fault CFrd
e.g. < 0; 0r0 / 0 / 1 > (4 possible types) | incorrect read coupling fault CFir
e.g. < 0; 0r0 / 1 / 0 > (4 possible types) | deceptive read disturb coupling fault CFdrd

The detection of dynamic faults strongly depends on the stresses (i.e. environmental conditions such as temperature, voltage and speed, as well as the address order, address direction and data background used) and on the operation sequences; experimental analyses of new memory technologies have shown that dynamic faults can occur in the absence of static faults (Hamdioui, Al-Ars, van de Goor & Wadsworth, 2005).

Table 2. Dynamic fault models

Dynamic fault models (more than one operation is applied to sensitise a fault)

Single-cell faults (two operations applied sequentially to a single cell)
Fault primitive < S / F / R >, with S = xwyry (only these S are considered), x, y ∈ {0, 1} | Functional fault model
e.g. < 0w1r1 / 0 / 0 > (4 possible types) | dynamic read disturb fault dRDF
e.g. < 0w1r1 / 1 / 0 > (4 possible types) | dynamic incorrect read fault dIRF
e.g. < 0w1r1 / 0 / 1 > (4 possible types) | dynamic deceptive read disturb fault dDRDF

Two-cell faults (two operations applied sequentially to two cells, a and v)
Fault primitive < S / F / R >, with Saa = Sa; Sv (Sa = xwyry; Sv describes the state of v) or Svv = Sa; Sv (Sa describes the state of a; Sv = xwyry); ↑ (↓) means up (down) transition | Functional fault model
e.g. < 0w1r1; 0 / ↑ / - > (8 possible types) | dynamic disturb coupling fault dCFds
e.g. < 0; 0w1r1 / ↓ / 0 > (8 possible types) | dynamic read disturb coupling fault dCFrd
e.g. < 1; 0w1r1 / 1 / 0 > (8 possible types) | dynamic incorrect read coupling fault dCFir
e.g. < 0; 0w1r1 / ↓ / 1 > (8 possible types) | dynamic deceptive read disturb coupling fault dCFdrd

A memory decoder circuit selects a specific row (through the appropriate word line) and a specific column (the appropriate bit line is connected to the sense amplifier circuit) according to the required cell address. Failures in the address decoder circuit lead to static address decoder faults (AF): no memory cell is accessed with a certain address (the cell is physically inaccessible due to a stuck-at fault in the address logic or an open in the path to the memory location); no address can access a certain cell (due to a stuck-at fault); more than one cell is accessed simultaneously with a single address; or several different addresses access the same memory cell (due to a short in the path to the memory locations). An effective test must ensure that each read or write operation accesses exactly one memory cell, the one selected by the given address (van de Goor, 1998). Excessive delays in the address decoder paths, mainly caused by opens, result in dynamic address decoder delay faults (ADF). An activation delay fault or a deactivation delay fault on the relevant word line (depending on the location of the open) may lead to incorrect operations at the selected addresses (Hamdioui, Al-Ars & van de Goor, 2006).
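The four static address decoder fault situations listed above can be illustrated with a small Python sketch that inspects an address-to-cell mapping. The function `classify_decoder`, the dictionary-based decoder model and the example mapping are assumptions made for illustration; a real decoder is of course not observable in this way and has to be exercised by a test.

```python
def classify_decoder(mapping, all_cells):
    """Report static address decoder fault situations.

    mapping: dict address -> set of cells actually accessed by that address
    (hypothetical behavioural model of a possibly faulty decoder).
    all_cells: iterable of all cells present in the array.
    """
    accessed_by = {cell: set() for cell in all_cells}
    for addr, cells in mapping.items():
        for cell in cells:
            accessed_by.setdefault(cell, set()).add(addr)
    return {
        # An address that selects no cell at all.
        "address_without_cell": sorted(a for a, c in mapping.items() if not c),
        # A cell that no address can reach.
        "cell_without_address": sorted(c for c, a in accessed_by.items() if not a),
        # One address selecting several cells simultaneously.
        "address_with_multiple_cells": sorted(
            a for a, c in mapping.items() if len(c) > 1),
        # Several addresses selecting the same cell.
        "cell_with_multiple_addresses": sorted(
            c for c, a in accessed_by.items() if len(a) > 1),
    }


# Fault-free decoder: address i selects exactly cell i.
good = {a: {a} for a in range(4)}
# Faulty decoder: address 1 selects nothing, address 2 selects cells 2 and 3,
# address 3 selects cell 2 instead of cell 3.
bad = {0: {0}, 1: set(), 2: {2, 3}, 3: {2}}

print(classify_decoder(good, range(4)))
print(classify_decoder(bad, range(4)))
```

For the fault-free mapping all four lists are empty; for the faulty mapping each of the four situations is reported once.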

Another class of faults, consisting of two or more simple faults, is called linked faults (LF). In a linked fault the behaviour of one fault influences the behaviour of another in such a way that fault masking can occur; therefore testing them is more complicated than testing simple faults (Hamdioui, Al-Ars, van de Goor & Rodgers, 2004, Benso, Bosio, Di Carlo, Di Natale & Prinetto, 2006, Harutunyan, Vardanian & Zorian, 2006).

Knowledge of the precise set of faults is requisite for developing an optimal memory test with high fault coverage and low test time.

Test Algorithms

A memory test can be designed based on the assumption that every memory cell must be able to store both logical values 0 and 1 and return the stored data when it is read. Thus, the memory test is a sequence of write and read operations applied to each cell of a memory cell array, although test patterns are needed to test not only the memory cell array but also the peripheral circuitry around the cells. Memory testing is a defect-based and algorithmic procedure (Adams, 2003). The test algorithms are characterised by the test length, i.e. the number of test cycles needed to apply the test, which can easily be obtained by counting the number of memory read and write operations applied to the memory cells.

Due to the sizes of currently used memories, only tests with complexity (test length) directly proportional to N (N is the number of memory addresses) are of practical use; among them, march tests proposed for the defined fault models have become the dominant type. A march test algorithm consists of a sequence of march test elements. A march test element is a sequence of write logical 0/logical 1 (w0, w1) and read expected logical 0/logical 1 (r0, r1) operations applied consecutively to all cells in the memory array. All operations of a march element have to be performed at one address before proceeding to the next address. The addresses of the cells are traversed either in increasing (⇑) or decreasing (⇓) memory address order, or the address order is irrelevant (⇕). After one march element has been applied to each cell in the memory array, the next march element of the march test algorithm is taken. It is required that the increasing and decreasing address orders are always the exact inverse of each other within one march test algorithm (van de Goor, 1998). The test length of a march test algorithm is linear in the number of memory address locations, as it is defined as the number of write and read operations in all march elements multiplied by the number of memory address locations (or directly by the number of memory cells if the memory width is one bit). Selected march algorithms and their coverage of the static and dynamic faults are summarised in Table 3 and Table 4, respectively (van de Goor, 1998, van de Goor & Al-Ars, 2000, Adams, 2003, Hamdioui, Van de Goor & Rogers, 2003, Dilillo, Girard, Pravossoudovitch, Virazel, Borri & Hage-Hassan, 2004, Hamdioui, Al-Ars, van de Goor & Rodgers, 2004, Azimane, Majhi, Eichenberger & Ruiz, 2005, Harutunyan, Vardanian & Zorian, 2006, Harutunyan, Vardanian & Zorian, 2007, Dilillo & Al-Hashimi, 2007, van de Goor, Hamdioui, Gaydadjiev & Al-Ars, 2009).
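The way a march test is applied can be sketched with a small behavioural simulation in Python. The memory model, the stuck-at fault injection and all names (`Memory`, `run_march`, `MARCH_C_MINUS`) are assumptions made for illustration. The operation sequences are those of March C- (length 10N); the address orders used below follow the commonly published form of that algorithm and should be read as an assumption of this sketch.

```python
class Memory:
    """N one-bit cells; selected cells may be stuck at a fixed value."""

    def __init__(self, n, stuck_at=None):
        self.cells = [0] * n
        self.stuck_at = stuck_at or {}  # hypothetical fault map: address -> value

    def write(self, addr, value):
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


# Each march element: (address order, operations applied at every address).
MARCH_C_MINUS = [
    ("any", ["w0"]),
    ("up", ["r0", "w1"]),
    ("up", ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("any", ["r0"]),
]


def run_march(memory, n, test):
    """Apply a march test; return (sorted failing addresses, operation count)."""
    failures, ops = set(), 0
    for order, element in test:
        # "down" is the exact inverse of "up"; "any" is run as "up" here.
        addresses = range(n - 1, -1, -1) if order == "down" else range(n)
        for addr in addresses:
            for op in element:  # finish the whole element at this address
                ops += 1
                value = int(op[1])
                if op[0] == "w":
                    memory.write(addr, value)
                elif memory.read(addr) != value:
                    failures.add(addr)
    return sorted(failures), ops


print(run_march(Memory(8), 8, MARCH_C_MINUS))                   # ([], 80)
print(run_march(Memory(8, stuck_at={3: 1}), 8, MARCH_C_MINUS))  # ([3], 80)
```

With N = 8 addresses the simulation performs 80 operations, which matches the 10N test length, and the injected stuck-at-1 cell at address 3 is reported as failing.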

The fault coverage for the considered fault models can be mathematically proven and the test time is linear in the size of the memory, so march tests are also accepted by industry, although the correlation between the fault models and the defects in real chips is not always known. Larger sets of test algorithms targeted at various static and dynamic faults in the memory cell array, address decoders or peripheral circuits have been evaluated industrially by applying them to a huge number of advanced memories (e.g. 131 KB eSRAM in 65 nm technology, 256 KB and 512 eSRAM in 13 nm, 1 MB DRAM). Each test set was applied using algorithmic stresses (specifying the way the test algorithm is performed, e.g. address directions, data background) and non-algorithmic environmental stresses such as supply voltage, temperature and clock frequency (Schanstra & van de Goor, 1999, van de Goor, 2004).

In recent years, march-like algorithms have also been proposed that consist of three phases and are capable of performing not only fault detection but also diagnosis and fault localization (Li, Cheng, Huang & Wu, 2001, Harutunyan, Vardanian & Zorian, 2008). Identifying the failure location in a memory component (memory cell array, write drivers, sense amplifiers, address decoders and pre-charge circuits) for repair purposes (the correct use of redundancies) is more important than the information on the fault type; however, in the case of coupling faults the testing and/or diagnosis algorithm should also be capable of locating the corresponding aggressor cells, not only the faulty victim cells (Thakur, Parekhji & Chandorkar, 2006, Ney, Bosio, Dilillo, Girard, Pravossoudovitch, Virazel & Bastian, 2008).

Table 3. March test algorithms covering static faults

Test | Length | March elements | Static fault coverage
MATS+ | 5N | {(w0); (r0,w1); (r1,w0)} | SAF, RDF, IRF
MATS++ | 6N | {(w0); (r0,w1); (r1,w0,r0)} | SAF, TF, RDF, IRF, AF
March X | 6N | {(w0); (r0,w1); (r1,w0); (r0)} | SAF, TF, AF, some CFs
March Y | 8N | {(w0); (r0,w1,r1); (r1,w0,r0); (r0)} | SAF, TF, AF, some CFs, some linked TFs
March C- | 10N | {(w0); (r0,w1); (r1,w0); (r0,w1); (r1,w0); (r0)} | SAF, TF, RDF, IRF, AF, CFs
IFA-9 | 12N + delays | {(w0); (r0,w1); (r1,w0); (r0,w1); (r1,w0); delay; (r0,w1); delay; (r1)} | SAF, TF, AF, some CFs, DRF
March LR | 14N | {(w0); (r0,w1); (r1,w0,r0,w1); (r1,w0); (r0,w1,r1,w0); (r0)} | also LFs
Marching 1/0 | 14N | {(w0); (r0,w1,r1); (r1,w0,r0); (w1); (r1,w0,r0); (r0,w1,r1)} | SAF, AF
March SR | 14N | {(w0); (r0,w1,r1,w0); (r0,r0); (w1); (r1,w0,r0,w1); (r1,r1)} | SAF, TF, CFs, IRF, RDF, DRDF, SOF
March SRD | 14N + delays | {(w0); (r0,w1,r1,w0); delay; (r0,r0); (w1); (r1,w0,r0,w1); delay; (r1,r1)} | SAF, TF, CFs, DRDF, SOF, DRF
March A | 15N | {(w0); (r0,w1,w0,w1); (r1,w0,w1); (r1,w0,w1,w0); (r0,w1,w0)} | SAF, TF, AF, some CFs, some linked CFs
IFA-13 | 16N + delays | {(w0); (r0,w1,r1); (r1,w0,r0); (r0,w1,r1); (r1,w0,r0); delay; (r0,w1); delay; (r1)} | SAF, TF, IRF, RDF, AF, intra-word CF, SOF, DRF
March B | 17N | {(w0); (r0,w1,r1,w0,r0,w1); (r1,w0,w1); (r1,w0,w1,w0); (r0,w1,w0)} | SAF, TF, SOF, IRF, RDF, AF, some linked TFs
Enhanced March C- | 18N | {(w0); (r0,w1,r1,w1); (r1,w0,r0,w0); (r0,w1,r1,w1); (r1,w0,r0,w0); (r0)} | SAF, TF, AF, CFs, precharge defects
March SS | 22N | {(w0); (r0,r0,w0,r0,w1); (r1,r1,w1,r1,w0); (r0,r0,w0,r0,w1); (r1,r1,w1,r1,w0); (r0)} | all static faults
March G | 23N + delays | {(w0); (r0,w1,r1,w0,r0,w1); (r1,w0,w1); (r1,w0,w1,w0); (r0,w1,w0); (r0); delay; (r0,w1,r1); delay; (r1,w0,r0)} | SOF, DRF
March MSL | 23N | {(w0); (r0,w1,w1,r1,r1,w0); (r0,w0); (r0); (r0,w1); (r1,w0,w0,r0,r0,w1); (r1,w1); (r1); (r1,w0)} | single-, two- and three-cell LFs
March SL | 41N | {(w0); (r0,r0,w1,w1,r1,r1,w0,w0,r0,w1); (r1,r1,w0,w0,r0,r0,w1,w1,r1,w0); (r0,r0,w1,w1,r1,r1,w0,w0,r0,w1); (r1,r1,w0,w0,r0,r0,w1,w1,r1,w0)} | all single-port static LFs
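The length column of Table 3 can be cross-checked by counting the read and write operations in the march elements, since a test with k operations per address has length kN. The short Python sketch below does this for three of the listed algorithms; the helper `march_length` and the string representation of the elements are assumptions made for illustration, and delay elements are deliberately not counted.

```python
import re


def march_length(elements):
    """Number of read/write operations per address, i.e. k in a length of kN."""
    return len(re.findall(r"[rw][01]", elements))


# Element strings copied from Table 3.
TESTS = {
    "MATS+": "{(w0); (r0,w1); (r1,w0)}",
    "March C-": "{(w0); (r0,w1); (r1,w0); (r0,w1); (r1,w0); (r0)}",
    "March SS": "{(w0); (r0,r0,w0,r0,w1); (r1,r1,w1,r1,w0); "
                "(r0,r0,w0,r0,w1); (r1,r1,w1,r1,w0); (r0)}",
}

for name, elements in TESTS.items():
    print(f"{name}: {march_length(elements)}N")
```

The three results, 5N, 10N and 22N, agree with the lengths given in the table.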

Tools and a methodology have been introduced to automatically generate march test algorithms based on the defined memory fault models that are proven to have complete coverage for particular fault models and minimal test length for the given set of faults (Benso, Bosio, Di Carlo, Di Natale & Prinetto, 2005, Benso, Bosio, Di Carlo, Di Natale & Prinetto, 2008).