A Simplified Approach to Fault Tolerant State Machine
Design for Single Event Upsets
Melanie Berg
Principle Design Engineer, Ball Aerospace & Technologies Corp.
1ABSTRACT
As Integrated Circuit (IC) geometries become smaller and core voltages scale down, the probability of incurring system faults increases significantly. Errors occur when charged particles penetrate a memory cell and cross a junction, creating an aberrant charge that changes the state of the bit. Due to the complexity and sensitivity of large designs, fault tolerant schemes are evolving at the gate level.
This paper will address Single Event Upsets (SEUs) within edge-triggered D-Flip-Flops (DFFs) and assumes that the upsets are soft (correctable by the following clock edge … if the DFF is enabled). Solely considering single events is a fair assessment due to the low probability of having multiple errors occur within one clock cycle. Due to the radiation effects in space, the Aerospace industry has always had to design with SEU consideration. As far as gate-level DFF protection is concerned, Triple Mode Redundant (TMR – voting) logic is the most commonly used scheme to combat SEUs. However, TMR can be very area extensive and - in a turbulent environment – may not fully erase the probability of upsets. As a solution, many error-coding techniques have been proposed as a compliment (or replacement) to TMR, however due to their complexity, they are rarely implemented.
A simplified approach to fault tolerant state machine design starting from architectural development through synthesis will also be discussed. Examples of coding schemes that include additional logic for error detection and (in some cases) correction such as One Hot, Sequential, and Hamming will be examined. Due to the fact that users have run into roadblocks with synthesis tools “optimizing” away necessary logic for error handling, special attention will be given to the synthesis tools concerning the necessary techniques involved in producing the correct realization of functionality.
2Fault Tolerance
The definition of Fault Tolerance is the ability to mask or recover from erroneous conditions in a system once an error has been detected.
2.1Determining the Level of Fault Tolerance for Your System
The degree of fault tolerance implementation is defined by your system level requirements… i.e. specifications that clearly state acceptable behavior upon error. The following are some questions that must be answered within the system requirements documentation:
- Does your system only need to detect an error?
- How quickly must the system respond to an error?
- Must your system also correct the error?
- Is the system susceptible to more than one error per clock cycle?
2.2Common Approach: TMR (Triple Mode Redundancy)
Figure 1: TMR Implementation Diagram
Triple Mode Redundancy (TMR) is the most commonly implemented solution of SEU tolerance because it is a very simple solution. One needs to be cautious when implementing mitigation. It is important that the voting logic is glitch free. Otherwise, the glitchy TMR voting logic (which can occur due to mitigation across separate clock domains or hazardous combinational logic) can be caught by a clock edge. In low frequency designs, the probability of mitigation glitches being caught by clock edges is low. However, as we increase the frequency, so does the probability. Probability of glitch capture is based on clock speed, fan-out, and the width of the glitch (speed of the gate logic within the voting circuitry).
The following figures demonstrate a poor implementation of TMR using a 32-bit counter. A counter was chosen for this example due to the large fanout of the circuitry.
Figure 2: 32 Bit counter with TMR
Figure 3: Waveform Demonstrating the Effects of Glithes in the TMR Logic
3Synchronous Design with Asynchronous Events
It is important to note that SEUs are asynchronous. It is impossible to control when (or where) the event will occur. This is a point that is overwhelmingly overlooked.
Because this paper focuses on sequential Single Event Upsets (SEUs) within a synchronous design environment, it is important to discuss the affects of an asynchronous event (in which every internal DFF is susceptible) within a synchronous design environment. If the event occurs well within the clock period, then there may be enough time to deterministically correct the error (depending on the scheme and clock speed). However, if the SEU occurs near a clock edge, Metastability (or unpredictable events) can occur. Current mitigation schemes do not take this into account and thus decreases the anticipated reliability.
4MotivationforState Machine Fault Tolerance
DFFs play an important role within synchronous designs. They are contained within one clock domain and act as deterministic timing boundaries. Such a design strategy (synchronous techniques) increases the verification coverage of a circuit and enhances production.
Synchronous state machines use DFFs to hold its current state value. State Machines are generally used as controllers and are at the heart of most designs. If a synchronous state machine is not designed to accommodate Single Event Upsets, the circuit can become locked or produce unpredictable behavior until a reset is generated. Unfortunately, waiting for a system reset may not be a suitable solution. Thus, designers should be aware of techniques for error detection and perhaps correction specifically for erroneous state conditions.
It is important to note that current synthesis tools are specifically geared towards area and timing optimization. Their algorithms will want to “erase” redundant schemes for fault tolerance that the designer places within the HDL code. Therefore, the designer must also be familiar with the synthesis package of choice and apply the necessary directives for preserving redundant logic.
5Synchronous State Machine Implementation
Generally, state machines are utilized as controlling mechanisms within a design. They determine when signals should turn on or off, when to implement (or stop implementing) a function, how long to wait for events, etc…. However, state machines are not necessary to implement such functionality. As a matter of fact, using state machines has the tendency to increase the number of total gates within a design. In addition, if operating a state machine in a disruptive environment (such as a space mission), the entire system can lock up into an unreachable condition. So why use state machines? Utilization of state machines afford the designer: an easy to follow design methodology, manageable design reviews, and a hook into a systematic verification process. It also alleviates the propensity of creating “spaghetti “(out of control) designs.
5.1Structure
A synchronous state machine is designed to deterministically transition through a patternof defined states. A state is represented by a register (set of DFFs) and is referred to as the “current state.”
Figure 4: Traditional State Machine Logic Diagram
The structure consistsof four parts:
- Inputs: All inputs must be synchronous to a “synchronous” state machine. Inputs are used to determine next state transitions.
- Next state logic: Combinatorial logic used to generate the next stage (state) for the machine. The next state value is a function of the state machine’s inputs and its current state.
- CurrentState Register: Register of n-bit DFFs used to hold the current state of the machine. Synchronous state machines change state only by a clock edge.
- Output logic: The output can be purely combinatorial or registered (generally it ispreferable to register the logic). The outputs can be a function of the next state and/or current state (and perhaps the direct inputs) of the machine.
5.2Encoding Schemes
Each state of a state machine must be mapped into some type of encoding (pattern of bits). Once the state has been mapped, it is then considered a defined (legal) state. Depending on the encoding scheme and the number of states within the design, there can be “unused” encoded patterns – i.e. there is no defined mapping from the encoded pattern to a state. In such a case, if there is a fault within the circuitry and the state machine jumps into one of these unmapped (illegal) states, then the circuit has the potential to become locked (or stuck) in some unreachable condition (undefined state).
The current most popular encoding schemes used by designers are: One Hot, Binary, and Gray. The reader should be aware that there are many other schemes; however, this paper will only address Binary and One Hot – plus a special encoding technique specifically for correction.
Figure 5: Binary and One Hot Encoding Schemes - Mapped States vs. Unmapped States
5.2.1Binary
The Binary Encoding technique maps states into a base 2 counting scheme. (Log2(NStates))registers are required – i.e. for 5 states (NStates), 3 registers (NReg) are required. There will always be 2NReg – Nstates unmapped states.
Because 2NReg – Nstates will always be less than NStates (theory of binary of encoding), the probability of flipping into a mapped state is higher than flipping into an unmapped state (IDLE or error state was not including in this deduction). Plainly speaking, it is more likely to jump into “any” of the mapped states than the IDLE state (the word “any” brings the non-deterministic nature to this problem).
Figure 6: SEU and Binary Encoding Response
5.2.2One-Hot
The One-Hot encoding requires only one bit to be turned on at a time, i.e. each state is mapped to one bit. One Hot state-machines require more registers than Binary or Gray thus ultimately have more “illegal states”. However, it takes two bits to flip in order to transition into a mapped state. This makes detecting errors within One-Hot easy. If implemented correctly, this can make transitioning from an unmapped state deterministic.
5.2.3Binary vs. One Hot
There are many debates over which technique is best for fault detection implementation: one-hot vs. sequential. The two styles differ such that if a bit has flipped (a fault) the binary value of the one-hot encoding will no longer be “One-Hot” (more than one bit will be a ‘1’ or all bits will be ‘0’). Thus the error detection lies within the binary representation.
With Binary Encoding this is not true. The reasoning is that a flipped bit (a fault) can represent a “legal” state. Therefore, combinatorial error detection will not suffice – the logic needs memory to hold the parity of the good state.
Some designers believe that the use of the extra parity bit (extra DFF within the state machine) can be beneficial for fault detection in sequential machines. This is a false statement. There are many scenarios that can be presented that show the implications of assuming a level of fault detection while implementing sequential encoding and parity. Most of them are based on the premise that multiple paths within the circuitry have different routing delays and if an SEU happens near a clock edge, unpredictable results can occur. The obvious problem is that an incorrect transition can occur without any detection. This is a phenomenon of a “hamming distance of one” encoding scheme such as the sequential state machine. Employing a hamming-1 encode is not a recommended practice while implementing fault detection and correction.
6Fault Tolerant State Machine Implementation
The main objective of a fault tolerant state machine is to be able to detect an error (a flipped bit in the current state registers) and have a deterministic response within a deterministic time frame. The definition of the response and its response time is dependent on the design requirements. For example, the response may be as simple as indicating that an error occurred and waiting for a signal to bring the state machine to a known “working” state… or the response may be as complex as automatic correction of the error within one clock cycle.
6.1State Machines and the Bucket Approach
Most designers believe that they are implementing a bucket-approach to fault tolerance when using the default clause (VHDL- when others) within a CASE statement. However, the synthesis tools ignore these statements when synthesizing state machines. Most designers are not aware that these statements (that suggest “bucketing” unused states and supplying a transition out of a fault condition) have no bearing on the actual gates being created. The reasoning behind this is that if the synthesis tool were to implement all unused states, the required area would be excessive.
Synthesis tools have a directive that the designer can use called “safe”. When applying this directive to a state machine, the tool can produce a bucket of illegal states. However, the designer must be forewarned that using the “safe” directive has many flaws:
1. It generally creates more gates than desired, i.e. buckets of illegal states can become very large.
2. Synplify will implement a “safe” one-hot but Leonardo and Precision will only implement a sequential state machine (binary or Gray encoding) while using the “safe” directive.
6.2How Safe is the “SAFE” Option?
Recently, there has been a surge of designers using the “safe” options offered by several different synthesis tools. They have defined a safe state machine as one that will always transition to a known state - i.e., if an SEU occurs and an illegal or unmapped state is reached, one will recover to a known state.
Synthesis tools offer a “Safe” option (demand from the Aerospace industry):
TYPE states IS ( IDLE, GET_DATA, PROCESS_DATA, SEND_DATA, BAD_DATA );
SIGNAL current_state, next_state : states;
attribute SAFE_FSM: Boolean;
attribute SAFE_FSM of states: type is true;
Although this scheme may appear to be “fool-proof”, it is not. The “safe” option will not always transition to a reliable (or known) state. Some have defined the word “known” as mapped. However, the idea is to be deterministic. The idea is to always flip into some “ERROR” or “IDLE” state upon getting an SEU. Flipping into a mapped state, in which the rest of the system may not know the state machine is there, is not deterministic. It is not good enough to transition to a mapped state:
- The machine can get stuck waiting for an input that will never occur (rest of the system doesn’t know that you have flipped into this mapped state).
- Unexpected signals can turn on (or off).
- And many more reasons…
The compilers are geared towards implementing the “safe” option with Binary Encoding. In such a case, there exists a false sense of safety. For example: if an SEU transitions the state machine (i.e. one of the registers gets hit and flips) into a mapped (legal) state (however it is an illegal transition to that state) an extreme fault can occur – there can be severe output behavior with no error detection.
Most importantly, note that in a Binary Encoded State Machine there will always be a higher probability of flipping into a mapped state (“good”) state, than there is flipping into an unmapped state.
Unfortunately, Binary Encoded state machines can be faulty. The designer must consider the list of potential problems before using the safe directive as the method for fault detection.
An alternate approach beyond using the synthesis “safe” directive should be taken to ensure trust worthy fault tolerance. This paper will first address single-bit error detection techniques. Afterwards, a robust single-bit error correction scheme along with a new encoding technique will be presented.
6.3Single-Bit Error Detection and Recovery
This section will focus on detecting that an error has occurred within one of the state machine DFFs and then recovering from the fault. Remember that it is necessary for the designer to manually provide recovery logic.
6.3.1One-Hot
The beauty of the one hot encoding style stems from the fact that each state has a hamming distance of 2 (it takes 2 transitions to get from one state to another) thereby it inherently has SEU error detection. During normal operation of a one hot state machine, only one bit is turned on – indicating which state is current. Thus the current state register should always have an odd parity. Each transition requires 2 bits to flip (one bit turns off while one bit turns on). If there is an SEU within the state machine, only one bit will flip. The parity will then switch to even.
Example: Using the “five state” example, assume the circuit is in GET_DATA (“00010” - odd). AN SEU can cause the state machine to have the encoded pattern of “00000”. However, “00000” is not mapped to any state and is easily detected because it has even parity. It is thus considered an undefined or illegal state for the state machine. The same theory can be used if the state machine was in state GET_DATA and the current state flip flops changed to: “10010”.