Numerical Error: Atkinson defines absolute error as follows -
AE = error = xt - xa,
where xa approximates some exact value xt, and the relative error as
RE = relative error = (xt - xa)/xt.
We will also have occasion to use these logarithmic measures of error:
d-acc = “decimal places of accuracy”= - log10|error|
-acc = “digits of accuracy” = - log10|RE|
Let’s try and justify the d-acc & -acc labeling...
Suppose xa is accurate to at least the dth decimal place (i.e. |AE| 10-d),
where d is an integer. Then d-acc d, and conversely. In particular, since
d-acc [d-acc], with [ ] denoting the ‘greatest integer function’, it holds
that xa is accurate to at least [d-acc] decimal places.
If |AE| 10-d, then since |xt| 10e, |RE| 10-(e+d)-acc e+d.
Here, we’re defining xa to have at least k digits of accuracy*with respect to xt
if the magnitude of the error, |AE|, is less than or equal to 1 unit in the (k+1)st
digitcounting rightward from xt’s first nonzero digit. More precisely,
|AE| 10e-k, for e satisfying |xt| = 10e(d1.d2d3...dk...) with d1 1.
Note: This condition also implies |RE| 10-kso that -acc k.
*This is more stringent than Atkinson’s definition, i.e.1 unit vs. 5 units.
Claim: If -acc , then xa has at least-1 digits of accuracy wrt xt.
Proof: Suppose that -acc . Then -log10|RE| |RE| 10-
|AE| |xt| 10-< 10e-(-1) andso xa has the claimed accuracy.
Note: We can’t do better than - 1digits. Try xt = 5.000 and xa = 4.995.
Then |AE| = .5 10-2and-acc = = 3, but |AE|10-3 fails to hold.
Claim: Suppose xa is accurate wrt xt to at least the (st place counting
rightward from xt’s first nonzero digit, i.e. assume |AE| 10e-,
where |xt| = (d1.d2d3...) 10e with d11. Then -acc .
Proof: Since |RE| = |AE|/|xt| 10-, this follows immediately.
Atkinson’s #7 on pg. 43, but using the round-to-even rule & an arbitrary :
Claim: For x in the range of the arithmetic, that is if ßL |x| ßU( – ß1-n),
|(x - fl(x))/x| .5ß1-n/(1 + .5ß1-n) < .5ß1-n,
where “fl” is the round-to-even floating point operation.
Proof: Letx = sgn(x) ße (d1.d2d3...dndn+1...)ß,z = (d1.d2d3...dn)ß and
y = (.dn+1dn+2dn+3…)ß.
Then we know that fl(x) (and note that fl(-x) = -fl(x)) is either...
(1) sgn(x) ßez + sgn(x) ße ß1-n if y > .5 or if y = .5 & dn is odd,
or
(2) sgn(x) ßez if y < .5 or if y = .5 and dn is even.
Therefore,
(1) |x - fl(x)| = ßeß1-n(1 - y) and (2) |x - fl(x)| = ße ß1-ny.
Accordingly, in case (1),
|(x - fl(x))/x| ße ß1-n(1 - y)/(ße(1 + ß1-ny))
.5 ß1-n/(1 + .5 ß1-n)
< .5 ß1-n,
while in case (2),
|(x - fl(x))/x| ß1-ny/(1 + ß1-ny) .5ß1-n/(1 + .5ß1-n) < .5 ß1-n.
*Since g(y) = ß1-n y/(1 + ß1-n y) is strictly increasing for y > 0.
Rounding-up-fives (aka simple rounding) yields the same inequality while
chopping yields |x - fl(x)| = ße ß1-n y < ße+1-n therefore |(x - fl(x))/x| < ß1-n.
Regarding Addition: Suppose we want to compute S = 1in ai, where we
can ignore the input errors occurring when the ai's are floated, but not the
errors associated with the intermediate sums, i.e. the values
a1 + a2,a1 + a2 + a3, a1 + a2 + a3 + ... + an.
Set S2 = fl(a1 + a2) = (a1 + a2)(1 + 2), S3 = fl(S2 + a3) = (S2 + a3)(1 + 3),...,
Sn = fl(Sn-1 + an) = (Sn-1 + an)(1 + n). Then since
Sn = (a1 + a2) 2in (1 + i) + a33in (1 + i) + a44in (1 + i) +
…+ an-1 (1 + n-1) (1 + n) + an (1 + n)
= (a1 + a2) (1 + 2ini+ r2) + a3 (1 + 3ini + r3) +
…+ an-1 (1 + n-1+ n+rn) + an (1 + n),
where r2, r3, r4,…, are higher power expressions in 2, 3, …, it follows
thatif these expressions are ignored,Sn– S is approximately equal to
(a1 + a2) 2ini + a33ini +… + an-1(n-1+ n) + ann.
This suggests thatwe might minimizethe error if we first sort the ai's from
least to greatest in magnitude, i.e. |a1| |a2| |a3| . . . |an|, before adding.
Example: Add 1.21, 10.4, 30.4, 50.6, & 70.1. The exact value is 162.71,
but using 3-digit arithmetic with a round-to-even, yields the following -
least to greatest: 163. vs. greatest to least: 162.
rel. error = .00178... rel. error = .00436...