TO: Whom it may concern
FROM: Thomas R. Nicely (current e-mail address)
RE: Pentium FDIV flaw
DATE: 0900 GMT 19 August 2011
Freeware copyright (c) 2011 Thomas R. Nicely. Released into the public
domain by the author, who disclaims any legal liability arising from
its use.
Enumerated below are several questions that have frequently been posed
to me, regarding the discovery, nature, and implications of the
Pentium FDIV flaw. Each question is followed by my response.
Many of these questions were submitted by Dr. Denis Delbecq of the
Paris-based computer periodical "Science et Vie Micro."
/*************************************************************/
Q1: How can a user check a Pentium machine for the presence of the
bug?
/**************************************************************/
Perform Coe's calculation (see Question 5 below). That is, carry
out the following division problem:
4195835.0/3145727.0 = 1.333 820 449 136 241 002 5 (Correct value)
4195835.0/3145727.0 = 1.333 739 068 902 037 589 4 (Flawed Pentium)
The consequence of the flaw can be made more glaring by performing
the following related calculation, which is the one employed in
the test code provided in pentbug.zip.
4195835.0 - 3145727.0*(4195835.0/3145727.0) = 0 (Correct value)
4195835.0 - 3145727.0*(4195835.0/3145727.0) = 256 (Flawed Pentium)
The calculation can be done in BASIC, in a spreadsheet (such as
Quattro Pro, Excel, or Microsoft Works), in the Microsoft Windows
calculator, or in some other programming language such as Pascal,
C, or Fortran.
Make sure that the FPU has not been disabled (this usually has to
be done intentionally through some specific action). GW-BASIC
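and QBasic usually ignore the FPU. If you compile your code, turn
off all optimization, so that the division is actually executed by the
FPU at run time rather than precomputed by the compiler.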
I have provided a C source code, and corresponding DOS executable,
for the purpose of testing for the bug; see pentbug.zip.
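A minimal sketch of the residual test in C follows. It is not the code
shipped in pentbug.zip; the function names fdiv_residual and
report_fdiv_check are hypothetical, and the volatile qualifiers are an
assumption about how to keep a compiler from evaluating the constants at
build time or fusing the multiply and subtract.

```c
#include <stdio.h>

/* Returns x - y*(x/y) for Coe's operands. */
double fdiv_residual(void)
{
    /* volatile forces each value through memory, so the division and
       multiplication are performed at run time, one step at a time. */
    volatile double x = 4195835.0;
    volatile double y = 3145727.0;
    volatile double quotient = x / y;        /* the division under test */
    volatile double product = y * quotient;  /* stored before subtracting */
    return x - product;
}

/* Prints the residual and a one-line verdict. */
void report_fdiv_check(void)
{
    double residual = fdiv_residual();
    printf("residual = %g: %s\n", residual,
           residual == 0.0 ? "no FDIV flaw detected"
                           : "FDIV flaw suspected");
}
```

Calling report_fdiv_check() from main prints the residual. Per the
figures quoted above, a sound FPU should give 0, and a flawed Pentium
would be expected to give 256.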
/*************************************************************/
Q2: Could you summarize how you discovered the problem? Were you
doing research calculations or were you studying the problem of
accuracy with computers?
/**************************************************************/
RESPONSE: I was pursuing a research project in an area of pure
mathematics known as computational number theory. Specifically, I
have written a code which enumerates the primes, twin primes, prime
triplets, and prime quadruplets for all positive integers up to an
extremely large upper bound (currently 6.4*10^15). The totals are
written to a file at intervals of 10^10 (earlier 10^9). Also computed
are the sums of the reciprocals of the twin primes, the triplets, and
the quadruplets; each of these can be proved to converge to a limit,
but the limit of the sum of the reciprocals of the twin primes is
known imprecisely, and the others have not been previously
computed. Large gaps between consecutive primes have also been recorded.
Some of these results have been published in journal papers;
many others are being published and updated at my Web site,
http://www.trnicely.net,
where additional details and information are available.
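The kind of enumeration described above can be sketched, on a very small
scale, with a plain sieve of Eratosthenes. This is only an illustration
and not the research code: the names PrimeTotals and enumerate_primes
are hypothetical, the whole range is held in memory at once, and only
primes, twin-prime pairs, and the twin reciprocal sum are tallied.

```c
#include <stdlib.h>

typedef struct {
    unsigned long primes;        /* pi(limit): number of primes <= limit */
    unsigned long twin_pairs;    /* pairs (p, p+2) with p+2 <= limit */
    long double twin_recip_sum;  /* sum of 1/p + 1/(p+2) over those pairs */
} PrimeTotals;

/* Fills *out with the totals for 2..limit.
   Returns 0 on success, -1 if memory cannot be allocated. */
int enumerate_primes(unsigned long limit, PrimeTotals *out)
{
    unsigned char *composite = calloc(limit + 1, 1);
    if (composite == NULL)
        return -1;

    out->primes = 0;
    out->twin_pairs = 0;
    out->twin_recip_sum = 0.0L;

    /* Mark the multiples of each prime i, starting at i*i. */
    for (unsigned long i = 2; i * i <= limit; i++)
        if (!composite[i])
            for (unsigned long j = i * i; j <= limit; j += i)
                composite[j] = 1;

    /* Tally primes, and twin pairs lying wholly within the range. */
    for (unsigned long n = 2; n <= limit; n++) {
        if (composite[n])
            continue;
        out->primes++;
        if (n + 2 <= limit && !composite[n + 2]) {
            out->twin_pairs++;
            out->twin_recip_sum += 1.0L / n + 1.0L / (n + 2);
        }
    }
    free(composite);
    return 0;
}
```

For a limit of 1000 this reports 168 primes and 35 twin-prime pairs,
which can be checked against published tables in the same spirit as the
pi(x) checks described in this response.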
The code is written so that the computation can be distributed
asynchronously over a large number of independent systems, with
the final results combined upon completion. The calculation has run
for more than 13 years simultaneously on a number of systems (varying
from a few to more than two dozen, mostly Pentiums but with a few
486s and 386s); the first Pentium was added in March, 1994. As of
September, 2006, calculations have been completed to 6.4*10^15;
the latest version of the code, ported to GNU C, has a throughput
(for intervals near 6.4*10^15) of approximately 50 million integers
per second on a 3.0 GHz Pentium 4 (800 MHz FSB, 533 MHz DDR memory)
running under GNU/Linux.
Simultaneously with the calculation of the unknown quantities, a
number of checks are maintained by calculating previously published
values (such as pi(x), the number of primes <= x). As an additional
check, the reciprocal sums are computed by two different methods.
First, reciprocals are computed using the Intel x87 floating-point
co-processor unit (FPU, NPX), which provides 80-bit registers and a
64-bit significand, equivalent to 19S (19 significant decimal digits);
this is also referred to as extended precision or long double precision
(on other platforms, long double may represent a different accuracy,
e.g., 53 bits [16D] or 113 bits [33D]), or as an "extended real"
or "temporary real" data type. Secondly, the reciprocals are
calculated to 53 (earlier 26) decimal places (53D) using scaled
ultraprecision integer arithmetic; this is accomplished using a
modification of the BIGINT code written and contributed to the
public domain by Arjen K. Lenstra (with additions by Mark Manasse,
Marc Ringuette, and Mark Riordan) circa 1988-1991. Lenstra's
C code represents very large integers by arrays of smaller (typically
32-bit) integers; it also retains some minor dependency on
floating-point arithmetic. Lenstra's code is a precursor of, and
alternative to, the GMP (GNU multiple precision) library; eventually I
may replace my calls to Lenstra's code with calls to GMP.
Calculations began in early 1993. On 13 June 1994, the outputs of several
runs were assembled, and I found that the computed value for pi(2*10^13)
disagreed with the published value. This led to a long search for
logic errors and sources of reduced precision in my source code (some
3000 lines in all). In the process, I found that the Borland C++ 4.02
compiler was producing erroneous code when compiling in 32-bit mode with
certain optimizations (-Op -Om -Og) enabled. For some time I
believed this to be the source of my woes.
After eliminating this source of error, and rewriting the
code to convert certain floating-point calculations from double
precision to long double precision, I put the revised code into use on
10 September. To my dismay, I soon discovered (on 4 October) that I was
now encountering a new error, a discrepancy in the long double (sum of
the) floating-point reciprocals returned by the x87 FPU. The results for
the first trillion, as computed on the Pentium-60, differed from the
results obtained on a 486DX-33 by an amount orders of magnitude in excess
of that expected from rounding or truncation error accumulation
(the floating-point and ultraprecision sums also differed, but
by an amount less than the expected floating-point noise). Through
trial and error and finally a binary search, the discrepancy was
isolated to the twin-prime pair (824633702441, 824633702443), which
was producing incorrect floating-point reciprocals (the ultraprecision
reciprocals were also in error, by a lesser amount, evidently due to
the incidental dependency on floating-point arithmetic in Lenstra's
original integer arithmetic code).
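The idea of checking a floating-point reciprocal against integer-only
arithmetic can be sketched as follows. This is not Lenstra's BIGINT code
or my production code: the function names are hypothetical, the integer
reference is a simple shift-and-subtract division that fits in 64 bits,
and the choice of a 100-bit scale factor in the usage note is an
assumption suited to divisors near 8*10^11.

```c
#include <stdint.h>

/* floor(2^shift / p), computed by shift-and-subtract long division using
   integer operations only. Assumes 1 < p < 2^63 and that the quotient
   fits in 64 bits. */
uint64_t scaled_reciprocal(uint64_t p, int shift)
{
    uint64_t remainder = 1, quotient = 0;
    for (int i = 0; i < shift; i++) {
        remainder <<= 1;
        quotient <<= 1;
        if (remainder >= p) {
            remainder -= p;
            quotient |= 1;
        }
    }
    return quotient;
}

/* Relative disagreement between the floating-point reciprocal 1/p and
   the integer reference above. */
long double reciprocal_mismatch(uint64_t p, int shift)
{
    volatile long double divisor = (long double)p;
    volatile long double fp = 1.0L / divisor;   /* value under test */

    long double scaled = fp;
    for (int i = 0; i < shift; i++)
        scaled *= 2.0L;                         /* exact scaling by 2^shift */

    long double reference = (long double)scaled_reciprocal(p, shift);
    long double diff = scaled > reference ? scaled - reference
                                          : reference - scaled;
    return diff / reference;
}
```

On a sound FPU, reciprocal_mismatch(824633702441ULL, 100) should be no
larger than ordinary rounding noise. A mismatch many orders of magnitude
above that level is the kind of symptom described above.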
My first conjecture was that the error was again an artifact of the
Borland compiler, but even completely disabling optimization failed
to eliminate the problem. Tracing the source of the error was
further complicated by the fact that on one occasion I tested the
code with the Pentium FPU locked out, and the error was still
present (this never happened again, and was apparently due to my
own failure to properly disable the FPU). I also had to consider that
the problem might lie in the PCI bus on the Pentiums, rather than in
the CPU. After all, a
number of Pentium PCI systems had been reported in the trade press as
corrupting data due to faulty design of the interface with the PCI
bus (this was especially true of Intel motherboards using the
Neptune chipset).
The final pieces of the puzzle fell into place during the week of
16-22 October. On 17 October I gained access to a second Pentium,
which had a motherboard from a different manufacturer. The error
was present in this machine as well. During 17-19 October, I
reproduced the error in a code written in Power Basic, eliminating
the C compiler as a cause. I reproduced the error in a Quattro Pro
spreadsheet, and also verified that the error disappeared when the
FPU was locked out in real-mode DOS (this is difficult to do in
Windows code or 32-bit code, which I was using for my main
application). On 21 October, I ran the test code on a 486DX2-66
with a PCI bus; when no error appeared, I felt that the PCI bus had
been eliminated as a cause. On 22 October, I tested the code on
still a third Pentium on display at Staples, a local office supply
store; this Packard-Bell machine also produced the error. I was
now certain that the error was in the FPU of the Pentium chip.
On or about 19 October, I contacted tech support at Micron, Inc.,
from whom I purchased my system, but they were unable to provide me
with any information regarding the problem. On 24 October, I
contacted Intel tech support. After six days, they still had no
answer to the problem, only an informal acknowledgement that the error
had been reproduced there on a 66-MHz Pentium. On 26 October, I mailed
a copy of the bug demonstration codes to Mr. Tim Wetzel at Micron tech
support (no reply was ever received). On 27 October, I provided a
colleague with a copy of the test code; her husband is an engineer in
the nuclear reactor group at the local firm of Babcock and Wilcox.
A. B. Copsey of Babcock and Wilcox Nuclear Technologies reported to me
on 28 October that their new P90 Gateway Pentiums all appeared to have
the bug (this was the first e-mail exchanged in regard to the bug).
In the absence of any meaningful response from Intel or Micron,
on 30 October I sent e-mail (see the file bugmail1.html) to a number of
individuals and organizations who I felt would have access to many
other Pentium systems, and asked them to check for the problem.
I believe you are aware of events from that point on.
/**************************************************************/
Q3: Can you reveal the parties to whom you addressed the initial
e-mail inquiries of 30 October 1994? Were any parties informed
of the problem prior to that date? This knowledge might be useful
in tracing the process by which the public became aware of the
problem.
/***************************************************************/
Actually, I did not maintain a definitive list of these parties.
It did not seem of any importance at the time, and I merely chose
some individuals and organizations who I felt would be likely to
have access to numerous Pentiums and other systems, so that they
might test for the error. Following below is the chronology as
best I can now reconstruct it.
March 1993 Approximate date of beginning of calculations.
March 1994 Pentium system is added to computational group.
13 June 1994 A discrepancy (incorrect count of primes < 2*10^13) is
first noted in the research code results. Some
of my department colleagues are made aware of this.
This initial error was probably related to the FPU FPREMx
instructions through the C fmod and fmodl functions.
June-Sept A long process attempting to pinpoint and eliminate
the error is carried out. Large parts of the code
are rewritten; the Borland compiler bugs are accounted
for; other sources of potential error are eliminated.
10 September The revised code is put into production.
4 October A new error is noticed: the FPU values for the sum of
the reciprocals of the twins, as computed on the Pentium-60
and a 486DX-33, diverge within the first 10^12. After
several days, the discrepancy is tracked down to
the twin prime pair (824633702441, 824633702443), and it
is noted that the elementary operation 1/824633702441 is
returning an incorrect value from the FPU in C++.
17 October The code is tested on a colleague's brand new Pentium,
and the same error is noted. The error does not appear
on 486s. The new Pentium has an Intel motherboard; mine
has a Micronics motherboard.
18 October The error is reproduced on the Pentium in Power Basic
and Quattro Pro, thus is not language dependent. It
disappears when the FPU is disabled.
19 October Results of 18 October are confirmed on the new Pentium
system. I inform tech support at Micron Computers of
the problem, but they have no explanation.
21 October A 486 system with a PCI bus does not show the error,
eliminating the PCI bus as the source.
22 October A third Pentium system displays the error---a Packard-Bell
system on display at Staples' office supply. It is
confirmed in the Microsoft Works spreadsheet.
24 October I call Intel tech support and inform them of the problem.
A response is promised within a few days. A colleague in
Great Britain is informed of the problem by letter (he
did not receive the letter for several days, and was
apparently unable to gain access to a Pentium to check
for the bug).
26 October I mail floppy disks containing test codes for the bug to
Micron tech support (Tim Wetzel). No response is ever
received. Also about this time, my colleague informed
Insight, Inc. tech support that his new Pentium had the
problem (with no substantive response).
27 October I give a floppy disk containing copies of the bug detection
codes to a colleague whose husband works at Babcock &
Wilcox Nuclear Technologies (Lynchburg, Virginia; later
known as BWXT/Framatome).
28 October A. B. Copsey of BWNT informs me by e-mail that their new
Gateway P90s all show the bug, using my test codes. This
is the first e-mail transmission on the subject.
30 October With no substantive response to this point from either
Micron tech support or Intel tech support, I dispatch my
initial e-mail inquiry (see the file bugmail1.html) to
several individuals and groups (at approximately 3:20:49
pm EST). The following listing is approximate:
1 Andrew Schulman
2 Ralf Brown
3 David Maxey
4 Jim Kyle
5 Raymond J. Michaels
6 Tom Halfhill (Byte magazine)
7 Ziff-Davis Labs
8 Spencer Katt (PC Week)
9 <157.9301@mcimail.com>
10 Brett Glass (Infoworld)
11 John Dvorak (PC Magazine)
12 Robert X. Cringely (Infoworld)
The first five are the authors of "Undocumented DOS,"
2nd edition. I'm not sure who had address 9; research
by Gideon Yuval indicates that it may have been the
address for PC Magazine's "Letters to the Editor"
section. Most of the above parties never responded (of
the trade publications, only Tom Halfhill responded,
saying he would refer the inquiry to Byte's labs).
Robert X. Cringely apparently refused to use or
acknowledge the information, on the rather curious
grounds that my request for attribution constituted a
copyright, and was also unprofessional. So far as I
know, only Andrew Schulman made a real effort to
investigate the problem; he forwarded the inquiry to
Richard Smith at Phar Lap, Inc. Ralf Brown also sent a
response.
31 October An Intel engineer calls and asks that a diskette with
copies of the bug detection codes be shipped to them
Fed Ex overnight. The package is sent out at about
6:30 pm EST.
1 November Richard Smith of Phar Lap posts the original inquiry
on the Canopus forum of Compuserve. For further details
on the early propagation of the inquiry message on the
Internet, see Smith's account at rsmith.html.
2 November Richard Smith informs me that the bug has been detected
on some of Phar Lap's Pentiums (others had apparently
already received replacement chips some weeks earlier).
2 November An Intel program manager calls to acknowledge receipt
of the bug codes. He says my analysis is essentially
correct, that Intel itself had noticed the problem during
its own testing (I later learned that this apparently
happened during testing of a similar FPU intended for
the P6, in May or June of 1994), and that a new stepping
with the problem fixed is out in sample quantities. He
offers to ship two replacement chips (one for my system and
one for my colleague's).
2 November I receive an inquiry from Alexander Wolfe of Electronic
Engineering Times regarding the flaw.
3 November The two replacement chips arrive.
4 November I install one of the replacement chips in my home Pentium.
First tests indicate that the bug has been fixed.
7 November Alexander Wolfe's article appears in Electronic Engineering
Times. The matter is now fully public.
21 November Steve Young, chief financial correspondent for CNN Cable
News, is the first mainstream media journalist to break
the story of the Pentium FDIV flaw and its implications
for Intel. The story is then picked up by other national
and international media.
30 November Intel releases an in-house study of the flaw, "Statistical
Analysis of Floating Point Flaw in the Pentium Processor
(1994)," H. P. Sharangpani and M. L. Barton, Intel
Corporation. This study minimizes the potential impact
of the flaw on the vast majority of users, a conclusion
with which I largely agree.
12 December IBM releases its own study of the potential impact of
the flaw, challenging Intel's analysis and concluding
that the flaw will seriously impact the work of a large
number of users both within and outside the scientific
community. My own analysis is closer to Intel's position.
20 December In response to a firestorm of public opinion, Intel
announces plans for a total recall, replacement, and
destruction of the flawed Pentium processors.
17 Jan 1995 Intel announces a pre-tax charge of 475 million dollars
against earnings, ostensibly the total cost associated
with replacement of the flawed processors.
/**************************************************************/
Q4: In which fields of mathematics and numerical models could the
FDIV roundoff error significantly reduce confidence in the
results? Many people talk about the formulas that demonstrate
the problem.
/***************************************************************/
RESPONSE: Clearly, computational number theory is one area
affected. Other areas with the potential for major difficulties
include computations in chaos theory (non-linear dynamics), linear
programming or finite element analysis (where ill-conditioned
matrices may be involved), and areas requiring numerical solution
of differential equations by iterative methods (if high precision
is required in the extrapolated result, as in orbital dynamics).
Bear in mind, however, that the likelihood is 1000 to 1000000 times
greater that any erroneous results obtained on a Pentium are due to
software errors, rather than any error in the CPU. For the average
user, I do not believe the bug has a significant impact,
particularly in comparison to other sources of error.
However, for users in mathematics, science, and engineering, we
must each be our own judge as to the danger posed by the bug. In
any case, whether you are using the Pentium or some other CPU,
mission-critical applications and those which may affect the health
and welfare of others should be performed in duplicate, preferably
on systems with different CPUs, operating systems, and application
software.
/***************************************************************/
Q5: What does this FDIV problem signify at the logical level of
the FPU? Does it occur with some specific mantissa schemes?
/***************************************************************/
RESPONSE: The difficulty apparently arises from an error in the
lookup tables used to implement the hardware division algorithm;
five cells in a lookup table were accidentally left blank. The Pentium
apparently attempts to use a much more aggressive algorithm for
hardware floating-point division than did the 486; this is
indicated by the fact that it uses only about half as many clock
cycles per floating-point division. Evidently the 486 is
attempting to generate one bit of the quotient per iteration, while
the Pentium attempts to generate two bits per iteration. According
to Coe and Intel, the critical denominators (those that might produce
a flawed division or remainder) are those that have bits 5 through 10
inclusive set in the most significant word of the mantissa (significand)
of the 80-bit IEEE temporary
real representation (employed by Intel x87 numeric coprocessors); this
is borne out by my own experience. Thus problem denominators can be
identified by masking the most significant word of the denominator
mantissa. Only a small portion of even these mantissas produces an error.
The sign and exponent are irrelevant. The worst case error is the one
first discovered by Coe: 4195835.0/3145727.0 is returned correctly to
only 12 matching bits and 14 significant bits (the 5th decimal digit and
all beyond are in error; the flawed result is accurate to only five
significant digits; the difference in the two binary values is
zero through the 14 leading bits):
4195835.0/3145727.0 = 1.333 820 449 136 241 002 5 (Correct value)
4195835.0/3145727.0 = 1.333 739 068 902 037 589 4 (Flawed Pentium)
So far as is known, this is the worst-case error possible (in a simple
long double division x/y of floating-point numbers x and y) as a result
of the flaw. Reports of the fourth decimal digit being in error are
simply variations of the above example (e.g., multiply the numerator
by 5; the results will now differ in the fourth significant digit,
and the fifth digit [fourth one to the right of the decimal point]
is no longer significant, but the relative error is still the same,
and the number of matching and significant bits is still the same).
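The bit counts quoted above can be checked from the two decimal values
with a short sketch. The function names are hypothetical, both values
are assumed to lie in [0, 2), and double precision is used because it is
ample for the leading bits in question.

```c
/* Number of leading binary digits, starting with the units bit, that
   two values in [0, 2) have in common. */
int matching_leading_bits(double a, double b, int max_bits)
{
    int count = 0;
    while (count < max_bits) {
        int bit_a = (int)a;      /* current leading bit: 0 or 1 */
        int bit_b = (int)b;
        if (bit_a != bit_b)
            break;
        count++;
        a = (a - bit_a) * 2.0;   /* move the next fraction bit up */
        b = (b - bit_b) * 2.0;
    }
    return count;
}

/* Number of leading bit positions, starting with the units bit, in
   which the binary expansion of |a - b| is zero. */
int zero_bits_in_difference(double a, double b, int max_bits)
{
    double diff = a > b ? a - b : b - a;
    double weight = 1.0;         /* weight of the bit being examined */
    int count = 0;
    while (count < max_bits && diff < weight) {
        count++;
        weight /= 2.0;
    }
    return count;
}
```

With a = 1.3338204491362410025 and b = 1.3337390689020375894, the first
function returns 12 and the second returns 14, in agreement with the
figures given above for Coe's example.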
Note that the FPU instructions FPREM and FPREM1 (floating-point
remainders, as called by fmod in C) are also subject to the bug.
In fact, it was probably one of these that caused my original
13 June error, rather than the FDIV instruction; all these
instructions rely on the same hardware divider unit.
A more detailed analysis of the flaw can be found in the papers
cited in the bibliography at the end of this document.
/****************************************************************/
Q6: Do your calculations of the relative frequency of the error
agree with those publicized by Intel?
/****************************************************************/
RESPONSE: Yes, within an order of magnitude. Intel quotes an error
rate of about 1 in 8.77*10^9 random divisions. The exact frequency
depends on the type and precision of the operands; single-precision
reciprocals, for example, are always returned correctly.
Note, however, that many authorities consider statistical sampling
rates to be unrepresentative of the problem, since the values
appearing in a particular application may not constitute a random
sample of all possible mantissas. In particular, the analysis
publicized by IBM on 12 December 1994 claims that the numerical values
appearing in spreadsheets are heavily biased toward the bit
patterns subject to error, and that consequently the error occurs
thousands or millions of times more often in common usage than is
indicated by Intel's "White Paper" analysis. I personally regard
Intel's analysis as more realistic, if a bit optimistic (as I stated
in my San Francisco Examiner article of 18 December 1994, I would be
surprised if the average user noticed any effect from the error within
the lifetime of the chip). Aside from Intel's analysis, there is
one compelling piece of empirical evidence to support the belief that
the error is not of consequence to most users: after over a year of
worldwide use of Pentium systems, not a single one of roughly a million
users had noticed the error. Thus either the error is inconsequential
for almost all users, or almost all users are extremely sloppy in their
work. Over a period of five years at my workplace, no person was ever
able to collect a reward offered for exhibiting (other than with a code
artificially contrived to demonstrate the error), on either of two
publicly available systems intentionally left with flawed CPUs
installed, an error caused by the flaw.
The actual number of different division problems (long double operand
pairs, excluding pairs which include or produce denormals) which produce
an erroneous result appears to be roughly 3*10^37, out of a total of
2.28*10^47 possible such pairs.
/****************************************************************/
Q7: Do the replacement Pentium chips you received from Intel
appear to eliminate the bug?
/****************************************************************/
RESPONSE: Yes. I have tested the replacement chips with billions
of divisions and reciprocals involving the critical bit patterns,
and have observed zero errors. The critical cases, such as my
original example and Tim Coe's example, have also been tested
individually.
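A test of this general kind might be sketched as shown below. This is
not the code I actually ran: it uses IEEE double precision rather than
long double, the helper names and the xorshift generator are
hypothetical choices, the divisors merely have the six masked fraction
bits forced on (only a small portion of such divisors would trigger the
flaw), and the 1e-12 relative threshold is an assumption that would
catch only the larger errors.

```c
#include <stdint.h>
#include <string.h>

/* xorshift64 pseudo-random generator; the state must be nonzero. */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Builds a double in [1, 2) from 52 random fraction bits. If critical
   is nonzero, fraction bits 5 through 10 are forced to 1. Assumes the
   IEEE 754 binary64 layout. */
static double make_operand(uint64_t random_bits, int critical)
{
    uint64_t bits = UINT64_C(0x3FF0000000000000) | (random_bits >> 12);
    if (critical)
        bits |= (uint64_t)0x3F << 42;
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Performs the given number of divisions x/y with critical-pattern
   divisors and counts those whose residual x - y*(x/y) is far larger
   than rounding error. */
long count_suspect_divisions(long trials, uint64_t seed)
{
    uint64_t state = seed ? seed : 1;
    long suspects = 0;
    for (long i = 0; i < trials; i++) {
        volatile double x = make_operand(next_random(&state), 0);
        volatile double y = make_operand(next_random(&state), 1);
        volatile double quotient = x / y;       /* division under test */
        volatile double product = y * quotient;
        double residual = x - product;
        if (residual < 0.0)
            residual = -residual;
        if (residual > 1e-12 * x)
            suspects++;
    }
    return suspects;
}
```

On a sound FPU the count should be zero for any number of trials.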
/***************************************************************/
Q8: What about the so-called "workarounds" for the bug?
/***************************************************************/
RESPONSE: The workaround finally recommended by Intel is to replace
each division operation by a function call. The function checks the
divisor for the critical bit pattern; if it is not present, the result
of a normal division is returned; if the critical pattern is found,
the numerator and denominator are each multiplied by 15/16 before
the division is performed. The factor 15/16 was determined to shift
critical bit patterns to benign ones, while it does not shift any
benign critical bit patterns to erroneous ones. The replacement
function for long double division in C might look like the following.
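long double ldQuotient(long double ldNumerator, long double ldDenominator)
{
    /* Assumes the 80-bit x87 format stored little-endian: 16-bit words
       0-3 hold the 64-bit mantissa, word 4 the sign and exponent. */
    unsigned short int ui, *uip;
    uip = (unsigned short int *)(&ldDenominator);
    ui = *(uip + 3);    /* most significant word of the mantissa */
    if ((ui & 0x07e0)==0x07e0)    /* critical bit pattern present */
        return(((15.0L/16.0L)*ldNumerator)/((15.0L/16.0L)*ldDenominator));
    else
        return(ldNumerator/ldDenominator);
}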
Variations are required for other precisions and for the remaindering
functions fmod and fmodl.
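As an illustration of one such variation, here is a sketch for double
precision. It is not Intel's code. It assumes the IEEE 754 binary64
layout, in which the leading mantissa bit is implicit, so the six bits
selected by the mask 0x07e0 above fall at bit positions 42 through 47 of
the 64-bit pattern; the function names are hypothetical, and memcpy is
used to read the bits without pointer casts.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if fraction bits 5 through 10 of the divisor are all set,
   0 otherwise. Assumes the IEEE 754 binary64 layout. */
int has_critical_pattern(double denominator)
{
    const uint64_t mask = (uint64_t)0x3F << 42;
    uint64_t bits;
    memcpy(&bits, &denominator, sizeof bits);
    return (bits & mask) == mask;
}

/* Division with the 15/16 rescaling applied to suspect divisors. */
double safe_quotient(double numerator, double denominator)
{
    if (has_critical_pattern(denominator)) {
        const double scale = 15.0 / 16.0;
        return (scale * numerator) / (scale * denominator);
    }
    return numerator / denominator;
}
```

For example, has_critical_pattern(3145727.0) is 1, so Coe's division is
rescaled, while has_critical_pattern(3.0) is 0 and that division is
performed directly. Note that scaling in double precision may itself
round when the operands use all 53 mantissa bits, so a production
version would need more care than this sketch takes.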
Of course, the workaround only succeeds in applications whose code
has been rewritten and/or recompiled, and reshipped since the bug
appeared. Updated versions of some compilers provide the developer with
the option of automatically trapping each division for the flaw (via a
compilation switch such as -fp). Previously existing binaries can avoid
the bug only by locking out the FPU (e.g., by setting 87=NO and NO87=NO87
in DOS, or by setting the emulation (EM) bit in the machine status word of CR0
otherwise, as can be done using utilities which have been made
available by several companies, including Compaq). It is also
possible to trap the relevant instructions with a TSR or a VxD,
then check for and correct erroneous operations, but this apparently
slows the machine down almost as much as locking the FPU out.
The workaround slows the machine down slightly, perhaps 20 % (this
is application dependent). Locking out the FPU may slow the
machine down by a factor of five or ten, depending on the
application; and some applications will not function without an
active coprocessor present.
/***************************************************************/
Q9: Why do you think this particular bug has received an
inordinate amount of publicity, making it such a public relations
nightmare for Intel?
/***************************************************************/
I believe several factors contributed to this phenomenon.
* Intel's initial failure to publicize the problem, even in a
listing of errata to their OEMs and most valued customers, was
in retrospect a mistake which alienated these constituencies.
* Even more baffling, Intel failed to warn their tech support
desk to immediately report any external complaint about the
bug, so that it could be given special handling.
* Intel's subsequent response, once the bug had been detected
independently, was considered unsatisfactory by nearly
everyone outside the company.
* The Pentium CPU has been the subject of a high-profile
advertising campaign by Intel.
* In contrast to most previous errors found in CPUs, this one
occurs in an elementary, frequently-used operation which is
easy to demonstrate to the non-specialist, even those who have
little or no computer training.
* The bug was found late in the life cycle of the chip, after
millions of them were already distributed or in production.
* The existence of the Internet, and its current widespread
availability, caused the news and the reaction to Intel's
response to spread much more rapidly than for previous bugs.
Unfortunately, many of the Internet discussions generated more
heat than light.
* One of Intel's principal competitors decided it was in their
interest to publicize an estimate of the flaw's impact which
I believe to be exaggerated, and in obvious disagreement
with user experience prior to public knowledge of the flaw.
/***************************************************************/
Q10: Do you believe Intel's eventual total recall of the flawed
chips on 20 December 1994 was appropriate?
/***************************************************************/
Certainly, it was appropriate from a public relations standpoint.
My own feeling is that a great many people overestimated the
importance, impact, and peril of the flaw; for example, I consider
IBM's analysis of 12 December 1994 to be a serious exaggeration of the
impact of the flaw. Intel's action, under tremendous pressure from
customers, establishes a new level of accountability in the industry.
If chip manufacturers such as Intel, IBM, and Motorola are now to be
expected to offer unconditional replacement of a chip each time a new
flaw is found, we may very well see prices and/or time to market greatly
increase. There may unfortunately be an even greater incentive for
the manufacturers to keep the discovery of flaws secret. We could even
see a two-tiered pricing system, with one price for chips "as is" and
a much higher one offering unlimited replacements.
My own feeling is that fewer than 10 % of all users needed to have any
real concern about the flaw, and probably fewer than 1 % would
actually be impacted by it. Thus, in one sense, the recall is a waste
of resources at a time when society in general can ill afford such an
extravagance; it was simply not worth more than 100 million dollars to
correct this flaw (Intel announced on 17 January 1995 a pre-tax charge of
475 million dollars against earnings, ostensibly the total cost
associated with replacement of the flawed chips).
Even more distressing is Intel's decision to destroy the flawed chips,
rather than donating them (as is, without liability, not for sale or
resale) to educational and non-profit institutions. This is an even
worse instance of waste than Apple Computer's decision some years ago
to bury the last few thousand Lisas in a landfill.
/***************************************************************/
Q11: What lessons should be learned by the general public from this
experience?
/***************************************************************/
I would hope that computers and computer analysis would lose some of
the aura of invincibility with which they have been treated. Computer
generated results need to be treated with some enlightened skepticism.
No system or microprocessor can be expected to produce results which
are absolutely reliable.
Computations which are mission critical, which might affect someone's
life or well being, should be carried out in two entirely different
ways, with the results checked against each other. While this still
will not guarantee absolute reliability, it would represent a major
advance. If two totally different platforms are not available, then
as much as possible of the calculations should be done in two or more
independent ways. Do not assume that a single computational run of
anything is going to give correct results---check your work!
At the same time, we must be conscious that the chips are one of the least
likely sources of error; user input, application software, system
software, and other system hardware are much more likely to cause errors.
This is an even better reason for running check calculations. Few users
are aware that even electromagnetic or particle flux can cause errors.
Since the Pentium flaw affair, I have encountered machine errors on more
than fifty other occasions. These were due to defective memory chips,
soft memory errors, disk subsystem malfunctions, and possibly
operating system errors. In several of these instances, I had no reason
to be suspicious of the result except that a second machine produced a
different result.
Thomas R. Nicely
Bibliography
- Statistical analysis of floating point flaw in the Pentium Processor
(1994). H. P. Sharangpani and M. L. Barton, Intel Corporation
(30 November 1994). This is Intel's "White Paper."
- Inside the Pentium FDIV bug. Tim Coe. Dr. Dobb's Journal (April 1995)
#229, pp. 129-135 and 148.
- Computational aspects of the Pentium affair. Tim Coe, Terje Mathisen,
Cleve Moler, and Vaughan Pratt. IEEE Computational Science & Engineering
(ISSN 1070-9924, March 1995) Vol. 2, #1, pp. 18-31.
- Higher-radix division using estimates of the divisor and partial
remainder. Daniel E. Atkins. IEEE Transactions on Computers
C-17:925-935 (1968).
- A zipfile containing the C source code and corresponding DOS
executable for a program which will check for the flaw. Thomas R.
Nicely (26 April 2003).
- Original e-mail message announcing the discovery of the Pentium
division flaw. Thomas R. Nicely (30 October 1994).
- An account of the spread of the Pentium flaw announcement across the
Internet during the first few days. Richard M. Smith, President of
Phar Lap Software, Inc. (27 December 1994).
- Pentium study. IBM Research, IBM Corporation
<ibmstudy@watson.ibm.com> (12 December 1994).
- The Pentium division flaw. Thomas R. Nicely. Virginia Scientists
Newsletter (April 1995) Vol. 1, p. 3.
- Untitled newspaper article concerning the Pentium division flaw.
Thomas R. Nicely. San Francisco Examiner, San Francisco CA USA
(18 December 1994) p. B-5.