Benford's Law: A Filter for Fraud
Lauren Clarke
It's harder than you may think to make up credible numbers. More importantly, it's harder to make up credible numbers than the average crook thinks it is. People tend to have incorrect ideas of what constitutes a "random" set of numbers. This fact can work to your advantage if you're looking for possible fraud in sets of numerical data. In this article, Lauren Clarke gives you some tools to spot odd patterns in data and perhaps bring fraudulent data to light.
Recent events in the financial world have turned a spotlight on how numbers can be misleading. While Enron's troubles seem more related to questionable accounting practices than to fabrications of raw data, they do serve as an example of what can happen if the data used to inform mission-critical decisions is at odds with reality. Those of us who collect and preside over transactional datasets do well to consider the impact poor data would have on the organizations we work with. It's in our best interests to make an effort to ensure that data is accurate and free from fraud. Typical referential integrity checks and file integrity checks are excellent approaches to catching bugs or file system problems that can lead to bad data.
But consider that your adversary may not be a hardware or software glitch, but rather a thinking, plotting human being who has reasons to subvert the data you're so carefully storing. How would you take reasonable steps to detect such an effort? This article will give you some tools to analyze data at a deeper level and, to some extent, automate the process of guarding against fraudulent data.
Digital Analysis (DA) is the term used for analysis done on sets of numerical data to look at certain patterns in the digits of the numbers. For example, the high cost of a $12,345.67 toilet seat isn't remarkable from a DA perspective. What is remarkable about that cost is that it contains the digits "1234567" in ascending order. Coincidence? Maybe, but what if a large proportion of the costs in a given set had digits that were ordered that way? It might give you cause to dig a little deeper into those numbers.
Digital Analysis can help in detecting things like "rounded" numbers, replicated numbers, and numbers that don't match expected digit patterns. It can be used to detect fraud, errors, inefficiencies, and even software bugs. An integral part of DA is a little-known law of numbers known as Benford's Law. Benford's Law predicts the proportions of certain significant digits in many different sets of data. Many DA techniques leverage this information as a tool for detecting sets of numbers that don't behave as expected.
A brief history of the law
In 1881, an astronomer named Simon Newcomb published a paper that described something he'd noticed in the logarithm books used at the time to aid in the multiplication of large numbers. In those days, when you needed to multiply some large numbers together, you didn't double-click on Excel. Rather, you walked to the library and looked your numbers up in a logarithm reference, added the logs, and then looked up the anti-log of the sum.
He noted that the pages corresponding to numbers that started with 1, 2, and 3 were more worn than those for numbers starting with higher digits. In fact, the wear on the pages decreased monotonically as the first significant digit increased from 1 to 9. (Note that the numbers 876 and 0.0876 both have the same first significant digit: 8.) Because these logarithm books were used by many different people, Newcomb concluded the odd page wear must be because numbers in general were more apt to start with lower digits than with the higher ones. He also made a conjecture (without proof) that the probability of any number selected from the measurements of objects in the universe starting with the digit "d" was as shown in Formula 1.
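In standard notation (the notation is an assumption; the values it produces match Table 1), the first-digit law reads:

```latex
% d is the first significant digit, d = 1, 2, ..., 9
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right)
```

For d = 1, this gives log10(2), or about 0.301.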
Table 1 lists the probability predicted by this formula.
Table 1. Probability for first digits as predicted by Benford's Law.
Digit / Probability
1 / 0.301
2 / 0.176
3 / 0.125
4 / 0.097
5 / 0.079
6 / 0.067
7 / 0.058
8 / 0.051
9 / 0.046
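As a quick check, the values in Table 1 can be reproduced in a few lines. This is a minimal Python sketch rather than part of the article's FoxPro code; the function name benford_first_digit is made up for illustration.

```python
import math


def benford_first_digit(d: int) -> float:
    """Expected share of numbers whose first significant digit is d (1-9)."""
    if not 1 <= d <= 9:
        raise ValueError("first significant digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Rebuild Table 1, rounded to three decimals as in the article.
table_1 = {d: round(benford_first_digit(d), 3) for d in range(1, 10)}

for digit, probability in table_1.items():
    print(f"{digit} / {probability:.3f}")
```

The nine probabilities sum to 1, because the product of the terms (1 + 1/d) for d = 1 through 9 is exactly 10.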
This isn't what one would expect at first guess. Given that numbers can start with the digits 1-9, most people would likely guess that the probability of a particular digit being first would be the same for all digits, namely 1/9 or about 11 percent. Newcomb's hypothesis contended that the first digit of numbers that were measurements of things tended to be "1" more than 30 percent of the time, certainly not an intuitive result.
Years later (57 to be exact), an engineer at GE named Frank Benford independently discovered the same pattern of wear in his logarithm books. He took it a step further and collected data from a number of different sources, ranging from baseball statistics to areas of bodies of water to populations of states. His 20,229 data points fit Formula 1 very well. Table 2 gives some of his results.
Table 2. First-digit distributions from empirical data collected by Benford.
Data source / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Rivers, Area / 31.0 / 16.4 / 10.7 / 11.3 / 7.2 / 8.6 / 5.5 / 4.2 / 5.1
Population / 33.9 / 20.4 / 14.2 / 8.1 / 7.2 / 6.2 / 4.1 / 3.7 / 2.2
American League / 32.7 / 17.6 / 12.6 / 9.8 / 7.4 / 6.4 / 4.9 / 5.6 / 3.0
Reader's Digest / 33.4 / 18.5 / 12.4 / 7.5 / 7.1 / 6.5 / 5.5 / 4.9 / 4.2
He published his findings in 1938 in the Proceedings of the American Philosophical Society, and the article drew so much attention that the law became known as "Benford's Law" even though Newcomb had discovered it first. Incidentally, it likely drew that attention because of its placement in the journal, adjacent to a now-famous article on the scattering of electrons. While Benford didn't provide a proof for this law, his empirical evidence was compelling enough for many people to consider it a valid "rule of thumb."
Over the next few decades, a number of papers added important corollaries to the law. Perhaps the most important finding was that the law is scale-invariant. That is, if a set of numbers adheres to Benford's Law, then multiplying every number in the set by the same nonzero constant yields another "Benford set." This matters because currency conversions, or unit conversions such as from inches to centimeters, won't affect the expected distribution of first digits. Moving forward to more recent history, the law was finally given a statistical proof by Theodore Hill in 1996 in his paper titled "A Statistical Derivation of the Significant-Digit Law," which you can read for yourself at www.math.gatech.edu/~hill/publications.
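Scale invariance is easy to see numerically. The Python sketch below is an illustration under stated assumptions, not a proof: it builds an artificial set of "lengths in inches" spread evenly on a log scale (one simple way to get a Benford-like set), converts it to centimeters, and compares the first-digit shares of both with the law. The data and the helper names are invented for the example.

```python
import math
from collections import Counter


def first_digit(x: float) -> int:
    """First significant digit of a nonzero number, e.g. 876 -> 8 and 0.0876 -> 8."""
    # Scientific notation puts the first significant digit first; 17 significant
    # digits avoid rounding a value up into the next power of ten.
    return int(f"{abs(x):.16e}"[0])


def first_digit_shares(values):
    """Share of values starting with each digit 1-9 (list index = digit - 1)."""
    counts = Counter(first_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]


# Hypothetical data: 10,000 lengths in inches, evenly spaced on a log scale
# across four orders of magnitude.
n = 10_000
inches = [10 ** (4 * i / n) for i in range(n)]
centimeters = [2.54 * x for x in inches]

expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
inch_shares = first_digit_shares(inches)
cm_shares = first_digit_shares(centimeters)

print("Digit / Benford / Inches / Centimeters")
for d in range(1, 10):
    print(f"{d} / {expected[d - 1]:.3f} / {inch_shares[d - 1]:.3f} / {cm_shares[d - 1]:.3f}")
```

Both columns stay close to the Benford column, even though every individual number changed when it was multiplied by 2.54.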
An intuitive explanation
Just in case you're not inclined to wade through Dr. Hill's paper, here's a brief heuristic explanation of why Benford's Law holds. In short, things start small and tend to grow at a rate proportional to their size. For example, consider the month-to-month balance of a savings account in which your initial balance is $900 and the interest rate is 6 percent, compounded annually. Barring any withdrawals or deposits, it would take just two years to reach a value of $1,000, thereby changing the value of the first digit. During that time, your account balance would be in the $900s, so the first digit would be 9. After the balance has broken into the "thousands," the first digit of the balance will be 1 until the balance reaches $2,000. At 6 percent annual interest, this would take an additional 12 years. You can see that a set of data consisting of your monthly balances over the 14-year period would have many more numbers starting with a 1 than with a 9.
More importantly, a dataset of all the accounts at a bank at a particular point in time would likely display the same distribution (unless the bank had just opened). This has nothing to do with exponential growth per se; rather, it has to do with our number system. If an entity is growing (or shrinking) at any rate, it will tend to move through the numbers starting with 7, 8, and 9 faster than it will the numbers starting with 1, 2, and 3. In general, be it bank accounts, rivers, stock prices, or galaxies, the universe is filled with measurable objects, and there tend to be more smaller ones than bigger ones, and those that are growing tend to grow in ways that cause their measurements to adhere to Benford's Law.
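The savings-account example can be checked with a short simulation. This Python sketch assumes that interest is credited once at the end of each year and that the balance is recorded at the start of every month for 14 years; those details, and the variable names, are assumptions made for the illustration.

```python
from collections import Counter


def first_digit(x: float) -> int:
    """First significant digit of a nonzero number."""
    return int(f"{abs(x):.16e}"[0])


# Assumptions: $900 opening balance, 6% interest credited once a year,
# no deposits or withdrawals, one observation per month for 14 years.
opening_balance = 900.0
annual_rate = 0.06
months = 14 * 12

# m // 12 is the number of full years elapsed, i.e. how often interest was credited.
balances = [opening_balance * (1 + annual_rate) ** (m // 12) for m in range(months)]
digit_counts = Counter(first_digit(b) for b in balances)

for digit in sorted(digit_counts):
    print(f"first digit {digit}: {digit_counts[digit]} monthly balances")
```

Under these assumptions, 24 of the 168 monthly balances start with 9 and the remaining 144 start with 1, which is the imbalance the example describes.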
Extensions to the law
A more general form of Benford's Law may be applied to all the digits in a number as shown in Formula 2.
You can see an example in Formula 3.
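In standard notation (an assumption about how the formulas were typeset), the general significant-digit law and one worked instance of it can be written as:

```latex
% General law: d_1 is in 1..9, and d_2, ..., d_k are in 0..9
P(D_1 = d_1, \ldots, D_k = d_k)
  = \log_{10}\!\left(1 + \frac{1}{\sum_{i=1}^{k} d_i \cdot 10^{\,k-i}}\right)

% Worked instance: the first three significant digits are 3, 1, 4
P(D_1 = 3,\; D_2 = 1,\; D_3 = 4)
  = \log_{10}\!\left(1 + \frac{1}{314}\right) \approx 0.0014
```

With k = 1, the general form reduces to the first-digit law behind Table 1.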
This means we can look at first digit, second digit, first and second digit, and so on and have expected distributions to compare to.
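To make those expected distributions concrete, the following Python sketch computes the probability of any leading digit string and, from that, the second-digit distribution. It assumes the standard general form of the law, log10(1 + 1/n) for leading digits n; the function names are invented for the illustration.

```python
import math


def leading_digits_probability(leading: int) -> float:
    """Benford probability that a number's significant digits begin with `leading`.

    `leading` is the digit string read as an integer, e.g. 314 for the digits 3, 1, 4.
    """
    if leading < 1:
        raise ValueError("leading digits must form a positive integer")
    return math.log10(1 + 1 / leading)


def second_digit_probability(d: int) -> float:
    """Probability that the second significant digit is d (0-9)."""
    # Sum over every possible first digit 1-9 followed by d.
    return sum(leading_digits_probability(10 * first + d) for first in range(1, 10))


# Expected distribution of the first two digits together (10 through 99).
first_two = {n: leading_digits_probability(n) for n in range(10, 100)}

# Expected distribution of the second digit alone.
second = [second_digit_probability(d) for d in range(10)]

print("Second digit / Probability")
for d, p in enumerate(second):
    print(f"{d} / {p:.3f}")
```

The second-digit distribution is much flatter than the first-digit one, ranging from roughly 12 percent for 0 down to about 8.5 percent for 9, so tests on later digits need more data to show a clear departure.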
Not a one-law-fits-all solution
It's important to note that Benford's Law doesn't apply to every conceivable set of numbers. This should be obvious. For example, the set of all ZIP codes of the West-Coast states doesn't include any that start with the digit "1." In general, the law doesn't apply to assigned numbers. The numbers must be a representation of the size of some physical, or at least measurable, entity. So one wouldn't expect the primary keys of a table or the phone numbers out of a phone book to conform to Benford's Law. Here are some general guidelines for predicting if a given set of numbers might conform to Benford's Law:
• The data should describe the size of a set of similar entities.
• The data should be "uncoerced"; that is, it shouldn't have any built-in maximums or minimums (zero as a minimum is acceptable, however). Also, there shouldn't be any "special" numbers. For example, in expense report data, the allowed per-diem amount may represent an artificial limit that skews the data.
• The data shouldn't be assigned numbers. As the first guideline stipulates, the data should be related to size measurements of some kind. Assigned numbers don't meet this criterion.
• The entities the data describes should include more small items than big items.
If your data doesn't fit these criteria, you can still make use of Digital Analysis techniques. In some cases, the data may still approximate Benford's Law. Even if it doesn't, knowing why it doesn't and establishing your own expected distributions of digits based on historical data can lead you to effective tests for anomalous data as well.
Putting the law into practice
At first glance, Benford's Law looks like a useless bit of mathematical trivia. In fact, this was the general consensus for quite some time. Apart from bilking unsuspecting people in wagers on the first digit in the first number found on a page of their choice in the Farmer's Almanac, this information seems to have little practical use. However, consider that people making up numbers won't be likely to replicate the Benford distribution in their fabricated numbers. In fact, it's been shown that 6 and 7 seem to be favorite first digits for numbers made up at "random" by human beings. As such, a set of fictitious numbers wouldn't display the characteristic Benford distribution, and this fact might be an important "filter" for identifying fraudulent data. The basic idea is to check the "weekly expense report data from salesperson xyz" against the expected Benford distribution and note any discrepancies. Keep in mind there may be aspects of expense data that may throw the numbers off even for legitimate data, but these should be known factors and can likely be accounted for. Remember, Digital Analysis is a means of discovering where to look for fraud; it shouldn't be used as the sole identifier of fraud.
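One common way to quantify "note any discrepancies" is a chi-square goodness-of-fit statistic on the first digits. The Python sketch below shows that idea; the choice of test, the 5 percent threshold, the function names, and both sample datasets are assumptions made for the illustration, not a method prescribed by the article.

```python
import math
from collections import Counter


def first_digit(x: float) -> int:
    """First significant digit of a nonzero number."""
    return int(f"{abs(x):.16e}"[0])


def benford_chi_square(values) -> float:
    """Chi-square statistic comparing observed first digits with Benford's Law."""
    amounts = [v for v in values if v != 0]  # zero has no significant digit
    observed = Counter(first_digit(v) for v in amounts)
    n = len(amounts)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (observed[d] - expected) ** 2 / expected
    return statistic


# 5% critical value of the chi-square distribution with 8 degrees of freedom
# (nine digit classes minus one).
CRITICAL_VALUE_5_PERCENT = 15.51


def needs_a_closer_look(values) -> bool:
    """True if the first digits deviate from Benford's Law more than chance suggests."""
    return benford_chi_square(values) > CRITICAL_VALUE_5_PERCENT


# Hypothetical data: amounts spread evenly on a log scale, and a "made-up" set
# whose amounts all start with 6 or 7.
plausible = [10 ** (3 * i / 300) for i in range(300)]
made_up = [600 + i for i in range(200)]

print(needs_a_closer_look(plausible), needs_a_closer_look(made_up))
```

A large statistic only says where to look. Legitimate data with known quirks, such as a per-diem limit, can also exceed the threshold, so a flagged set is a starting point for review, not evidence of fraud.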
Given that Benford's Law gives us some predictions about first, second, and other digit distributions in data, how can we check our data against it? The first step in this process is to determine what the digit distributions are in a given set of numbers. The following is a function that will determine particular significant digits of a number. Care must be taken to ensure currency values are handled correctly. We're using Pradip Acharya's xVal() function (published in the February 2002 issue of FoxTalk) to allow this function to take non-numeric inputs and convert them to numeric values intelligently.
function digit( tvNumber, tnSig, tnLen )
  * returns tnLen significant digits of a number,
  * starting with the tnSig'th significant digit
  local lcNum, lnNum
  *-- default to a single digit
  if vartype( tnLen ) # "N"
    tnLen = 1
  endif
  *-- standardize vartype
  lnNum = xVal( tvNumber )
  *-- convert currency to numeric
  if vartype( lnNum ) = "Y"
    lnNum = mton( lnNum )
  endif
  *-- standardize mantissa: scale so the first significant
  *-- digit comes first once the decimal point is removed
  if !empty( lnNum )
    lnNum = abs( lnNum / ;
      10^( round( log10( abs( lnNum ) ) - 1, 0 ) ) )
  endif
  *-- drop the decimal point, then pick out the digits
  lcNum = strtran( transform( lnNum ), "." )
  return val( substr( lcNum, tnSig, tnLen ) )
endfunc
For example,
? Digit( 1234.567,1,1 ) && prints 1
? Digit( 1234.567,2,4 ) && prints 2345
? Digit( 0.0876,2,2 ) && prints 76
With this function (actually, it's a method of a "BenfordCalc" class provided in the Download file), one can determine the particular digits in certain locations in any number. If used over an entire set of records while keeping track of the counts of each type of digit in each location, a distribution of digits can be obtained from any set of numbers. In the end, you have a relative frequency table for the set of data at hand that looks like Table 1, but reflects the distribution of digits in your data. Here's an example of how to extract such a distribution, using the orders.dbf table provided with the TasTrade sample dataset.