Data Mining of Official Data Bases

Mirjana Pejic-Bach, Ksenija Dumicic

University of Zagreb, Faculty of Economics
Trg J.F.Kennedya 6
10000 Zagreb, Croatia

1.  Introduction

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner (Hand et al., 2001). Data mining techniques have existed for a number of years, and their roots can be traced back along three family lines: classical statistics, artificial intelligence, and machine learning. In the last ten years, data mining has become one of the most hyped topics in the business world. However, public organizations have only recently begun to use data mining (Cahlink, 2000; Carbone, 1998).

Like business organizations, public organizations make widespread use of information systems that include databases (Dumičić and Dumičić, 1999), and these databases have recently grown explosively in size. Public decision makers are therefore faced with the problem of making use of the stored data. The goal of this paper is to investigate the possibility of using data mining to explore official data bases as a tool for improving the efficiency of public organizations.

2.  Data mining applications using official data bases

Applications of data mining using official data bases for public organizations were identified through an Internet search (Google) using the keywords: data mining, government, and public. The following databases were also searched: Emerald, EBSCOhost, ProQuest, Science Direct, Springer Verlag, Kluwer, Engineering Village 2 & Compendex, Wiley Interscience, and ProQuest Digital Dissertations.

One should be aware that it is not possible to find every single data mining application in public organizations by searching scientific databases or the Internet. However, the presented survey can give substantial insight into the current practice of data mining in public organizations. We found 34 applications of data mining in public organizations. The oldest application was described in 1996. Most of the applications (64.5%), however, are described in articles published in 2003, which suggests that the application of data mining in public organizations is growing rapidly.

Finance and economy (29%) had the largest number of applications, followed by healthcare (24%) and criminal justice and defence (24%). The other areas examined were education (9%), labour and social welfare (6%), e-government (6%), and transport (3%), each of which has a rather small number of applications.

Applications are described on business web sites, on news web sites, in scientific journals, and in working papers. Most of the applications (62%) are described on business web sites, where the leader is SPSS, followed by IBM. It should be emphasized that only descriptions of particular applications on these web sites, and not advertisements, were taken into account. Only 21% of the applications are described in scientific journals, and 9% are found on news web sites or described in working papers.

The method used is described in only 18 sources. Classification and prediction is used most often (44%), followed by evolution analysis (22%), concept/class description (17%), and outlier analysis (6%). Other methods, such as association analysis, are not reported as being used.

3.  Examples of applications

US government tax agencies use Clementine and Intelligent Miner to build predictive models that can improve collections management and audit selection by answering questions such as "Who is likely to become delinquent and by how much?" and "Which tax returns are likely to be non-compliant?" Neural networks have provided valuable insights for analysts forecasting tax revenues. These forecasts are critically important, since agency budgets, support for education, and improvements to infrastructure all depend on their accuracy.
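The audit-selection question above is a classification problem, and a minimal sketch may make it concrete. The example below fits a logistic regression by gradient descent and ranks returns by predicted delinquency risk. It is only an illustration: the two features (number of late filings, debt-to-income ratio), the data, and all function names are invented for this sketch and do not come from the tools or agencies mentioned above.

```python
import math

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit a logistic regression with batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss with respect to z
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias

def predict_proba(weights, bias, x):
    """Return the estimated probability of the positive class."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic training data (hypothetical): (late filings, debt-to-income ratio)
rows = [(0, 0.1), (0, 0.2), (1, 0.3), (0, 0.4),
        (2, 0.6), (3, 0.7), (2, 0.9), (3, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = became delinquent

weights, bias = train_logistic(rows, labels)

# Rank hypothetical new returns from highest to lowest estimated risk.
new_returns = {"A": (0, 0.15), "B": (3, 0.8), "C": (1, 0.5)}
ranked = sorted(new_returns,
                key=lambda name: predict_proba(weights, bias, new_returns[name]),
                reverse=True)
print(ranked)
```

In practice the ranking, rather than a hard yes/no label, is what supports audit selection, since an agency can only examine a limited number of returns.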

Data mining is often used to detect health care fraud. The IBM Fraud and Abuse Management System, for example, is used to detect health care fraud and abuse, which ranks as one of the leading law enforcement frustrations in the USA.
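Fraud detection of this kind is commonly framed as outlier analysis. The sketch below flags unusually large claims with a modified z-score based on the median absolute deviation (MAD). It is an assumption-laden illustration, not a description of how the IBM system works: the claim amounts are invented, and the function name and the 3.5 cutoff are choices made for this example.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread, so no value can be scored
    # 0.6745 makes the MAD comparable to a standard deviation for normal data.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical claim amounts for one procedure code.
claims = [120, 135, 110, 140, 125, 130, 118, 2400, 128, 122]
flagged = mad_outliers(claims)
print(flagged)  # [7]
```

The median-based score is used here instead of the mean and standard deviation because a single extreme claim would otherwise inflate the spread and could mask itself.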

The Defense Advanced Research Projects Agency (DARPA) is developing a database called the Total Information Awareness System as part of its effort to track terrorists and their activities. However, many critics feel that the project is a threat to security, because such a massive database would be a very attractive target for hackers.

4.  Census data mining

Applications often use census data, which is one of the most comprehensive databases. Spatial data analysis is often employed, with common applications in the social sciences that range from the discovery of crime clusters and hot spots and the detection of disease clusters to the spatial autocorrelation of demographic variables and regression models for real estate analysis. Beyond those described, possible uses of census data are: (1) business statistics, with special mention of innovation policy and financial health, (2) household equipment and savings, (3) health statistics (mortality and morbidity) in order to detect unexpected risk factors, and (4) analysis of metadata information by means of text mining (Saporta, 2000).
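Spatial autocorrelation of a demographic variable is often summarized with Moran's I, which is positive when neighbouring areas have similar values and negative when they alternate. The sketch below computes it for a toy grid of census tracts. The grid values, the rook (shared-edge) neighbourhood definition, and the function names are assumptions made for illustration only.

```python
def rook_weights(n_rows, n_cols):
    """Binary weight matrix: cells sharing an edge are neighbours."""
    n = n_rows * n_cols
    w = [[0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    w[i][rr * n_cols + cc] = 1
    return w

def morans_i(values, weights):
    """Moran's I = (n / W) * sum_ij w_ij d_i d_j / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

w = rook_weights(4, 4)

# Hypothetical 4x4 grids, flattened row by row.
clustered = [1, 1, 0, 0] * 4                                  # high values on the left
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]

i_clustered = morans_i(clustered, w)
i_checker = morans_i(checkerboard, w)
print(round(i_clustered, 3), round(i_checker, 3))
```

The clustered grid yields a clearly positive value and the checkerboard a value of -1, which matches the intuition that clustering of similar tracts is what hot-spot analyses look for.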

5.  Conclusions

This paper has reviewed data mining of official data bases. Readers should be cautious in interpreting the results of the survey, since the findings are based on data collected from business web sites, journal articles, news web sites, and working papers. This approach was adopted because data mining applications in public organizations are still rarely described in journal articles. Nevertheless, we feel that even such a survey can describe the current state of the field.

References

Hand, D., Mannila, H., Smyth, P. (2001) Principles of Data Mining. Cambridge, MA. The MIT Press.

Cahlink, G. (2000) Data Mining Taps the Trends. Government Executive Magazine. http://207.27.3.29/tech/articles/1000managetech.htm

Carbone, P.L. (1998) Data Mining and the Government: Is There a Unique Challenge? The On-Line Executive Journal for Data-Intensive Decision Support. http://www.tgc.com/dsstar/98/0519/980519.html

Saporta, G. (2000) Data Mining and Official Statistics. Quinta Conferenza Nazionale di Statistica, ISTAT, Roma, 35-39.

Dumičić, S., Dumičić, K. (1999) Experience on Automated Coding of Occupation in Population Census in Croatia. Bulletin of the 52nd Session of the International Statistical Institute, Proceedings of the Contributed Papers, Tome LVIII, CD, Helsinki, Finland. http://www.stat.fi/isi99/proceedings/arkisto/contributed.html

Summary

The aim of this study is to present the various applications of data mining of official data bases for public organizations. A search of scientific databases and the Internet revealed that the majority of applications were described in the current year on business web sites. Finance and economy, healthcare, and criminal justice and defence are the most popular areas. The most frequently used methods are classification and prediction, concept/class description, and evolution analysis.