ESA/STAT/AC.84/9

6 July 2001

English only

Symposium on Global Review of 2000 Round of

Population and Housing Censuses:

Mid-Decade Assessment and Future Prospects

Statistics Division

Department of Economic and Social Affairs

United Nations Secretariat

New York, 7-10 August 2001

Adapting new technologies to census operations*

Arij Dekker**

* This document was reproduced without formal editing.

** The Netherlands. The views expressed in the paper are those of the author and do not imply the expression of any opinion on the part of the United Nations Secretariat.

Adapting new technologies to census operations

by

Arij Dekker

Specialist in Census Technology

Paper prepared for

The Expert Group Meeting on

Global Review of 2000 Round of Population and Housing Censuses:

Mid-decade Assessment and Future prospects

United Nations Statistics Division

New York, 7-10 August 2001


Contents

1. Introduction

2. Management, communication, logistics, quality assurance

3. Data capture

3.1 Intelligent Character Recognition (ICR)

3.2 Automatic coding

3.3 Outsourcing and decentralization

4. GIS, remote sensing and GPS

5. Data processing and storage

5.1 Census processing software

5.2 Data storage

6. Use of the Internet

6.1 The Internet for data collection

6.2 The Internet for data dissemination

7. Data dissemination – other issues

7.1 Statistical disclosure control

7.2 High-capacity physical media

7.3 Structured archives – the statistical data warehouse

8. How to choose appropriate technology

9. More

10. Conclusions

11. Discussion

References

Glossary


1. Introduction

It is commonly known that the art of population census taking goes back many centuries. Ever since the end of the 19th century, there have been efforts to take advantage of a succession of newly available technologies to make such large and costly statistical enquiries more efficient and effective. A census is labor-intensive, requiring large numbers of temporary staff. Personnel costs usually are the principal component of census budgets, with expenditure for information and communication technology coming second.

Even small improvements in the methodologies used, or in the effectiveness of the equipment, can result in important gains in the quality of the whole operation, reductions in its expense, or both. Census budgets depend on national cost levels and the depth of the enquiry, but generally range from a few dollars per capita in low-cost countries to as much as 30 dollars per capita in highly developed environments. A rough estimate of the total expense of the current Round of Censuses would put it between 30 and 50 billion dollars. This is certainly an enticing target for those trying to improve value for money.

The name of Herman Hollerith stands out as an early adaptor of modern technology to census work. He borrowed from the ideas of Joseph-Marie Jacquard, who had invented punched cards to control looms. Hollerith saw a way to use such cards in sorting and tabulation. By doing this he not only expedited the release of the results of the 1890 US census; he started an entire industry.

There have been many lesser-known census innovators who have put newly discovered methods and technology to good use. Information technology has usually been at the forefront of these efforts. Census data processing equipment has graduated from machines that merely assisted tabulation work to indispensable tools in virtually all phases of census work. Computers are used for planning, to support mapping, in project management, in all stages of data capture, cleaning, coding, and reporting, and in demographic analysis [De97]. Many of the recent improvements in census taking have been possible thanks to the ever-growing capabilities of data processing equipment and communication networks operating on local, national, and world-wide levels. For the sake of continuity it is important that the use of newer technology is embedded into, and builds upon, existing sound methodology [UN98].

There are presently several important efforts to bring co-ordination and focus to the innovation process in official statistics and census taking. One is the Paris21 initiative: Partnership in Statistics for Development in the 21st Century. The members of Paris21 – there are several hundred of them – are drawn from leading national and international statistical agencies, academic institutions, etc. One of the several issues currently being reviewed by the experts combining their efforts under the Paris21 initiative is how census work can be made more cost-effective. See the web site for details [PaWW].

The United Nations Statistics Division has a long history of furthering sound statistical principles and the sharing of know-how. A web site giving access to information on good statistical practices has recently been opened [UNWW]. On a regional scale, Eurostat has conducted a series of technical seminars under the names NTTS (New Techniques and Technologies for Statistics) and ETK (Exchange of Technology and Know-how). The 2001 meetings on these issues were held in combined form on Crete, Greece, last June.

Noteworthy also is the Eurostat web site called VIROS, the Virtual Institute for Research in Official Statistics [EuWW1]. VIROS identifies and classifies areas of research where participating organizations may place the results of their studies and experiences, while remaining entirely responsible for them. Eurostat acts as a central co-ordinator, attempting to integrate the individual elements into a coherent set. The ultimate goal is to facilitate access to information on research activities and results. Eurostat is naturally interested in such issues, facing, as it does, the need to combine many statistical traditions and to overlay them where possible with state-of-the-art integration technology.

When considering the technological options before them, census offices face a number of questions. Some of these are:

  • how to make an informed choice in selecting appropriate technology;
  • how to maintain the integrity of the existing statistical and census systems;
  • how to deal with the option of outsourcing[1], and management of outsourced tasks;
  • confidentiality concerns relating to the preferred solutions.

This paper will look briefly at various areas where census work has recently benefited from new technology, and will discuss the issues referred to above. Definite answers to the questions raised can be formulated only by individual census organizations themselves.

2. Management, communication, logistics, quality assurance

A nation-wide census differs in many respects from day-to-day statistical work. It lacks the repetitive nature that allows more frequent collections to be improved gradually. The level of expenditure and number of staff are much higher than statistical managers are used to. Some governments therefore establish census offices separate from the national statistical agency. It may be necessary to recruit professional management, experienced in dealing with large but temporary organizations. Since a census can be seen as a large time-critical project, with many interlocking operations, the use of modern project management software is of vital importance.

A census operation requires efficient communication between (many) thousands of persons, as well as procurement and storage of a large variety of items, most of which have to be distributed to all corners of the country, and then recollected.

Recent developments in mobile telephony (cell phones) have made person-to-person communication easier, even in countries with extensive and reliable fixed-line networks. But complete mobile coverage has not been accomplished in most developing countries. Census communication with remote areas continues to be problematic in some cases. It is still possible that satellite telephone systems, which function everywhere on earth, will fill this void. Some ambitious projects in this domain, such as that known as “Iridium,” have not drawn enough initial subscribers. But with most of the enormous investment costs now written off, user prices are coming down. The ground stations, including antennas, are still rather bulky, but completely portable. Operations planners need to be cognizant of all communications options open to them, including the regional differences, and make arrangements accordingly.

Where printed or printable communication is required, fax technology is rapidly giving way to electronic mail. This is true for census operations too, but relying on e-mail entails vulnerability to Internet service interruptions, computer illiteracy, and virus attacks. It is important to always keep a fax capability for backup.

Improved computer software and the wide availability of PCs have made managing the movement of goods much easier. Bar-code technology can be a key element in this. Using bar codes instead of printed numbers has advantages in avoiding transcription errors and speeding up processing. A combination of the two can be used if easy human recognition of the codes may also sometimes be required. Census managers, who are not logistics professionals, tend to overlook this established technology.

A typical application of bar-code technology is to label all items specific for a particular enumeration area (maps, enumerator ID, summary sheets, transport box) with a specific bar code. At the point where the materials are sent out, the codes will be scanned, allowing automatic update of a database of items forwarded. The same process can be used to maintain a database of items retrieved from the field.
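The dispatch and retrieval process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ShipmentTracker`, the status values, and the bar-code strings are hypothetical, and a real census office would use a proper database fed by scanner hardware rather than an in-memory dictionary.

```python
from typing import Dict, List, Optional


class ShipmentTracker:
    """Hypothetical tracker for bar-coded items sent to and returned from the field."""

    def __init__(self) -> None:
        # Maps bar code -> {"ea": enumeration area, "item": item type, "status": state}
        self.items: Dict[str, Dict[str, str]] = {}

    def register(self, barcode: str, ea_code: str, item_type: str) -> None:
        """Record a newly labelled item held in the central store."""
        if barcode in self.items:
            raise ValueError(f"duplicate bar code: {barcode}")
        self.items[barcode] = {"ea": ea_code, "item": item_type, "status": "in_store"}

    def scan_out(self, barcode: str) -> None:
        """Scan an item at the dispatch point."""
        self._move(barcode, expected="in_store", new="dispatched")

    def scan_in(self, barcode: str) -> None:
        """Scan an item when it comes back from the field."""
        self._move(barcode, expected="dispatched", new="returned")

    def _move(self, barcode: str, expected: str, new: str) -> None:
        record = self.items.get(barcode)
        if record is None:
            raise KeyError(f"unknown bar code: {barcode}")
        if record["status"] != expected:
            raise ValueError(f"{barcode} is '{record['status']}', expected '{expected}'")
        record["status"] = new

    def outstanding(self, ea_code: Optional[str] = None) -> List[str]:
        """List bar codes dispatched but not yet returned, optionally for one area."""
        return sorted(
            code
            for code, rec in self.items.items()
            if rec["status"] == "dispatched" and (ea_code is None or rec["ea"] == ea_code)
        )


# Invented example: two items for enumeration area EA0417 go out, one comes back.
tracker = ShipmentTracker()
for code, item in [("EA0417-MAP", "map"), ("EA0417-BOX", "transport box")]:
    tracker.register(code, "EA0417", item)
    tracker.scan_out(code)
tracker.scan_in("EA0417-BOX")
print(tracker.outstanding("EA0417"))  # lists what is still in the field
```

Rejecting a scan that does not fit the expected status (for example, an item returned that was never sent out) is one simple way to surface handling errors at the moment of scanning rather than during later reconciliation.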

Labeling individual questionnaires with unique codes can also be helpful, although the resulting administrative overhead is considerable. Such identifiers can protect against the fairly common problem that entire batches of questionnaires arrive back erroneously geocoded. Standard retail scanners, as well as most intelligent character recognition systems (see Section 3.1), will read bar codes without difficulty.

Quality assurance, including the use of scientifically sound sampling methods, should be an integral part of all census operations. Many of the methods in this field depend on statistical principles, and have been developed by statistical innovators [De86]. The census office must strive for a consistent level of assured quality throughout its operations, and cannot afford to disregard the techniques that help to achieve and verify it [SS01].

3. Data capture

3.1 Intelligent Character Recognition (ICR)

It is probably true to say that the current Round of Censuses has seen the breakthrough of ICR technology. In the 1985-1994 Round only about 20% of countries undertaking censuses used some form of character or mark recognition [De94]. The large majority still relied on keyboard data capture. In the current Round nearly all census offices of industrial market economies, and numerous others, apply imaging through scanners, recognition software, and whatever else is required to do away (at least partially) with manual data entry.

There is no doubt that recognition technology has made great strides in the last decade, but it seems true also that the example provided by census “pioneers” has made switching course easier for those organizations that otherwise might have hesitated. ICR offers a promise of greater efficiency, but is inherently riskier than keyboard data entry. For example: poorly designed or badly printed questionnaires are a nuisance in manual data entry, but may sink a planned ICR data capture operation. The need for elaborate pre-tests, already so obvious in traditional census taking, is even more apparent when scanning technology is to be used.

The main fundamental problem that remains is that handwritten characters are often poorly recognized where the writer is not already familiar to the recognition system. In censuses which use auto-response or a large number of enumerators, this obviously is the case. To avoid the problem, it is possible to limit the automatic recognition to marks or numeric digits only. But even digits cannot always be reliably interpreted, so quite a few manual data-entry personnel will still be required to fill the gaps.

Scattered information suggests that the ICR process does not always proceed as smoothly as anticipated. Experience gained during the final operations tests induced the US Bureau of the Census to move from a one-pass to a two-pass processing system, where sample data from the long forms will only be computer-stored during a second capturing operation [Pr00]. This change of approach has had no effect on processing deadlines. Some European countries (for example: Estonia) have reported difficulties in recognizing handwritten alphabetic characters, requiring them to hire additional staff to assist the automatic recognition process. A recent meeting in Bangkok [UN01] heard about problems of varying severity in Thailand, the Philippines, China, Macao SAR, and Indonesia[2]. For information on the details of the problems experienced, retrieve the country papers from the Web site referred to.

In Thailand, earlier plans to establish 15 regional ICR centers for the April 2000 Census were cancelled after more sophisticated (and expensive) scanners and software turned out to be required. A single ICR complex now operates in Bangkok (Fujitsu 4099 scanners, TeleForm software). Some problems were reported with poorly written characters and scanner maintenance.

The May 1st, 2000 Census of the Philippines works with four decentralized capturing centers, using Kodak 3590 scanners and Eyes and Hands software. One of the biggest problems here is that the print quality of some questionnaires is not in accordance with specifications, which causes the ICR software to tag them as unidentifiable. Another difficulty is illegible handwritten entries. The number of verification licenses, required to manually correct such rejects, had been underestimated. This has been a learning process. Experiences are sufficiently positive to use ICR again for the upcoming Census of Agriculture and Fisheries.

China, Macao SAR reports good results for its pilot operation for the 2001 Census. The paper contains an interesting table, obtained from a sample of 150,000 images of digits. The table does not immediately confirm the effectiveness of ICR as implemented. It would seem useful to provide training to enumerators on how best to write certain numerals, notably 9, 8, and 3, which show the highest reject rates.

Digit                      0       1       2       3       4       5       6       7       8       9     All
Recognition rate (%)   94.83   96.83   94.92   91.11   96.00   94.95   97.29   97.72   90.43   81.74   95.64
Reject rate (%)         5.17    3.17    5.08    8.89    4.00    5.05    2.71    2.28    9.57   18.26    4.36
Accuracy rate (%)      99.38   99.89   99.78   99.73   99.89   99.41   99.79   99.59   99.12  100.00   99.72
Error rate (%)          0.62    0.11    0.28    0.27    0.11    0.59    0.21    0.41    0.88    0.00    0.28
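To make the table's "All" column more tangible, the following sketch converts the rates into approximate counts for the 150,000-image sample. It rests on an assumption that the source does not state explicitly: that the accuracy and error rates are measured among the digits the software accepted (recognized), not among all images. The function name `icr_workload` is introduced here for illustration only.

```python
from typing import Dict


def icr_workload(images: int, reject_rate_pct: float, error_rate_pct: float) -> Dict[str, float]:
    """Estimate manual-keying workload and undetected errors from ICR rates.

    Assumes the error rate applies only to accepted (non-rejected) images.
    """
    rejected = images * reject_rate_pct / 100.0          # sent to manual correction
    accepted = images - rejected                         # taken over automatically
    undetected_errors = accepted * error_rate_pct / 100.0  # accepted but wrong
    return {
        "rejected": rejected,
        "accepted": accepted,
        "undetected_errors": undetected_errors,
    }


# "All" column of the table: reject rate 4.36%, error rate 0.28%.
overall = icr_workload(150_000, reject_rate_pct=4.36, error_rate_pct=0.28)
for name, value in overall.items():
    print(f"{name}: about {value:,.0f} images")
```

Under that assumption, roughly one digit image in twenty-three needs human attention, while the errors that slip through unnoticed are an order of magnitude fewer. The same function can be applied to a single digit column, for instance per 1,000 images of the digit 9, to see where enumerator training would pay off most.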

ICR for the July 1st, 2000 Census of Indonesia is handled by 29 processing centers throughout the country, using Kodak DS 3500 scanners and NCS NestorReader recognition software embedded in in-house Visual Basic programs. The country paper reports many troubles that hamper the census ICR operation. These include sub-standard questionnaire printing (despite elaborate quality controls), poor writing by enumerators, inadequate document handling in the field resulting in unusable forms, scanner maintenance problems, and complex file management. The authors deserve the highest praise for sharing these experiences for others to learn from. The massive nature of the operation in Indonesia, scattered civil unrest, financial constraints, and various logistics problems have obviously all been a factor here. Despite the difficulties, CBS Indonesia is confident that the data capture operation will be completed successfully.

The October 2000 Census of Aruba (not reported in Bangkok) used Fujitsu M3079DG scanners and Eyes and Hands software. All data for this small country of about 100,000 people were captured by April 2001. The operation was quite carefully prepared, and proceeded smoothly, including the integrated computer-assisted coding work. There were no cost advantages compared to keyboard data entry.

Such problems as are reported can be divided into those that have to do with the recognition process itself, and all other ones. If the recognition rate is unacceptably low, this can usually be remedied by reducing the pre-set security level. But there is a price to pay: error rates will go up. Other problems may include unreliable paper transport in the scanners, which can have plenty of causes, including dirt, the use of “white-out” on sheets, and damaged forms, possibly as a result of bad weather conditions. It is not unheard of that such difficulties require large numbers of questionnaires to be transcribed, again increasing error rates.
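The trade-off between reject rate and error rate can be illustrated with a small sketch. It assumes that the "security level" behaves like a confidence threshold: each recognized character carries a confidence score, and characters scoring below the threshold are rejected for manual handling. Actual ICR products may implement this differently. The ten sample results below are invented for illustration and do not come from any census.

```python
from typing import List, Tuple


def rates_at_threshold(results: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Return (reject rate, error rate) in percent for a given confidence threshold.

    Each result is (confidence score, whether the recognized character was correct).
    The error rate is computed among accepted characters only.
    """
    accepted = [correct for confidence, correct in results if confidence >= threshold]
    rejected = len(results) - len(accepted)
    reject_rate = 100.0 * rejected / len(results)
    error_rate = 100.0 * accepted.count(False) / len(accepted) if accepted else 0.0
    return reject_rate, error_rate


# Invented sample: high-confidence readings tend to be correct, low ones less so.
sample = [
    (0.99, True), (0.97, True), (0.95, True), (0.91, True), (0.88, True),
    (0.84, False), (0.80, True), (0.72, True), (0.65, False), (0.51, False),
]

for threshold in (0.90, 0.75, 0.50):
    reject_rate, error_rate = rates_at_threshold(sample, threshold)
    print(f"threshold {threshold:.2f}: reject {reject_rate:.1f}%, error {error_rate:.1f}%")
```

In this toy data set, each lowering of the threshold reduces the share of rejects and raises the share of wrong characters among those accepted, which is the price referred to above.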

As a general rule, success is often reported by census offices that went through a long and careful preparation process, including several pre-tests. Those that have to cut the groundwork short may become the source of less fortunate stories. Complete quality assurance management, for example in the printing process of the questionnaires, is of the essence here.

If recognition of handwritten text is now becoming a more reliable tool, it would be logical to think of speech recognition as the next step. After all, this is a more direct method of data collection. Speech recognition has broad economic potential, and is a topic of much research. Some commercial applications of this technology are appearing, especially in processing verbal instructions received by telephone, and in the automotive industry. But progress in this area has been slower than expected. Statistical applications are still rare.

3.2 Automatic coding

The purpose of recognizing verbal text is usually to support associated automatic coding. That is, the computer reads a text, for example the name of a geographic area, and then selects the applicable code from an associated file or database.

Such solutions, which ideally would allow completely automatic data capture and coding, depend on two pre-requisites: (i) the recognition process must be sufficiently reliable, and (ii) the search algorithms must indeed lead from the recognized term(s) to the appropriate code. A 100% character recognition rate is not required, since the algorithm may still be successful with incomplete or partially mangled terms.
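As a minimal sketch of how a search algorithm might cope with partially mangled terms, the example below uses approximate string matching from Python's standard `difflib` module. The paper does not describe any particular algorithm, so this is a swapped-in illustration. The codebook is a toy one: the place names are from Aruba, but the numeric codes, the function name `auto_code`, and the similarity cutoff of 0.8 are assumptions made for this example.

```python
import difflib
from typing import Dict, Optional

# Illustrative codebook: the numeric codes are invented, not an official classification.
PLACE_CODES: Dict[str, str] = {
    "oranjestad": "01",
    "san nicolas": "02",
    "santa cruz": "03",
    "savaneta": "04",
}


def auto_code(text: str, codebook: Dict[str, str], cutoff: float = 0.8) -> Optional[str]:
    """Return the code for a recognized term, or None if a human coder should decide."""
    key = " ".join(text.lower().split())  # normalise case and spacing
    if key in codebook:
        return codebook[key]
    # Tolerate partially mangled terms by looking for close spellings.
    matches = difflib.get_close_matches(key, list(codebook), n=2, cutoff=cutoff)
    if len(matches) == 1:
        return codebook[matches[0]]
    return None  # no candidate, or more than one plausible candidate


# An exact term, two misread terms, and a term absent from this toy codebook.
for recognized in ("Oranjestad", "Santa Crvz", "Sav neta", "Noord"):
    print(recognized, "->", auto_code(recognized, PLACE_CODES))
```

Returning no code when there are zero or several close candidates is a deliberately cautious design choice: it sends doubtful cases to a human coder instead of guessing, at the cost of a higher manual workload.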

However, there are indeed problems with this process. First there is the recognition reject rate, as referred to above, which might require an unexpected level of human intervention. Next comes the difficulty of automatically determining the applicable codes, the severity of which depends on the nature of the variable concerned. Geographic terms are usually not too difficult to code automatically, except perhaps for the lowest level (e.g. village), where spelling may not be standardized and homonyms occur. Occupation and Industry tend to be more problematic. Despite the efforts by census field staff to extract full information from respondents, these variables will often be reported in terms that cannot be easily linked to ISCO, ISIC or NACE codebooks.