CHAPTER 6: DATA MANAGEMENT
(GIS: A Management Perspective – Stan Aronoff)
Pages 151 - 187
For an organization to function effectively, it requires accurate and timely information.
Based on this need, it is easy to see why the business community first adopted computer-based data storage and retrieval technology. In the 1960s large engineering projects like the space program required an enormous amount of inventory and a database system was used to manage this information.
Another application of the 1960s was the Sabre airline reservation system developed by IBM and American Airlines.
Since these early beginnings, the business community has invested heavily in database technology to gather and maintain information.
As the information systems field developed during the 1960s and 1970s, the concepts of the database and the database management system were developed and refined. Today, database management systems handle enormous databases such as the national census.
A database is the information to be stored whereas the database management system is the system used to manage the database.
Specifically, a database is a collection of information about things and their relationships to each other. An example is a database of names, addresses, and relationships (client, relative, friend).
The objective in collecting and maintaining information in a database is to relate facts and situations that were previously separate.
There are historically two approaches to database management. The first is the file processing approach and the second is the more recent database management system approach.
The file processing approach (Figure on page 152) required that the data be stored as one or more computer files that were accessed by special-purpose database software in whatever manner the designer believed to be most efficient.
File processing is the most common approach to using a database.
Drawbacks of the file processing approach:
Since each application program must directly access each data file that it uses, the program must know how the data in each file are stored. This can create redundancy because the instructions to access a data file must be present in each application program.
Another problem exists when data are shared by different application programs and by different users. If data files are accessed and modified by several programs and users, then there must be some overall control over which users are given access to the database and what modifications they are permitted to make. A lack of central control can seriously degrade the integrity of the data.
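The first drawback can be illustrated with a minimal sketch. The fixed-width record layout, the sample records, and the two "application programs" below are hypothetical and are not taken from the chapter. The point is only that both programs repeat the same knowledge of column positions, so a change to the file layout would break both.

```python
# Hypothetical fixed-width layout: name (10 chars), city (10 chars), balance (8 digits).
def make_record(name, city, balance):
    return f"{name:<10}{city:<10}{balance:08d}"

RAW_FILE = "\n".join([
    make_record("Adams", "Ottawa", 12050),
    make_record("Baker", "Toronto", 300),
])

def billing_total(raw):
    """Application 1: sums balances. It must know the balance sits in columns 20-27."""
    return sum(int(line[20:28]) for line in raw.splitlines())

def mailing_cities(raw):
    """Application 2: lists cities. It repeats the layout knowledge (columns 10-19)."""
    return [line[10:20].strip() for line in raw.splitlines()]

print(billing_total(RAW_FILE))
print(mailing_cities(RAW_FILE))
```

If the name field were widened, both functions would have to be edited, which is the maintenance burden that a DBMS is meant to remove.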
A database management system (DBMS) is comprised of a set of programs that manipulate and maintain the data in a database. This is the second approach: The DBMS Approach.
DBMS were developed to manage the sharing of data in an orderly manner and to ensure that the integrity of the database is maintained.
A DBMS acts as a central control over all interactions between the database and the application programs, which in turn interact with the user.
When the programs, such as order entry services or geographic analysis functions, require access to the database, the DBMS acts as the intermediary and supervisor.
One of the major benefits of a DBMS is that it provides data independence. The application program does not need to know how the data are physically stored because all access to the database is via the DBMS.
The application program issues a command to the DBMS that retrieves and "re-packages" the data into the format needed by the application. This greatly reduces the effort needed to maintain the application programs and the database. Many DBMS incorporate a direct user interface.
A DBMS is also used to tailor the style of information presented to the different users.
In Figure 6.3 (page 153), the same dataset is presented in two different ways depending on the needs of the users: the account executive view versus the inventory management view.
This ability to present the data in different ways is a very valuable function, and it does not require storing multiple copies of the same database.
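As a rough sketch of how one stored dataset can serve two user views, the example below uses Python's built-in sqlite3 module. The orders table, its columns, and the two view names are hypothetical; they are not the contents of Figure 6.3. The application asks the DBMS for rows and never deals with how the data are physically stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    customer   TEXT,
    item       TEXT,
    quantity   INTEGER,
    unit_price REAL
);
INSERT INTO orders VALUES
    (1, 'Acme',    'widget', 10, 2.5),
    (2, 'Acme',    'gear',    4, 5.0),
    (3, 'Bolt Co', 'widget',  6, 2.5);

-- Hypothetical account executive view: what each customer has been billed.
CREATE VIEW account_view AS
    SELECT customer, SUM(quantity * unit_price) AS total_billed
    FROM orders GROUP BY customer;

-- Hypothetical inventory management view: how many units of each item were ordered.
CREATE VIEW inventory_view AS
    SELECT item, SUM(quantity) AS units_ordered
    FROM orders GROUP BY item;
""")

# Both views are computed from the single orders table; no second copy is stored.
account_rows = conn.execute("SELECT * FROM account_view ORDER BY customer").fetchall()
inventory_rows = conn.execute("SELECT * FROM inventory_view ORDER BY item").fetchall()
print(account_rows)
print(inventory_rows)
```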
ADVANTAGES OF THE DATABASE APPROACH (over the file processing approach)
- Centralize Control: A single DBMS under the control of one person can ensure that data quality standards and the integrity of the data are maintained.
- Data can be Shared Efficiently: Using a DBMS, the information in a database can be shared in a flexible yet controlled manner. A DBMS also facilitates the development of new applications of the existing data.
- Data Independence: Application programs are independent of the physical form in which the data are stored.
- Easier Implementation of New Database Applications: New application programs and unique database searches can be more easily implemented using the services provided by a DBMS.
- Direct User Access: Database systems now commonly provide a user interface so that non-programmers can perform sophisticated analyses.
- Redundancy Can Be Controlled: In a file-processing environment, separate data files are used for each application and data is stored more than once. Excessive data redundancy is expensive. In addition, an effective strategy must be provided to update the multiple copies of the data. A DBMS can be used to monitor and reduce the level of redundancy, as well as manage the updating procedures.
- User Views: A DBMS can provide a convenient user interface to create and maintain multiple 'views' of the data.
DISADVANTAGES OF THE DATABASE APPROACH
- Cost: The database system software and any associated hardware can be expensive. At a minimum, they represent an additional acquisition and maintenance cost.
- Added Complexity: A database system is more complex than a file processing system. In theory, the more complex the system, the more susceptible it is to failure and the more difficult the recovery. In practice, full-featured DBMS are provided with effective backup and recovery systems.
- Centralized Risk: In centralizing the location of the data and reducing data redundancy, there is a greater theoretical risk of loss or corruption of data while running an application program. However, the backup and recovery procedures normally provided in a DBMS minimize the risks.
The first GIS used a file-processing database and many still do. However, the trend is increasingly towards the use of a DBMS, if not to manage all the data in a GIS at least to manage the non-spatial attribute components.
Virtually all commercial GIS now incorporate some form of DBMS.
DBMS TERMINOLOGY
Record: a small group of related data items stored together. One row in the table.
Field: a record is divided into fields. A field defines the attributes in the record.
Key: a label comprised of one or more fields and used as a search foundation.
Query: a search through the database.
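These four terms can be made concrete with a small sketch. The student records, field names, and the query helper are hypothetical and are used only for illustration.

```python
# Each dict is one record (one row); its keys are the fields.
records = [
    {"student_id": 101, "name": "Lee",   "year": 2},
    {"student_id": 102, "name": "Patel", "year": 1},
    {"student_id": 103, "name": "Gray",  "year": 2},
]

# A key: the student_id field is used as the basis for direct lookup.
index_by_key = {record["student_id"]: record for record in records}

def query(rows, **criteria):
    """A query: return the records whose fields match every given value."""
    return [
        row for row in rows
        if all(row[field] == value for field, value in criteria.items())
    ]

second_years = query(records, year=2)
print(index_by_key[102]["name"])
print([row["name"] for row in second_years])
```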
THREE CLASSIC DATA MODELS
The conceptual organization of a database is termed the data model. It can be thought of as the style of describing and manipulating the data in a database.
There are three classic data models that are used to organize electronic data bases: The Hierarchical, the Network, and the Relational Models.
THE HIERARCHICAL DATA MODEL
In a hierarchical data model, the data are organized in a tree structure. See Figure 6.5 on page 156. The organization is encoded in the data records for each entity.
There is one field that is designated as the key field and is used to organize the hierarchy.
The top of the hierarchy is termed the root and is comprised of one entity. Except for the root, every element has one higher level element related to it, called the parent, and one or more subordinate elements, termed children.
In the hierarchical data model, every relation is a many-to-one relation or a one-to-one relation.
The many Departments belong to one University; there are many students in each department.
Retrieval of all the students or all the professors in a specific department is a very efficient search because there is a direct link between student and department entities and between professor and department entities.
However, to find all the courses offered by a specific department requires a two stage search. First, the records for all the professors teaching in that department would be retrieved and then the courses that each of those professors taught would be retrieved.
This is a less efficient type of retrieval because an intermediate entity, the professors, must be retrieved. This type of retrieval can still be efficient if it does not involve too many intermediate levels.
In the hierarchical model an entity can have only one parent, so the Course entity is not permitted to have both the Department and Professor entities as parents.
Another limitation of this model is that searches cannot be done on the attribute fields. In this example, you cannot retrieve all second year students because the Year field is not a key.
Hierarchical systems are easy to understand and easy to update. They also provide high speed access to large data sets.
This is a good system for bibliographic databases and airline reservation systems when the types of searches are very predictable and can be tightly specified.
The major disadvantages of the hierarchical model are that the data relationships are difficult to modify and queries are restricted to traversing the hierarchy. Geographic information analysis searches are often exploratory and cannot be predicted in advance. Another disadvantage is that multiple parents are not allowed.
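The one-stage and two-stage searches described for the university example can be sketched with nested dictionaries. The department, professor, student, and course names are invented for illustration, and a nested dict is only a stand-in for how a hierarchical DBMS links parent and child records.

```python
# Root -> Department -> (Students, Professors -> Courses). Each entity has one parent.
university = {
    "name": "University",
    "departments": {
        "Geography": {
            "students": ["Lee", "Patel"],
            "professors": {
                "Smith": {"courses": ["GIS 101", "Cartography"]},
                "Jones": {"courses": ["Remote Sensing"]},
            },
        },
    },
}

def students_in_department(root, department):
    """One-stage search: students are direct children of the department."""
    return list(root["departments"][department]["students"])

def courses_in_department(root, department):
    """Two-stage search: retrieve the professors first, then each professor's courses."""
    courses = []
    for professor in root["departments"][department]["professors"].values():
        courses.extend(professor["courses"])
    return courses

print(students_in_department(university, "Geography"))
print(courses_in_department(university, "Geography"))
```

Courses can only be reached by walking down through the professors, which mirrors the restriction that queries must traverse the hierarchy.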
THE NETWORK DATA MODEL
In the network data model, an entity can have multiple parents as well as multiple children and no root is required. The data records can be directly searched without traversing the entire hierarchy above that record. See Figure 6.6 on page 158.
The Course entity can have two parents in the Department and Professor entities.
A search of all courses in a specified department can now be done more directly than in the hierarchical example.
The Student-Course relation is a many-to-many relation. Each student can be enrolled in many courses and each course can have many students.
However, this model does not allow many-to-many relations directly, so this relation is handled indirectly by using an intermediate relation or intermediate record.
For example, the intersection records represent the registration of students in courses or the Student-Course combinations. Each Student-Course combination is unique. One Course entity can have many Registration entities and one Student entity can have many Registration entities.
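A minimal sketch of the intersection-record idea follows. The student and course identifiers are hypothetical, and a Python set of pairs stands in for the Registration records; a real network DBMS would store explicit links rather than scan a set.

```python
students = {1: "Lee", 2: "Patel"}
courses = {"G101": "Intro GIS", "G205": "Cartography"}

# Registration (intersection) records: one per unique Student-Course combination.
# A set cannot hold the same pair twice, so each combination stays unique.
registrations = {(1, "G101"), (1, "G205"), (2, "G101")}

def courses_for_student(student_id):
    """One Student entity can have many Registration entities."""
    return sorted(courses[c] for s, c in registrations if s == student_id)

def students_in_course(course_id):
    """One Course entity can have many Registration entities."""
    return sorted(students[s] for s, c in registrations if c == course_id)

print(courses_for_student(1))
print(students_in_course("G101"))
```

The many-to-many Student-Course relation is thus split into two one-to-many relations that both pass through Registration.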
Network data models tend to have less redundant data storage than the corresponding hierarchical model. However, more extensive linkage information must be stored, adding to the size and complexity of the data files.
When the data structure to be represented is in fact a simple hierarchy, there is no real difference in the expressive power of these two models. However, where a more complex real-world data structure must be represented, the network model can accommodate the added complexity.
As with the hierarchical model, the relations among data elements are encoded in the database. This provides high speed retrieval, but the data relationships are difficult to modify. The principal disadvantages of the network model are that it is more complex than the hierarchical model and not as flexible as the relational model.
THE RELATIONAL DATA MODEL
In the relational data model (Figure 6.7), there is no hierarchy of data fields within a record and every data field can be used as a key.
The data are stored as a collection of values in the form of simple records or tuples (rows). The tuples are grouped together in two-dimensional tables with each table usually stored as a separate file.
The table as a whole represents the relationships among all the attributes it contains and is called a relation.
Using the relational model, a search can be made of any single table using any of the attribute fields, singly, or together. For example, . . . . .
Searches of related attributes that are stored in different tables can be done by linking two or more tables using any attribute they share in common. This is a join operation. See figure 6.8 page 159.
By including only the data fields required, redundant data storage is reduced. In fact, table 6 does not have to be stored at all; it can be created as a virtual table.
As can be seen in table 6, there is a certain amount of redundancy in a relational table. The Course-ID, Course Department, and Course Name information is repeated. However, each row (tuple) is unique. There should never be two identical rows because there is no need to store the same fact twice.
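The join operation and the idea of a virtual table can be sketched with Python's built-in sqlite3 module. The three tables and their contents are hypothetical and are not the tables of Figure 6.8. The joined result is defined as a view, so it is computed on request rather than stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course (course_id TEXT PRIMARY KEY, department TEXT, course_name TEXT);
CREATE TABLE registration (student_id INTEGER, course_id TEXT,
                           PRIMARY KEY (student_id, course_id));

INSERT INTO student VALUES (1, 'Lee'), (2, 'Patel');
INSERT INTO course VALUES ('G101', 'Geography', 'Intro GIS'),
                          ('G205', 'Geography', 'Cartography');
INSERT INTO registration VALUES (1, 'G101'), (1, 'G205'), (2, 'G101');

-- Virtual table: the tables are linked on the attributes they share.
CREATE VIEW student_courses AS
    SELECT s.name, c.course_id, c.department, c.course_name
    FROM registration AS r
    JOIN student AS s ON s.student_id = r.student_id
    JOIN course  AS c ON c.course_id  = r.course_id;
""")

rows = conn.execute(
    "SELECT * FROM student_courses ORDER BY name, course_id"
).fetchall()
for row in rows:
    print(row)
```

In the joined result the course details repeat for every student registered in that course, yet no two rows are identical.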
Advantages of the relational model over the hierarchical and network models.
- The relational model is more flexible than other models. The way the data values exist in the relational tables does not in any way restrict the kinds of processing that can be done. In the hierarchical and network models, manipulation of the data is restricted by the structure built into the data model.
- The relational model has a sound theoretical base in mathematical theory. You can use the mathematics of relations as the basis for data processing procedures.
- The organization of the relational model is simple to understand and, therefore, a good vehicle to communicate database ideas.
- The same database can generally be represented with less redundancy using the relational model than the other two models.
Disadvantages of the relational model.
- It is more difficult to implement.
- It tends to have slower performance. The absence of pointers (a code that indicates a location in a file, such as the location in a file where the attributes of a geographic feature are stored) requires that manipulation of the data be based on matching values in the relational tables. This is a much more time consuming operation and, as a result, a relational database system tends to be significantly slower than the corresponding hierarchical or network database system.
THE NATURE OF GEOGRAPHIC DATA
The map is the most familiar form for representing geographical data. A map consists of a group of points, lines, and areas that are positioned with reference to a common coordinate system.
The map legend links the non-spatial attributes, such as place names, symbols, and colors to the spatial data (the locations of the elements).
The map itself serves both to store the data and to present the data to the user. In a computer-based GIS, the storage and presentation of geographic data are separate, and the same data may be viewed as many different types of maps.
In addition to maps, the data may be presented in the form of tables or text descriptions.
In a computer-based GIS, geographic data are represented as points, lines, and areas as with maps. However, for efficient computer implementation, these elements are organized somewhat differently than the organization of a paper map.
The information for a geographic feature has four major components: Its geographic position, its attributes, its spatial relationships, and time.
Geographic Position (where is it?)
Each feature has a location that must be specified in a unique way. For geographic data, locations are recorded in terms of a coordinate system such as latitude and longitude, Universal Transverse Mercator (UTM), or State Plane Coordinates (SPC).
A GIS requires that a common coordinate system be used for all the datasets that will be used together.
Attributes (What is it?)
The second component of geographic data is their non-spatial attributes.
There is a level of inaccuracy inherent in non-spatial attribute data as there is for spatial data.
A commercial district may not be 100% commercial and a pine stand may not be 100% pine.
Often this type of inaccuracy is not addressed by GIS users, but for many types of analyses it is most important to recognize and take into account this imprecision.
Spatial Relationships (What are its relationships?)
The spatial relationships among geographic data are very numerous and often complex.
For example, it is not only important to know the location of the fire and the fire hydrants, but also how close those fire hydrants are to the fire.
This relationship is intuitive to the person reading the map but must be expressed in a computer-compatible manner.
Because it is not possible to store information about all possible spatial relationships, only some of them are stored and others are either calculated as needed or not available.
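The fire hydrant example suggests how a spatial relationship can be calculated as needed instead of stored. The sketch below assumes planar coordinates in metres (as in a projected system such as UTM); the hydrant identifiers, coordinates, and search radius are invented for illustration.

```python
import math

# Hypothetical hydrant locations as (easting, northing) in metres.
hydrants = {
    "H1": (500.0, 200.0),
    "H2": (530.0, 240.0),
    "H3": (900.0, 900.0),
}
fire = (500.0, 240.0)

def distance(a, b):
    """Straight-line distance between two points in the same planar coordinate system."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def hydrants_within(location, hydrant_table, radius):
    """Return (hydrant_id, distance) pairs inside the radius, nearest first."""
    nearby = [
        (hydrant_id, distance(location, position))
        for hydrant_id, position in hydrant_table.items()
    ]
    return sorted(
        (pair for pair in nearby if pair[1] <= radius),
        key=lambda pair: pair[1],
    )

print(hydrants_within(fire, hydrants, 50.0))
```

Only the positions are stored; the "how close" relationship is derived on demand, which works only because all points share one coordinate system.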
Time (When did this exist?)
Geographic information is referenced to a point in time or a period in time. Knowing when the data was collected can be important.
The representation of time in a GIS is an added level of complexity that is difficult to handle.
Taken together, these four components make geographic data uniquely different from other types of data.