Teaching Data Modeling and Database DesignGraeme Simsion, June 2005

Background

I have been teaching data modeling to information systems professionals at the introductory, advanced and “master class” levels, and to college undergraduates and postgraduates since the late 1980s.

I have been involved in the development of data modeling courses and curricula at Melbourne, Monash and Swinburne Universities in Australia, and liaised with a number of colleges that use my book “Data Modeling Essentials[1]” (first published 1994).

From 1982-1999, I managed a consultancy that was Australia’s leading provider of data modeling consultancy and education, and am currently conducting research on data modeling practice through the University of Melbourne. We hired a number of graduates into trainee data modeler positions, and assisted clients to do the same.

Effectiveness of Education and Training

In my experience, data modeling and database design are often not well-taught, in the college / university environment, particularly in terms of providing students with skills that will be useful in industry. Specifically:

  • Students who have completed introductory (and even advanced) courses in data modeling are frequently unable to develop even very simple models in real situations (as distinct from using descriptions devised by an instructor with a ‘correct answer’ in mind).
  • Courses address aspects of data modeling / database design that students are unlikely to be able to apply in practice – or not until they have substantial practical experience. Examples include database tuning (for small projects, likely to be unnecessary, for large projects likely to be the responsibility of a specialist) and knowledge of alternative modeling conventions.
  • Normalization – a key technique – is often taught in such a mathematical fashion (typically without the use of diagrams) that students fail to gain any intuitive understanding of its use; without that understanding the skill is quickly forgotten.
  • The conventions taught in many colleges are not those widely used in industry. To some extent this reflects academic interests in alternative approaches (e.g. ORM) that do not have a significant user base, but there is a common misunderstanding, reflected in many texts, that the Entity-Relationship approach (Chen, 1976) is dominant in industry. In fact, practitioners generally (mis)-use the term “entity-relationship” to refer to the Information Engineering “crow’s foot” conventions supported by most documentation tools. UML is the emerging alternative standard.
  • Similarly, despite the dominance of relational DBMSs (even for persistent data in object-oriented applications) some courses and texts persist in covering outdated architectures – hierarchical and network models.

Goals of Teaching Data Modeling and Database Design

Virtually every information systems practitioner deals with structured data on a daily basis: specifying/writing programs and reports against a database, specifying additions and changes, developing coding schemes, evaluating and adapting the data structures in software packages. Knowledge of the principles of data representation and structure is fundamental to information systems practice. This is the domain of data modeling.

Conversely, database tuning (physical database design) is generally recognized as a specialized task – and one that has less direct relevance to the general IS practitioner.

In my view, the principal goal of teaching data modeling and database design is to provide students with an understanding of how data is represented and organized in modern databases, and of the principles and tradeoffs involved in good logical data organization. In an undergraduate course, it is not unreasonable to expect them to be able to devise sound relational (or extended relational) data structures to support simple (but real) user requirements.

This is not to deny the value of teaching foundational theory: but without a firm appreciation of how this translates into practice, the theory is quickly forgotten.

Some Guidelines

In my experience, there are four key rules for teaching data modeling and database design effectively:

1. Work Backwards

Data modeling is frequently taught before database design and even before SQL programming. While this reflects the sequence in which these activities are performed in practice, it means that the student does not have a clear idea of how the outputs of his/her activity will be used. The advice given to neophyte data modelers to “identify the things of interest to the business” needs to be qualified by “with an understanding that these will be implemented as tables”. Without this qualification, students will regularly represent almost any noun in the problem description as an entity. Instructors resort to removing such nouns from the problem description in order to lead the student towards the right answer – a less than satisfactory preparation for the real world.

I suggest the following broad teaching sequence, which will typically be spread across more than one course/unit:

  1. Programming against an existing relational database, preferably including one or two structural problems (not fully normalized, non-atomic data) to illustrate the importance of sound data structure.
  2. Modification of a relational database to incorporate new requirements (and programming of queries to prove the changes).
  3. Instruction and practice in relational data structuring techniques – including normalization.
  4. Reverse engineering of a data model diagram from the relational model (preferably using DSD/IE/”crow’s foot” conventions.
  5. Development of data model diagram “top down” from a reverse engineered description.
  6. Development of data model diagram “top down” from real-world description.
  7. Discussion of (and possibly practice with) alternative approaches and conventions.

2. Teach Data Modeling as a Design Discipline

Data modeling is difficult: there is a good argument that it is better characterized as a design discipline than a descriptive / inquiry discipline. Students who are able to quote quite sophisticated theory are frequently unable to tackle even the simplest non-contrived problems. Conversely, the theory makes more sense if it can be related to personal practical experience.

I suggest teaching data modeling as you would any design discipline (architecture is frequently used as a metaphor for data modeling). Provide plenty of practical examples, of increasing difficulty and show the students models developed by others.

In practice there is seldom a ‘single right answer’. An exercise that admits only one solution may be useful in teaching a particular technique, but is not representative of problems encountered in practice. As students move to building models from real situation descriptions (i.e. not written with a model solution in mind), they should be encouraged to compare answers across a range of quality dimensions.

3. Teach the Relevant Theory

There is a substantial body of theory in data modeling. Academic research priorities have not traditionally reflected the most pressing practical issues. While it is useful to provide an overview of research, students are likely to gain more from theory which directly illuminates the work they are doing – and that they are likely to be able to employ in practice.

There is a substantial body of literature around expert vs novice performance; methods, stages and deliverables; the relational model and normalization; and data warehouse design (this largely in the practitioner domain). This is likely to be highly relevant to students as they work on their assignments, and to their later practice.

Conversely, the very large body of literature on alternative models is worthy of note, but is often given attention out of proportion to its relevance. Similarly, there is some current research interest in application of ontology to data modeling – interesting to note but perhaps not worth pursuing in detail except as an elective option for the student at this level.

4. Use popular conventions

I strongly recommend that you use either the IE or UML conventions in teaching data modeling. Research has generally shown these (and the Chen E-R conventions) to be at least as useful as the more “fringe” approaches, which nevertheless have their advocates.

The downside of using alternative approaches such as ORM is that the student misses the opportunity to acquire proficiency with a language that is used in industry. In my experience working with modelers who have been trained in alternative approaches, any additional “deeper understanding” is more than outweighed by this missed opportunity.

Finally, I use diagrams heavily in teaching relational data structures. For various reasons writers on the relational approach (in particular Date) have preferred a tabular presentation to diagrams of structure. Experience in industry and teaching is that diagrams can significantly help students understand the key features of relational structures.

[1] Third Edition, co-authored with Graham C. Witt published by Morgan Kaufmann, San Francisco, 2005.