Introduction to Assessment
Michael Simonson, Ph.D.
Programs in Instructional Technology and Distance Education
Nova Southeastern University
What is Assessment?
If assessment is a way to measure learning gains, what can we do with that measurement? Are there uses for the information? Actually, there are at least five ways that this knowledge can be used to facilitate learning, and many other uses that may indirectly influence the learning environment or help to formulate related policies. Some of the more "administrative" uses of assessment include program evaluation and improvement, justification for funding priorities, or reporting of long-term trends to state or federal entities. In a distance education environment, the results of assessment may sometimes be used to compare the academic performance of remote-site students to those at the origination site. This isn't a particularly helpful comparison (we know that, even if we could control the confounding variables, the results would very likely be "no significant difference"). However, it is often a necessary exercise to ease the worries of program administrators who must be able to "prove" that students at a distance are actually learning the material.
Probably the first use that comes to mind for student assessment is to enable the instructor to assign grades at the end of a course, unit, or lesson. While this is important (and typically required of teachers) and often helpful in determining how to improve the instruction for future students, there are other, more direct ways of enhancing teaching and learning.
Students gain a sense of control and can take on greater responsibility for their own learning if they know how well they're doing, compared to an established set of criteria. Frequent assessments, informal or otherwise, provide this scale. When the instructor has this information, he or she can provide remediation or correction where necessary, or determine if a student needs additional assistance.

At the same time, this information helps the instructor to monitor the effectiveness of the instruction. If many students have difficulty with the same concept or skill, this could signal a lesson design problem. By using assessments carefully, the teacher can identify and address weaknesses or gaps in the instruction.

When students encounter an assessment activity, they not only recall the needed concepts or skills, but reinforce them through application. This is especially important to remember if course content is highly sequential or hierarchical in nature. Frequent assessments help to emphasize the correct concepts and skills (necessary to advance through later material), and also pinpoint learner misconceptions that would eventually present obstacles to further progress.

Finally, assessments are often a motivational activity. Most learners want to do well, and knowing that they'll be held accountable for a body of knowledge or set of skills can be the nudge that keeps them on track. Many a teacher has adopted group discussions, pop quizzes, or in-class exercises to ensure that required readings or out-of-class assignments are completed on time.
Assessment and Evaluation
How does assessment differ from evaluation? These terms often are used synonymously, but in an instructional context they have different meanings and applications. Assessment, as explained above, denotes the measurement of progress toward a learning goal; for example, student performance compared to a desired level of proficiency at a task. In this chapter, the word assessment will be used specifically to refer to this process. Evaluation, on the other hand, suggests the attribution of significance or quality to the current status of a particular object or condition. As its root word, "value," suggests, evaluation implies that a judgment will occur regarding the information that assessment activities provide.
Summative evaluation of an instructional unit is a way of assigning value to a learning package; formative evaluation determines the level of quality of an unfinished product for purposes of revision and improvement. The techniques of assessment and evaluation may sometimes overlap, but their purposes and how the resulting data are used make them distinct.
Assessment and Instructional Design
The role of assessment in the instructional design process is as a corollary to the development of learning objectives, and a precursor to the development and implementation of instructional strategies (Dick and Carey, 1996; Gagne, Briggs, and Wager, 1992). In this way, assessment activities are matched to expectations, and instruction is then based on assessment plans. A less formal way of expressing this might be: figure out what learners should get out of the instruction, determine how you'll know whether or not they were successful, and then decide what they should do to reach that point. In this manner, "teaching to the test" becomes a desirable basis for instruction, because the test is a measure of what's considered important.
Unfortunately, this ideal is realized only occasionally; assessments typically are created after the instruction is planned, and often after it has been implemented. This doesn't preclude the use of objectives as a basis for determining progress, but care must be taken to ensure that the instruction has also been based on the same expectations and has not wandered from the original goals. If students prepare for an examination thinking, "What's going to be on this test?" or face an out-of-class assignment wondering what's expected of them, that may indicate that the objectives have been forgotten along the way. Those desired outcomes must act as a continuous thread that binds the instructional process together from beginning to end.
Characteristics of Good Assessments
As described in the previous section, assessment is one element of the teaching and learning process that grows from the determination of desired performance outcomes. It follows, therefore, that one of the characteristics of a good assessment tool is that it matches the objectives; learners know what to expect because they've already been made aware of what's important and how they'll be expected to demonstrate their mastery of this knowledge or skill. The objectives, ideally, specify what students will do to demonstrate mastery of the content, how well they will be expected to perform this task, and under what special circumstances, if any.
Occasionally, an instructor will find that a test item or exercise they've created doesn't match the objectives, although they believe it measures an important skill or concept for learners to grasp. In this case, it makes sense to return to the list of objectives and consider the possibility that there are gaps or missing items. Herein is an excellent reason for creating assessment measures before implementing instruction: if there are gaps in the objectives list that become apparent only when developing an assessment activity, it's likely that the missing material will not have been included in the planned learning activities. Instructional design is conducted in an iterative manner that allows for enough flexibility to revise throughout the development process, but revisions during implementation are often more difficult and can confuse learners.
An assessment may, on the surface, match the objectives but still not reflect the student's progress. This characteristic, the degree to which an assessment provides an accurate estimate of learning gains, is known as validity. If a learner who has mastered the specified body of knowledge does poorly on the test, exercise, or project intended to measure this mastery, then that assessment is an invalid instrument or activity. Conversely, if learners who have not mastered the material perform well, then that assessment is not considered a valid predictor of learning, either. For example, test items intended to measure analogical reasoning may, instead, reflect the learner's reading ability if vocabulary level is not considered in test design. Or, if a project is supposed to demonstrate the learner's understanding of due process in our judicial system, but an unrealistic time limit for completion is imposed, this assessment may indicate that some learners have not mastered the concepts, when in fact they simply were not given adequate time to demonstrate their expertise.
A concept related to validity, and an important characteristic of good assessments, is reliability. Reliability refers to the stability of an instrument or activity; this could be thought of as how consistently the assessment measures learning gains. If students perform poorly, as a group, on one occasion and then do much better later, the predictability of this assessment is called into question. Or, if learner mastery is measured by observation and scored by several different raters, the scores must be highly correlated to ensure consistency (also known as inter-rater reliability). Low reliability signals that the results are not dependable and could vary significantly from day to day.
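To make inter-rater reliability concrete, here is a minimal sketch (an illustration, not from the original text; the raters and scores are invented) that computes the Pearson correlation between two raters' scores for the same set of student performances. A coefficient near 1.0 suggests the raters are applying the scoring criteria consistently.

    # Minimal sketch of inter-rater reliability as a Pearson correlation.
    # The scores below are hypothetical; in practice they would come from
    # two raters independently scoring the same student performances.
    from math import sqrt

    def pearson(x: list[float], y: list[float]) -> float:
        """Pearson correlation coefficient for two equal-length score lists."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        return cov / sqrt(var_x * var_y)

    # Hypothetical ratings (0-10 scale) from two raters for eight students.
    rater_a = [8, 6, 9, 4, 7, 5, 10, 3]
    rater_b = [7, 6, 9, 5, 8, 5, 9, 4]
    print(f"Inter-rater correlation: {pearson(rater_a, rater_b):.2f}")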
A characteristic of good assessments that is especially important in distance education settings is clarity of expectations. This refers to how easy the assessment is for the learner to understand, whether the instructions are clearly written, and whether any special conditions are to be met. For many distant students, examinations will be proctored by someone other than the instructor ("I think this is supposed to be an 'open-book' test."), projects may be completed based only on the directions initially provided ("Well, she said we could pick any topic!"), and papers written according to instructions given in the course syllabus ("Was that ten to twelve pages double-spaced or single-spaced?"). If the directions are not comprehensive, specific, and clearly worded, the assessment may ultimately prove useless.
Finally, while there are other criteria for judging the merits of a particular assessment, they all are designed to help answer the question, "Does this assessment measure learning gains and allow an accurate generalization of results beyond the immediate situation?" In other words, a useful assessment reflects the learner's progress and understanding, as well as the transferability of skills and knowledge. The obvious purpose of an assessment is to document the direct results of instruction; if a student can perform a task in the learning environment but cannot do so elsewhere, the instruction has been futile.
Norm-referenced and Criterion-referenced Scoring
Once a student completes an assessment and it is scored or rated, there are two systems that can be used to report this score. Criterion-referenced scoring compares the learner's performance against a predetermined set of standards drawn from the learning objectives. The rater asks, "Did the learner master the content?" The score that is reported reflects this level of mastery. Norm-referenced scoring, however, reports these same ratings by comparing each student to the others who've completed the same assessment. The term "grading on a curve" reflects this type of scoring; in this case the rater asks, "How well did the learner do compared to the others?"
There are appropriate uses for each type of scoring. If our concern, as teachers or instructional designers, is that each student master the course content, then their performance compared to one another is irrelevant; we need to ensure that they've performed well compared to our expectations by using criterion-referenced scoring. Norm-referenced scoring is used, appropriately, to report long-term trends and comparisons across extremely large groups of learners (e.g., nationwide or worldwide, in the case of some university pre-admission examinations), but should never be used to assign grades or determine mastery of content. Grading "on a curve" only tells the teacher and learner how well students did relative to one another; it doesn't give a clue as to whether anyone mastered any of the content.
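The contrast can be illustrated in a few lines of code. In this hypothetical sketch (the names, scores, and the 80% mastery cutoff are all invented for illustration), criterion-referenced scoring asks whether each raw score clears a fixed standard drawn from the objectives, while norm-referenced scoring reports each student's standing relative to the rest of the group; note that the second report says nothing about mastery.

    # Hypothetical sketch: the same raw scores reported two ways.
    scores = {"Ana": 92, "Ben": 78, "Cho": 85, "Dee": 61, "Eli": 70}

    # Criterion-referenced: compare each learner to a fixed standard
    # (an assumed mastery cutoff of 80%, drawn from the objectives).
    MASTERY_CUTOFF = 80
    for name, score in scores.items():
        status = "mastered" if score >= MASTERY_CUTOFF else "not yet mastered"
        print(f"{name}: {score}% -> {status}")

    # Norm-referenced: compare each learner to the others who completed
    # the same assessment (percentile rank within this group).
    values = sorted(scores.values())
    for name, score in scores.items():
        below = sum(1 for v in values if v < score)
        percentile = 100 * below / len(values)
        print(f"{name}: {score}% -> {percentile:.0f}th percentile of the group")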
Traditional Assessment Tools
When considering how to determine if a student has achieved the desired level of content mastery, one of the first thoughts that might come to mind is to administer some kind of paper-and-pencil test. In this section, four of the more frequently used test styles -- multiple-choice, true/false, short answer or free-response, and essay -- will be described and their particular strengths and applications discussed.
Multiple-choice tests can be an efficient way to measure learning, especially if the objectives are written at a low level of cognitive effort, such as knowledge or comprehension, where students are expected merely to recall previously memorized information (e.g., state capitals, vocabulary words, or bones of the human skeleton). As objectives move up the cognitive processing scale toward analyzing and synthesizing (inferring relationships or creating models), multiple-choice test items become more difficult and time-consuming to create. Unfortunately, in many cases, instructors expect students to gain higher-order thinking skills but then test them using items that only reflect comprehension, or perhaps application, of the content area.
When writing multiple-choice items, the "stem" (usually written as a question or an incomplete statement) is presented along with one correct answer and several "distracters" (incorrect responses) in a list from which learners choose. Distracters should appear to be plausible answers and should include choices that are often confused with the correct answer, to ensure that students are discriminating among possible responses. When testing for higher-order thinking, it may be necessary to direct students to select the "best answer" from those presented, rather than assume that there is only one correct answer to any given question. "All of the above" and "None of the above" responses should be avoided; these items, and others that are unclear or obscure, can be confusing and result in a score that does not reflect actual learning. Multiple-choice questions can also be written with one stem and a list of responses from which students choose all that are correct. Or, one set of responses can be used for several stems, with students selecting the correct response for each stem. In this case, there may be a mix of correct responses and distracters, and some of the responses might be appropriate answers for more than one stem.
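As a concrete illustration of the structure just described, the hypothetical sketch below (the class name and sample item are invented) models an item as a stem, one or more keyed answers, and a set of distracters; the same shape covers single-answer, "best answer," and "choose all that apply" variants.

    # Hypothetical model of a multiple-choice item: a stem, the keyed
    # correct answer(s), and distracters, shuffled together for display.
    import random
    from dataclasses import dataclass

    @dataclass
    class MultipleChoiceItem:
        stem: str               # question or incomplete statement
        correct: set[str]       # one answer, or several for "choose all that apply"
        distracters: list[str]  # plausible but incorrect responses

        def choices(self) -> list[str]:
            """All responses, shuffled so the key's position is unpredictable."""
            options = list(self.correct) + self.distracters
            random.shuffle(options)
            return options

        def score(self, selected: set[str]) -> bool:
            """Correct only if the learner selects exactly the keyed answers."""
            return selected == self.correct

    item = MultipleChoiceItem(
        stem="Which term describes scoring against a fixed standard?",
        correct={"criterion-referenced"},
        distracters=["norm-referenced", "formative", "summative"],
    )
    print(item.choices())
    print(item.score({"criterion-referenced"}))  # True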
One of the greatest advantages of multiple-choice testing is the ease of scoring and item analysis. These types of items (along with true-false) are considered "objective" because human judgment is unnecessary in scoring; i.e., an answer is either right or wrong, and a subjective interpretation is not needed to determine the accuracy of each response. Where once nearly all multiple-choice tests were scored from optical scan forms ("bubble sheets") marked with the requisite #2 pencil, inexpensive and easy-to-use testing packages now make even this obsolete. With appropriate password protection and student verification (to ensure that the person taking the test is, in fact, the person who gets credit for it), these packages are a major leap forward for distant learners and their teachers: no more of the formerly inevitable turn-around wait time and massive stacks of papers to score and mark. However, as mentioned previously, multiple-choice tests are difficult and time-consuming to create unless the subject matter is at a very low level of cognitive difficulty. Other weaknesses include the possibility of students guessing correctly or responding to verbal associations that do not require understanding of the content.
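The item analysis mentioned above can also be sketched simply. In the hypothetical example below (the response matrix is invented), each row records one student's answers as right (1) or wrong (0); the difficulty index for an item is the proportion of students answering it correctly, and a crude discrimination index compares how the top- and bottom-scoring halves of the class fared on it. These are standard item-analysis statistics, though real testing packages compute more refined versions.

    # Hypothetical item analysis for a five-item objective test.
    # Each row is one student's responses: 1 = correct, 0 = incorrect.
    responses = [
        [1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 1, 0, 1, 1],
        [1, 0, 1, 0, 0],
    ]
    n_students = len(responses)
    totals = [sum(row) for row in responses]

    # Split students into top and bottom halves by total score.
    ranked = sorted(range(n_students), key=lambda i: totals[i], reverse=True)
    half = n_students // 2
    top, bottom = ranked[:half], ranked[half:]

    for item in range(len(responses[0])):
        # Difficulty: proportion of all students answering correctly.
        difficulty = sum(row[item] for row in responses) / n_students
        # Discrimination: top-half success rate minus bottom-half rate.
        top_rate = sum(responses[i][item] for i in top) / len(top)
        bottom_rate = sum(responses[i][item] for i in bottom) / len(bottom)
        print(f"Item {item + 1}: difficulty {difficulty:.2f}, "
              f"discrimination {top_rate - bottom_rate:+.2f}")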
Another type of objective test item is the true-false (sometimes called "alternative response") question. True-false items typically are used when objectives call for students to determine which of two possible responses is appropriate -- whether a statement is true or false, fact or opinion, or some modification of this. One should use these items when there can be only two possible responses (yes or no, agree or disagree, valid or invalid, etc.), one of which is correct. If you find it difficult to create an unambiguous stem that clearly will have only two possible answers, consider using another type of test item. Like multiple-choice items, true-false questions can be clustered by providing a stem that asks students to identify specific responses from a list of several possible answers. For example, the stem might ask, "Which of the following are synonyms for the word 'synchronous'?" Responses to be marked as "yes" or "no" could include "concomitant, sequential, coincident, simultaneous, consecutive."
Some obvious advantages exist for true-false items: they provide for simple, easy-to-automate scoring; test administration can be handled quickly (many items can be answered in a short amount of time); and results usually do not depend heavily on the reading ability of the student. Unfortunately, guessing offers the learner a 50% chance of success, and when the stem statement is false it can be difficult to determine whether learners actually know the right answer or merely knew that the statement given was incorrect. (One way to mitigate this effect is to require test takers to re-write a false stem so that it becomes a true statement; this, in turn, reduces the ease of scoring, however.) True-false items, to be effective, must be precisely written and free of irrelevant information or verbal clues to the correct response. Finally, avoid designing the test so that students miss items simply because they suspect they're being "tricked" in some way (the use of all "true" statements on a test, for example). While such a design might test the strength of the learners' convictions in their responses, it may not provide a realistic measurement of learning.
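A standard remedy from classical test theory, not discussed above but worth noting, is a correction for guessing: corrected score = R - W/(k - 1), where R is the number of items answered correctly, W the number answered incorrectly, and k the number of response options per item. For true-false items k = 2, so the corrected score is simply R - W, and a student who guesses blindly on every item will, on average, score zero.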