Mallory Jaryga

Education Evaluation

October 16, 2005

Module 4 – Journal Entry #3

McMillan suggests seven criteria for establishing high-quality classroom assessments: validity, reliability, selecting appropriate methods, clear and appropriate learning targets, practicality and efficiency, positive consequences, and fairness. Let us first focus on validity and reliability before moving on to the others.

Validity, as defined by McMillan, “is a characteristic that refers to the appropriateness of the inferences, uses, and consequences that result from the assessment.” In other words, the interpretations the assessor makes from the test must be reasonable. McMillan also states that validity does not deal with the test itself, but instead with the consequences of the inferences made from the test. Therefore, it is inaccurate to talk about “the validity of the test”; a test cannot be valid or invalid, but an inference made from a test, instrument, or procedure that is used to gather information can be deemed valid or invalid.

Three major sources of information can be used to establish validity: content-related, criterion-related, and construct-related evidence. The categories themselves do not effectively address the uses and consequences of results, but classroom teachers can use these types of evidence, along with consideration of the consequences and uses of each, to judge the degree of validity of an assessment.

Content-related evidence is defined as “the extent to which the assessment is representative of the domain of interest.” This type of evidence is representative of what the student knows with respect to a certain whole, like a unit or day of instruction. Content-related evidence leads to the inference from the test that the student demonstrates knowledge about a unit, lesson, etc. To more accurately create a content-related test or task, the teacher might consider making a test blueprint or table of specifications to specifically lay out what objectives should be covered on the test. This blueprint or table shows the content and the types of learning targets represented in the assessment. The goal of the table or blueprint is to “systematize your professional judgment so that you can improve the validity of the assessment.”
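For illustration, a blueprint for a hypothetical fractions unit might cross content areas with the types of learning targets; the unit and the numbers here are invented, not McMillan’s:

```
Content area             Knowledge  Application  Reasoning  Total items
Equivalent fractions         3           2           1          6
Adding and subtracting       2           4           2          8
Word problems                1           2           3          6
```

Each cell shows how many items will address that content at that level, which makes gaps or overemphasis easy to spot before the test is ever written.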

A teacher dealing with this type of evidence must also consider instructional validity: does what was taught match what will be assessed? Asking this question and creating an assessment that positively answers it will lead to reasonable inferences about students’ performances, thereby increasing validity.

Criterion-related evidence relates an assessment to some other measure (criterion) that either demonstrates an estimate of current performance (concurrent criterion-related evidence) or predicts future performance (predictive criterion-related evidence). The principle here is that when a classroom teacher has two or more measures of the same thing, both measures should provide similar results. If two measures of the same thing do not yield similar results, criterion-related evidence has not been shown, and the validity of one or both measures would be in question. If a correlation is found between the measures, then inferences made about future performance would be valid.
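As a rough illustration of how such a correlation might be checked, here is a minimal Python sketch using invented scores; the names and numbers are hypothetical, and McMillan does not prescribe any particular computation:

```python
# A minimal sketch of checking criterion-related evidence, assuming
# hypothetical scores: a teacher-made unit test and an external
# criterion measure for the same eight students.
from statistics import mean, stdev

unit_test = [78, 85, 92, 64, 88, 71, 95, 80]   # teacher's measure
criterion = [74, 82, 95, 60, 85, 70, 91, 77]   # second measure of the same thing

def pearson_r(x, y):
    """Pearson correlation between two sets of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

r = pearson_r(unit_test, criterion)
print(f"correlation between measures: r = {r:.2f}")
# A high positive r supports criterion-related evidence; a low r calls
# the validity of one or both measures into question.
```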

Lastly is construct-related evidence. A construct is “an unobservable trait or characteristic that a person possesses.” A construct cannot be measured directly, as a performance task can; it must instead be constructed from indirect evidence. Whenever these constructs are assessed, the validity of our inferences depends on the construct-related evidence that is presented.

There are three types of construct-related evidence: theoretical, logical, and statistical. The theoretical explanation or definition of the characteristic to be assessed must be accurate in order for the evidence to be valid. Logical analysis can take several forms, such as having students comment on their thoughts during a certain task or comparing the scores of different groups. Statistical procedures can be used to correlate scores from different measures assessing the same construct, or to correlate scores from the same measure assessing different constructs.

Before moving on to reliability, McMillan offers several more suggestions for enhancing validity: ask others to judge the clarity of what you are assessing; sample a sufficient number of examples of what is being assessed; compare groups, scores, and predicted consequences; provide adequate time to complete the assessment; ensure appropriate vocabulary, sentence structure, and item difficulty; ask easy questions first; use different methods to assess the same thing; and so on.

Another way to ensure the validity of an assessment would be to create a Taxonomy Table of Learning Targets, like the one created in this module for our units. The Taxonomy Table would, in effect, act like the blueprint or table of specifications that McMillan outlines; the Taxonomy is a way to organize learning targets and processes in order to create an assessment that accurately assesses students’ abilities related to a specific learning target as well as the learning process.

Reliability concerns “the consistency, stability, and dependability of the scores” of an assessment. Reliable results show similar performance at different times. No matter how appropriate or well chosen the method of assessment, there is always error. Whenever we assess something, the result we obtain is an observed score. The observed score is the sum of the true score, or real ability or skill, and some degree of error. McMillan abstractly represents this relationship as such:

Observed Score = True Score + Error

Reliability can be determined by “estimating the influence of various sources of error.” In the classroom, reliability is determined by being aware of sources of error and observing how consistent students are at answering questions concerning the same topic.
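This Observed Score = True Score + Error relationship can be made concrete with a small simulation. In classical test theory (a standard framing, though not spelled out this way by McMillan), reliability is the proportion of observed-score variance that comes from true scores; the sketch below uses made-up numbers:

```python
# A minimal sketch of Observed = True + Error, using simulated
# (hypothetical) scores for 500 students.
import random
from statistics import variance

random.seed(1)
true_scores = [random.gauss(75, 10) for _ in range(500)]   # real ability
errors      = [random.gauss(0, 5)   for _ in range(500)]   # measurement error
observed    = [t + e for t, e in zip(true_scores, errors)] # what we actually see

# Reliability: share of observed-score variance due to true scores.
reliability = variance(true_scores) / variance(observed)
print(f"estimated reliability: {reliability:.2f}")
# With less error (a smaller spread in `errors`), observed scores track
# true scores more closely and the estimate approaches 1.
```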

The five major sources of reliability evidence are stability, equivalence, internal consistency, scorer or rater consistency, and decision consistency. Evidence based on stability refers to “consistency over time.” This type of evidence can also be called test-retest reliability; it is produced by testing a group of individuals, waiting a specified amount of time, and then retesting the same individuals with the same assessment. The correlation between the two sets of scores can be used to measure stability and, therefore, reliability.

Evidence of reliability based on equivalent forms is found by giving two different forms of an assessment to the same group of students and then correlating the results. If a significant time delay is also applied with this technique, the feature of stability is added, offering an “equivalence and stability estimate.” Teachers rarely use this type of evidence to determine reliability, as it is time consuming.

McMillan then turns to internal consistency, or “a single administration of the assessment.” This type of evidence is based on the level of correlation on items or tasks measuring the same trait. Only one form of assessment is used here and the theory is that the scores measuring the same traits should give consistent, reliable results. Internal consistency is somewhat easy to obtain as only one form of assessment is needed, but there should be a sufficient number of items from which to assess the student and there should be no time limit.
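McMillan’s text here does not name a specific statistic, but Cronbach’s alpha is a standard single-administration estimate of internal consistency; a minimal sketch with hypothetical item scores:

```python
# Cronbach's alpha from one administration: five (hypothetical) students'
# scores on four items that measure the same trait.
from statistics import variance

scores = [        # rows = students, columns = items (0-10 scale)
    [8, 7, 9, 8],
    [5, 6, 5, 4],
    [9, 9, 10, 9],
    [4, 5, 4, 5],
    [7, 6, 7, 8],
]

k = len(scores[0])                                  # number of items
items = list(zip(*scores))                          # scores grouped by item
item_var = sum(variance(item) for item in items)    # sum of item variances
total_var = variance([sum(row) for row in scores])  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_var / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
# Alpha near 1 means items measuring the same trait give consistent
# results; it rises with more items, which is why a sufficient number
# of items matters.
```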

Error can be introduced into an assessment by a lack of scorer or rater consistency. A single scorer would be the best possible solution, but since this is not always possible, all raters should agree on the criteria and evaluations to limit error. Evaluating a good variety of products also helps eliminate some error.

Lastly, there is evidence based on decision consistency. This evidence can be considered alongside rater consistency, as both have to do with the act of scoring assessments. This error can be greatly lessened by a clear delineation of student expectations in the form of a rubric or other evaluation aid. If the teacher or scorer has a transparent view of what he or she is assessing, the amount of error will be smaller than if he or she were evaluating assessments with no apparent guide.
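A simple way to see scorer and decision consistency side by side is to compare two raters’ scores on the same set of products. The sketch below uses hypothetical rubric scores on a 1–4 scale and an assumed pass mark of 3; neither the scale nor the cutoff comes from McMillan:

```python
# Two raters score the same ten (hypothetical) essays against a shared rubric.
rater_a = [3, 4, 2, 4, 1, 3, 2, 4, 3, 2]
rater_b = [3, 4, 2, 3, 1, 3, 2, 4, 2, 2]

# Scorer/rater consistency: how often the raters give the exact same score.
agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"exact agreement: {agreements / len(rater_a):.0%}")        # 80%

# Decision consistency: do the raters reach the same pass/fail decision
# (here, hypothetically, a score of 3 or above passes)?
same_decision = sum((a >= 3) == (b >= 3) for a, b in zip(rater_a, rater_b))
print(f"decision consistency: {same_decision / len(rater_a):.0%}")  # 90%
```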

Along with the different types of evidence that, when gathered appropriately, support reliability, there are factors that influence reliability estimates, such as the range or spread of scores, the number of items on the assessment, and the objectivity of the scoring. McMillan offers suggestions for enhancing reliability that go hand in hand with these factors: use a sufficient number of items or tasks; use independent raters or observers; make certain that assessment procedures and scoring are objective; eliminate or reduce the influence of irrelevant factors; and use shorter assessments more frequently rather than fewer, longer ones.
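The effect of the number of items can be illustrated with the Spearman-Brown prophecy formula, a standard result from classical test theory (McMillan’s list above does not include the formula itself); the starting reliability here is invented:

```python
# Spearman-Brown: predicted reliability when test length changes by factor k.
def spearman_brown(reliability, k):
    """Predicted reliability of a test whose length is multiplied by k."""
    return (k * reliability) / (1 + (k - 1) * reliability)

current = 0.60                        # hypothetical reliability of a 10-item quiz
print(spearman_brown(current, 2))     # doubling to 20 items -> 0.75
print(spearman_brown(current, 0.5))   # halving to 5 items   -> about 0.43
```

The formula makes the trade-off explicit: adding comparable items raises reliability, while shortening an assessment lowers it.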

McMillan also states that “clear and appropriate learning targets” are another way of ensuring high-quality assessments. Learning targets include both what students know and can do and the criteria for judging student performance. If the learning targets for the unit are “clear and appropriate,” then the teacher is off to a good start in assessing his or her students.

Next on the list is the “appropriateness of the assessment method.” There are several different methods of assessment, and “particular methods are more likely to provide quality assessments for certain types of targets.” Once the teacher has created the learning targets, the next step is to match an appropriate assessment method to each learning target. If the teacher chooses an appropriate method to assess his or her students, then the students will have a better chance of accurately demonstrating what they know.

“The nature of classroom assessments has important consequences for teaching and learning.” A student will be affected by the type and appropriateness of the assessment being given. A positive consequence, or the appropriate match between the learning target and the chosen assessment task, will, hopefully, positively affect the student and most certainly increase the level of high-quality assessment in the classroom.

Fair assessments, or assessments that give all students an equal opportunity to demonstrate achievement and yield comparably valid scores from one student to the next, will also enhance high-quality assessment. A fair assessment is unbiased and nondiscriminatory, and students should be told ahead of time what knowledge they will be assessed on.

Practicality and efficiency should, for obvious reasons, be included in the list of criteria for evaluating high-quality classroom assessments. The teacher must consider time efficiency, both for himself or herself in creating and scoring the assessment and for the student. Ease of scoring, ease of interpretation, and cost are all considerations the teacher must make in deciding the practicality and efficiency of an assessment.