EFPM 311-Week 4 Questions / Pearson /

Brown-pg. 137-138

2. An NRT would most likely result in student scores that range from low to high achievement, and thus these tests are often used to show differences in individual students’ abilities (in relation to each other), which is useful for placement testing. Brown suggests that we most often expect NRT scores to fall on a bell curve, with a few students at the top, a few at the bottom, and most in the middle. I believe that NRT-type marking also sometimes creeps into the marking of CRTs that include a subjective element, like speaking or writing. Teachers often have to ‘co-mark’ these types of test items in order to normalize scores. Though this in itself does not mean that students are marked against anything other than criteria set prior to testing, in my experience these ‘norming’ sessions often include arguments about how one student could score so high when their paper is so similar to that of a student who scored lower. This could be a result of teacher subjectivity and ‘intuition’ when grading, but it could also result from criteria that are not specific enough (e.g., a writing criterion that states ‘use of appropriate vocabulary’). In these sessions I often felt that the development of criteria was a tug-of-war between CRT testing, where criteria were very specific and students were marked individually, and NRT testing, where criteria were broader and allowed teachers some freedom to assess work more holistically but also invited comparisons between students.

3. On a CRT, all students could conceivably score 100% if they had all learned the covered material, and thus these are the sorts of tests teachers employ in their classrooms to mark student progress through material outlined in the curriculum and on the syllabus. As mentioned above, I believe there are times when CRT tests become more like NRT tests, in that teachers are allowed some freedom (a good thing!) and perhaps subconsciously compare students with each other (not so good!). Of course, these slips into NRT territory are most likely to occur on test items where there is room for interpretation (writing, speaking, open-ended questions, etc.); CRTs are more clearly criterion-referenced when they include items like multiple choice that are either right or wrong. One further overlap is that, while Brown links bell curves specifically with NRTs, teachers often use them to view grades on CRTs as well. It has always been my belief that it is desirable for students to fall along a bell curve on any assessment: if all students score high, the test is too easy; if all students score low, the test is too difficult; if students fall on a bell curve, the test is written at the right level. Perhaps I will have to rethink this and instead credit uniformly high scores to excellent teaching and highly motivated students!
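The contrast between the two interpretations can be made concrete with a small sketch. Everything here is hypothetical and for illustration only (the scores, the 80% mastery cutoff, and the function names are my assumptions, not from Brown): the same raw score is read relatively under NRT logic (how far from the group mean?) and absolutely under CRT logic (did the student meet the criterion?).

```python
# Hypothetical illustration of NRT vs CRT score interpretation.
# The scores and the 80-point cutoff are invented for this sketch.
from statistics import mean, stdev

scores = [62, 70, 74, 75, 78, 80, 83, 85, 91, 96]  # one class's raw scores

def z_score(score, group):
    """NRT reading: position of a score relative to the group."""
    return (score - mean(group)) / stdev(group)

MASTERY_CUTOFF = 80  # CRT reading: fixed criterion, independent of the group

for s in scores:
    relative = round(z_score(s, scores), 2)
    absolute = "mastered" if s >= MASTERY_CUTOFF else "not yet"
    print(f"raw {s}: z = {relative:+.2f} (NRT) | {absolute} (CRT)")
```

Note that under the CRT reading every student could in principle clear the cutoff, while the z-scores by construction always spread the class around zero, which is exactly the "all students could score 100%" versus bell-curve distinction above.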

9. A number of factors should be considered when assessing the quality of a test, including the test type and purpose, the types of items included, the test content and its appropriateness to the students (in terms of age, level, culture, prior learning, etc.), reliability, validity, and practicality. In my experience, choosing (or writing) a test includes the often grueling task of trying to balance all of these factors, though in practice item type, content appropriateness, and practicality are usually the most heavily weighed considerations. I think reliability is considered in program-wide tests (midterms/finals) but less often in classroom tests, because teachers design their own testing instruments, adapt others’ tests for their class, or simply don’t compare tests across student groups. However, Brown stresses that language tests should be context specific, and I believe this to be the most important aspect. Tests that fail to account for the specific objectives of a program, and even more so for the context of a specific classroom, are likely to cause negative washback, with teachers drilling students in preparation for the test rather than providing a well-rounded learning experience where students are then tested on what they have learned. This can happen with widely used standardized tests, but it can also happen within a program. For example, my program had several sections at each level (level 3, say, had 5 sections and thus 5 teachers), and daily practice was carried out largely without consultation among level teachers. Even when teachers did consult, they often disagreed and left without a consensus. However, midterm and final exams had to be written and agreed upon by all level teachers. Thus, in the few weeks leading up to the exams, teachers drilled students on material that was included in the exam but had not been covered in their classes (due to time constraints or schedules modified to meet student needs).