Rethinking large-scale assessment

Lyn Shulha and Robert Wilson

Abstract

This paper examines the implications of using large-scale assessment results (a) to make judgements about student achievement of educational goals and (b) to provide educators with directions for improving teaching and learning. First, an exploration of the goals of education and how they are developed is outlined, followed by a description of large-scale assessment programmes. The construct of validity in both a programme sense and an assessment sense is described, followed by an analysis of the degree to which large-scale assessment programmes adhere to the resultant criteria. All of this discussion is then related to the experiences of two classroom teachers as they confronted their own dilemmas in assessing their students’ achievement and their growth towards educational goals. Finally, the paper analyses the twin practices of large-scale and classroom assessment and makes some recommendations for future investigation of how they fit together.

Introduction

In Canada and internationally, a common educational agenda is to improve student performance on valued objectives and expectations (see, for example, Ministry of Education, 2007; Ontario Office of the Premier, 2009; Queensland Department of Education, Training and the Arts, 2008; Scottish Government, 2009; United States Government, 2002). Concurrently, school systems invest in programmes of standardised large-scale assessment to monitor the extent to which their students achieve designated levels of performance (Klinger, DeLuca, & Miller, 2008). Newer large-scale assessment programmes also use interpretations of achievement data to construct reports for teachers and schools. This feedback is intended to shape local plans for the improvement of learning (see, for example, Education Quality and Accountability Office, 2009; Gilmore, 2008). A reasonable question to ask is: To what extent are large-scale assessment programmes and the interpretations they generate trustworthy for these twin purposes?

This paper examines the implications of using the results from large-scale assessment programmes (a) to make judgements about the achievement of valued objectives and expectations and (b) to provide recommendations on future directions for teaching and learning. To begin this examination we look at the array of decisions and activities rooted within many large-scale assessment programmes, including how achievement is typically defined. We then consider notions of validity, developed within the fields of programme evaluation and testing to weigh the implications of using large-scale assessment programmes to serve both achievement auditing and improvement purposes. This is followed by an account of two teachers’ responses to expectations that their assessment practices should be congruent with strategies and recommendations arising out of large-scale assessment. Finally, we discuss the implications for programmes of large-scale assessment and for teachers’ assessment practices.

Defining student achievement

What would we expect to see in a school system where there is concerted effort to raise levels of student achievement? In general terms, we would expect to see children becoming more proficient at realising what the system has declared to be valuable learning. The step of determining what is to be valued is thus groundwork for subsequent discussions about what it means to achieve. The values of a school system are readily identified in mission statements. These statements articulate educational goals that in turn describe in general terms the capacities that students are to develop during their time in school. They also set the parameters of the system’s accountability to stakeholders, including parents and students.

There is typically considerable input into the establishment of a school system’s educational goals. For example, in Canada, an infrequent but highly influential way to renew educational goals is through a Royal Commission. After extensive province-wide hearings with educators, parents, researchers and members of the community at large, the last Royal Commission in Ontario charged the Ministry of Education with focusing on “building on basic reading, writing, and problem-solving skills to ever-increasing stages, as well as ever-deepening degrees of understanding, across a variety of subject areas” (Royal Commission on Learning, 1994, pp. 4–5). One consequence of adopting this Commission’s recommendations was the need to monitor the extent to which this was happening across the province. In response, the Ontario Government created the Education Quality and Accountability Office (EQAO), an arms-length organisation charged with conducting provincial large-scale assessments and reporting the findings on these assessments to the public.

Input into the substance of educational goals also comes from nongovernmental organisations with a vested interest in shaping what is valued, taught and assessed by schools.1 For example, the Conference Board of Canada bills itself as the foremost independent, not-for-profit applied research organisation in Canada. Along with its affiliate, The Conference Board, Inc. of New York, it serves nearly 2,000 companies in 60 nations. The Conference Board first commanded attention in Canadian educational circles in the early 1980s by publishing its initial set of employability skills. Subsequently, under the auspices of an authoring panel that included, among others, representatives from four provincial ministries of education and five individual school districts, the Conference Board has published Employability Skills 2000+ (2000a) and the complementary Employability Skills Toolkit for Self-managing Learners (2000) (2000b). Three general sets of skills and attributes were identified as foundational for students leaving school:

• the skills needed as a base for further development (communicate, manage information, use numbers, think and solve problems)

• the personal skills, attitudes and behaviours that drive one’s potential for growth (demonstrate positive attitudes and behaviours, be responsible, be adaptable, learn continuously, work safely)

• the skills and attributes needed to contribute productively (work with others, participate in projects and tasks). (Conference Board of Canada, 2000a)

These are but two examples of the processes of consultation and petitioning that will often contribute to the creation of a school system’s educational goals. Pressures to revisit and renew these goals are intended to better align schooling both with what is valued by society and with the perceived needs of children. In Canada, the general attention presently being paid to a variety of literacies can be attributed, in part, to the recognition that today’s citizens must be prepared “to live in and contribute to a knowledge society” (Freiler, 2009, p. 15). A look at mission statements from three provincial educational systems demonstrates the nuances in creating these missions and goal statements:

The purpose of the British Columbia school system is to enable all learners to develop their individual potential and to acquire the knowledge, skills and attitudes needed to contribute to a healthy democratic and pluralistic society and a prosperous and sustainable economy. (Ministry of Education [British Columbia], 2008, p. D94).

The Ministry of Education, through its leadership, partnerships and work with the public—including stakeholders—inspires, motivates and provides the necessary tools and opportunities for every child to attain the knowledge, skills and attributes required for lifelong learning, self-sufficiency, work and citizenship. (Alberta Education, 2009a).

Mission statement for New Brunswick Schools: To have each student develop the attributes needed to be a lifelong learner, to achieve personal fulfilment and to contribute to a productive, just and democratic society. (New Brunswick Department of Education, 2009).

Once established, such goal statements provide both immediate and long-term cognitive, social and personal targets for the activities of schooling. In Canada they also serve as reference points for those charged with breaking down system goals into grade- and subject-specific knowledge, skill and dispositional expectations with accompanying standards of performance. As a set, these objectives and performance criteria comprise the curriculum—the blueprint for teaching, learning, assessment and achievement.

Once the curriculum is made explicit, achievement becomes defined as the extent to which each child is able to demonstrate the expected standard of performance on the key outcomes (see Alberta Education, 2009b). Achievement is most often measured through instruments constructed by classroom teachers. Recently, however, school systems have turned to large-scale standardised assessment programmes to provide more definitive information about how well students are achieving. In some systems, such as in New Brunswick, this has taken the form of testing each child, on several occasions, in language, maths and science (Klinger et al., 2008). British Columbia requires students to write at least five provincial examinations as part of a graduation requirement (Ministry of Education [British Columbia], 2009). In Ontario, large-scale assessments are undertaken primarily to monitor achievement and generate feedback in language and maths development. Ontario students complete Grades 3 and 6 literacy and numeracy assessments, a Grade 9 maths assessment and the Grade 10 Ontario Secondary School Literacy Test (Education Quality and Accountability Office, 2009). Each of these provincial large-scale assessment programmes claims to serve an important informing function for parents and the public and to provide information that can improve teaching and learning.

A characterisation of large-scale assessment programmes

Most large-scale assessment programmes begin with the assumption that subject disciplines provide the foundation for their work. For example, the Trends in International Mathematics and Science Study (TIMSS) uses frequently taught content (number, geometric shapes and measures, and data display) and behaviours (knowing, applying and reasoning) to create a blueprint for the mathematics portion of the test. Another international assessment, the Progress in International Reading Literacy Study (PIRLS), operates in much the same way in building its reading test. Here, the content domain refers to two types of reading (literary and informational) and two behaviours (retrieving and interpreting).

In the United States, the No Child Left Behind programme was established by the federal government to provide funding for state educational programmes, with the proviso that students be tested every year. One example of what this legislation produced for assessment is the New England Common Assessment Program, a joint venture of the states of New Hampshire, Vermont and Rhode Island. The programme was developed specifically to meet the requirements of No Child Left Behind, and its tests measure achievement in four subject areas: reading; mathematics; writing; and science. As with TIMSS, the items are aimed at assessing common educational outcomes, in this case across three states rather than across countries. The goals of the assessment programmes are to provide information to stakeholders in the system and to “make these assessments instructionally relevant by providing information to school administrators, teachers, and parents to help them make informed decisions about student instructional needs” (Rhode Island Department of Elementary and Secondary Education, 2009, p. 1). Unlike the international tests, these tests have an ancillary aim to the accountability requirement: to help educators help their students achieve more in these subjects.

Traditionally, instruments used by large-scale assessment programmes have been administered by schools and teachers under clear and precise instructions to ensure comparability across students and sites. A New Zealand initiative is using computer technology in innovative ways to accomplish some of the goals held by government and other educational systems. The asTTle programme, Assessment Tools for Teaching and Learning (He Pūnaha Aromatawai mō te Whakaako me te Ako), is an educational resource for assessing literacy and numeracy (in both English and Māori) developed for the Ministry of Education by The University of Auckland (Ministry of Education, 2009). This programme is largely decentralised, however, with school-based educators selecting items for tests from a pool that serves their specific needs. The programme provides a large array of statistics for these items, which allow teachers and schools to compare their results to others as well as to curriculum outcomes and levels. In this programme the dominant use of large-scale assessment as an accountability vehicle is replaced by the focus on improvement, largely at a local level.

The involvement of educators in interpreting results is a feature of some programmes, especially those that have a standards-based as opposed to, or in addition to, a normative orientation. Various standard-setting recommendations exist, with no one approach dominating the discipline (Cizek, 2001). The necessity of a defensible method for setting standards is increased where the stakes are high (e.g., No Child Left Behind) and less so where the focus is on improvement (e.g., asTTle).

Thus, most large-scale assessment programmes traditionally have two purposes, accountability and improvement, but differ in their emphasis. The accountability orientation is most evident in the technical operations emphasising reliability and standardisation of administration, scoring and interpretation. These technical requirements limit the usefulness of the data for individuals and classes by providing largely homogeneous item types and routines for scoring. The improvement information comes from student responses to test items. Construction of the items is typically guided by a table of specifications that identifies both the content strands to be tested and the expected student behaviours (e.g., identify, apply). Although it is common for these tables to do a good job at sampling the content, there is typically less sampling of possible behaviours. These characteristics and limitations create issues for the usefulness and adequacy of tests constructed in these ways for these purposes. In the educational literature, these are known as issues of validity.
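
As an illustration only, the hypothetical blueprint below shows the kind of table of specifications referred to above. The content strands, behaviour categories and item counts are invented for this paper and are not drawn from any of the programmes discussed; the point of the example is that content can be sampled broadly across the rows while the higher-order behaviours in the right-hand columns are sampled thinly.

Content strand       Identify   Apply   Reason/extend   Total
Number sense             6         4          1          11
Measurement              5         3          1           9
Geometry                 5         3          0           8
Data management          4         2          0           6
Total                   20        12          2          34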

Constructs of validity in evaluation and assessment

Validity in programme evaluation

An ongoing challenge in conducting programmes of large-scale assessment is how to judge whether the programme has merit, worth and significance. Tied to this challenge is the task of how to select the warrants on which these judgements will be made. The field of programme evaluation continues to debate the qualities of valid evaluative judgements. Under consideration are questions such as:

• To what extent should evaluative judgements support social decision making?

• To what extent should they take into account considerations such as fairness, social responsibility and social consequence?

• To what degree should the method of inquiry (quantitative or qualitative) influence conceptions of validity (Norris, 2005)?

Surrounding these debates is the uneasy realisation that any authentic evaluation of a programme can produce results that are highly contentious. For example, findings might suggest the need to change the programme emphasis, or redistribute decision-making authority and resources. Any of these findings could have serious implications for those working within the programme. Thus the highly political nature of evaluation will also shape how stakeholders construe the validity of the judgements that ensue from a programme evaluation. It is for this reason that Norris (2005) reminds us that the ability to trust an evaluation’s account of a programme will rest squarely on the quality of the conditions, reasons and arguments used to construct that account.

These debates and considerations provide insights into the complexity of conducting a programme evaluation, but they do not—nor should they—deter evaluation. This is especially true when a programme draws on significant amounts of public funding and affects large numbers of captive school children. Once the decision to evaluate has been made, stakeholders and evaluators typically negotiate the evaluation purpose and the approach or model of evaluation that fits with that purpose. When the goal is to demonstrate public accountability (a usual obligation of large-scale assessment programmes), the attention of evaluation usually focuses on programme effectiveness. But it is at this point that another critical decision must be made: Will the evaluation implement a model that focuses on results or effects (Hanson, 2005)? In implementing a results model, the evaluation addresses directly whether the programme has been effective in achieving goals established within the programme. The evaluation of processes and outcomes is considered and examined only if they have some demonstrated relationship to these predetermined programme goals. This may be exactly what is required, especially if there are concurrent informing purposes for the evaluation such as comparing the actual programme intentions and outcomes with the planned ones. An alternative is an effects model of evaluation. This approach places an emphasis on assessing the consequences of implementing the programme and the degree to which the programme is serving a profile of demonstrated needs. In evaluation circles, this approach is labelled “goal-free evaluation” (GFE) (Scriven, 2005).

In programmes that are technically or logistically complicated, reach across multiple contexts and affect a number and variety of user and stakeholder groups, GFE has some particular advantages. Most large-scale assessment programmes appear to meet these criteria. Alkin (1972), in an early critique of GFE, suggested that the approach is appropriate only when the consequences of the evaluation are able to extend to policy formulators. Since large-scale assessment programmes are sustained by high-level policy decision makers, GFE appears to meet this criterion as well. Scriven argued that “The value of GFE does not lie in picking up on what everyone else ‘knows’, but in noticing something that everyone else has overlooked, or in producing a novel overall perspective” (Scriven, 1991, p. 59). In assessing for effects, GFE has the potential to see not only the intended effects but side effects as well.

In some programme contexts these side effects are anticipated and match what programme planners and managers had imagined. For example, the fact that there was teacher resistance to the introduction of large-scale assessment in Ontario in 1996/97 was predictable given there was no existing culture of large-scale assessment and, on the face of it, teachers stood to take the bulk of the criticism should achievement results not meet the standards the programme had established for students. But more important for Scriven are the side effects that are not anticipated. Sometimes these are as positive and important as the intended effects. Again using the Ontario example, it was learnt by external evaluators that participating in the marking of students’ responses to the large-scale assessment items provided meaningful professional development for a significant majority of the teachers (Cousins & Shulha, 1997). In this instance, the evaluation made a case for the continued investment of resources into the teacher marking process and provided those working within the large-scale assessment programme with some needed positive teacher feedback.

The danger for programmes that do not consider the use of GFE is blindness to the presence and persistence of unexamined side effects that have hidden and unacceptable negative consequences for stakeholders. For Scriven, this would include any derailment of the system from its responsibility to address the significant needs of programme consumers—in this case, students. What is clear from the effort and resources school systems invest in involving their constituents and stakeholders in defining educational goals, translating these goals into cognitive, social and personal curriculum expectations and then supporting teachers in their efforts to implement the curriculum is that the intention of schooling is to serve significant student needs. What is unclear is the extent to which large-scale assessment programmes—even ones highly effective at assessing students on a specified band of academic outcomes and presenting the results as comprehensible indicators of student achievement—are contributing to a school system’s efforts to be accountable in meeting these needs. A goal-free evaluator concerned about the validity of such a judgement would take great care to document all the effects of the programme, including how the programme might be contributing to or moderating the ability of the school system to attend to the full range of student needs.

Validity in the assessment of learning

Samuel Messick (1989), in his seminal chapter in Educational Measurement (3rd ed.), examined the construct of validity from a measurement perspective. Prior to the publication of Messick’s chapter, validity in measures was described in several ways. For educational tests, for example, “content validity”, the adequate sampling of the knowledge domain, was considered vital and often sufficient for a test to be considered valid. (Notice that this concept is most useful in constructing a test, but says little about how the results of the test might best be interpreted.) Messick argued that, for a variety of reasons, using only one particular aspect of validity might be misleading, and that “construct validity”, in which an argument was produced that the test measured a particular theoretical construct well and nothing else, was essential for claiming that a given measure was truly valid. If, for example, an educational test had content validity but in its operation unintentionally measured another construct, its results were likely to be misinterpreted.

Messick’s treatment of construct validity, however, extended beyond the psychometric qualities of measurement instruments. He maintained that the validity of any measure is also a function of the interpretations, decisions and social uses to which these instruments and their results are put. Simply stated, the consequences of measurement must be of value, or at least do no harm, for the instrument to have high construct validity. The other so-called validities typically used to give credibility to measures (e.g., content, predictive, concurrent) are ways in which aspects of constructs are validated, but none by itself, he argued, constitutes the full treatment that construct validity provides.

Although Messick’s analyses of what is at the core of valid measurement created much research and discussion in academic circles (e.g., Moss, 1992), and in some applied ones (e.g., Linn, Baker, & Dunbar, 1991), they seem to have influenced the conduct of large-scale assessment practices and programmes only marginally, if at all. Those responsible for implementing and monitoring such programmes seem to assume the programmes they have committed to are good, that the outcomes are worth attaining and that they make positive differences to the people experiencing them and to the society paying for them. Another stance, typically taken by those responsible for implementing large-scale assessment programmes, is that educational outcomes have already been established by others, and that the purpose of their work is to determine the degree to which some of these outcomes have been attained.

An additional claim made by many of these programmes is that the information produced should be used by educators to improve performance. If this goal is one to which an assessment programme aspires, then it should be evident (a) in the way the assessments are constructed, (b) in the criteria given for task inclusion in the tests and (c) in the logic of the recommendations for how the results should be used. Test reliability (necessary for the accountability purpose) has typically been of fundamental concern for large-scale assessment, and this goal has produced a reliance for many programmes on short-answer items, objectively scored, which possess such psychometric qualities as high discrimination and moderate difficulty levels (e.g., TIMSS, PIRLS). This tradition of requiring specific responses to specific objectives emphasises a view of learning that is behaviourist in orientation: learning is the accumulation of learning bits more than the integration of learning concepts. The movement in some programmes to include more multistage tasks, combined with the involvement of educators in the scoring, would seem to be more in line with more modern conceptions of learning evident in many classrooms (e.g., asTTle 2007). When these types of items allow for demonstrations of a range of achievement, this too would indicate the possibility of using items for diagnostic purposes, especially if they were to be built using findings on assessing growth (Biggs & Collis, 1982; Fostaty-Young & Wilson, 2000; Wilson, 1996). Finally, many assessment programmes produce materials for educators to use in their classrooms and schools, largely taken from the tests and item analyses conducted on the students’ performance (e.g., Education Quality and Accountability Office, 2009). Providing these materials leaves open the question of how they get used, a question that is typically placed outside the boundary of large-scale assessment programmes.
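
To make the psychometric vocabulary above concrete, the sketch below (in Python, using NumPy) shows how classical item difficulty and discrimination indices are commonly computed from a matrix of scored responses. It is a minimal, hypothetical illustration only: the response data are invented, and these classical indices stand in for the more elaborate analyses (often based on item response theory) that operational programmes actually use.

# Illustrative sketch only: classical item difficulty and discrimination,
# computed from a small, invented matrix of dichotomously scored responses.
import numpy as np

# rows = students, columns = items; 1 = correct, 0 = incorrect
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

total_scores = responses.sum(axis=1)

for item in range(responses.shape[1]):
    item_scores = responses[:, item]
    # Difficulty (p-value): proportion of students answering the item correctly.
    difficulty = item_scores.mean()
    # Discrimination: correlation between the item score and the total score
    # on the remaining items (the item itself is removed to avoid inflation).
    rest_scores = total_scores - item_scores
    discrimination = np.corrcoef(item_scores, rest_scores)[0, 1]
    print(f"Item {item + 1}: difficulty = {difficulty:.2f}, "
          f"discrimination = {discrimination:.2f}")

In these classical terms, an item of “moderate difficulty” has a p-value somewhere near the middle of the 0–1 range, and a “highly discriminating” item shows a clearly positive correlation with overall performance; these are the qualities that the short-answer, objectively scored items favoured by many programmes are selected to exhibit.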

Whether another approach to gathering information about learning could achieve these goals better or more completely is seldom asked. For example, most large-scale assessments in education are tightly constrained to the curriculum. The purpose of this alignment is to demonstrate high levels of content validity to the teachers who administer these tests, the students who take them, the parents who question them and the authorities who fund them. A consequence of this focus is that both the developers and the stakeholders are distracted from considering in any systematic way the extent to which there is validity in the assumed underlying construct of achievement.

For Messick and the authors of this article, this narrow orientation to validity is problematic. Many extra-test judgements are typically made from these large-scale achievement scores; for example, about teacher competence, programme quality and individual capacity to learn. There are also broader social and political consequences associated with administering such assessments and the publishing of findings. Results from international assessments, such as the Trends in International Mathematics and Science Study (TIMSS 1999, 2003, 2007), are typically used in predicting the market value of human capital by country as well as guiding national educational policy making (Barro & Lee, 2001; National Center for Educational Statistics, 2008). In the United States, within the framework provided by the educational policy No Child Left Behind, repeated failure by schools to meet state-wide proficiency standards can result in their closure (Smith, 2005).

A conclusion to this analysis would hold that large-scale assessment produces data on only a subset of important educational goals (albeit an important one) and that most programmes escape the rigorous evaluation recommended by authorities in the field, especially concerning unintended consequences. Perhaps by examining schooling in more holistic ways we might see what this emphasis means in practice. To that end, we will now illustrate how two teachers worked towards assessing students on learning outcomes consistent with educational goals. Next, we will analyse how achievement, as exemplified in these two contexts, distinguishes itself from the meaning of achievement embedded in large-scale assessment results. Finally, from this analysis we will discuss the implications for both programmes of large-scale assessment and teachers’ assessment practices.

A tale of two teachers

As part of a continuing programme of research into teachers’ classroom assessment practices, the authors initiated a routine of systematic inquiry with two doctoral students and two teachers, where the focus of attention was on dilemmas of each teacher’s choosing. David Notman taught several sections of business at various grade levels in an academically oriented high school. Bob Petrick, a Grade 6 teacher, worked in a self-contained classroom of 25–30 students, many of whom were high risk and had high needs. Both schools were located in a district school board whose explicit mission was “to prepare students to face a changing world as life-long learners and informed responsible citizens” (Limestone District School Board, 2009).

As researchers, we had no pre-set questions for the investigation, although we were interested in the thinking that guided these teachers’ practices and the pressures, if any, they faced in trying to implement their thinking:

We began to call our method of inquiry ‘collaboration’. The desired outcome of this method, beyond any that could be applied to improving classroom practice, was the joint construction of meaning around particular facets of classroom assessment. This is not the same as a quest for a common meaning of classroom assessment practice. Efforts at meaning making required us to participate in constant iterative dialogue. It was primarily through such dialogue that each of us was able to access the implicit assumptions and tacit knowledge that shaped our individual as well as our group’s understandings, beliefs and practices … When there were differences in our thinking, it was a signal for either more dialogue or more data. The issue of interest determined whether this data was qualitative or quantitative. Differences in understanding, then, were not problematic and in need of resolution, but generative and in need of testing. (Shulha & Wilson, 2002, p. 661)

Notman and Petrick worked with us formally for three years, giving us full access to their classrooms, their students and their thinking. In return, we brought to the table newer notions of assessment, academic and practical resources and a willingness to work alongside them with their students. In reflecting on the project it would be accurate to say that the common theme underpinning each of their particular dilemmas was how to take their school system’s goals—goals that were congruent with their own values—and assess in ways that were congruent with the achievement of these goals.

The dilemmas that brought these two teachers into collaboration with us were quite different. Notman was frustrated by the fact that his current classroom assessment practices were incongruent with the teaching and learning environment that he was continuously trying to nurture. He seemed to be sabotaging his own efforts to get students to risk their ideas in approaching problems in business by regularly testing them on their acquisition of content and expectations mandated by the curriculum:

Everything we were learning together came to a screeching halt on test day. I would have students who wouldn’t even come into the room—who would hang around outside—making various claims about how they didn’t study and offering these up as badges of courage. I became the enemy. (Notman, 2002)

Eventually he could not ignore the fact that while he was accumulating marks for grades in conventional ways—ways that were consistent with conventional testing—this activity had a deleterious effect on his ability to help students achieve the learning that he worked so hard to nurture during the intervals between tests and exams.

At the same time, Petrick was facing the predicament of how to prepare students and parents for the EQAO large-scale assessments in language and maths literacy. He also had to think about preparing them for the results. The thought of large-scale standardised assessment distressed Petrick a great deal, but not because he held an ideological objection to such assessments. His fear was that a large majority of his high-risk, high-needs students would, predictably, score very poorly.

Petrick was experienced in working with this population of students and over the years had developed a tacit understanding of how to trigger trust and risk-taking behaviours in his students. He used this understanding to individualise attention and celebrate benchmarks of student growth and achievement along cognitive, social and personal dimensions of growth. His dilemma was how to prevent the judgements his students would receive from authorities outside his classroom from doing harm to these already fragile novice learners. He was deeply worried that achievement results from large-scale assessments would discourage parents and smother the enthusiasm for learning that he was so carefully nurturing.

Although their dilemmas were different, both teachers had come to view their decisions about assessment as the linchpin for quality teaching and learning. They also shared a commitment to types of learning not easily scored on either standardised instruments or teacher-constructed instruments designed using a measurement tradition. For them, devising different ways to assess curriculum expectations was not a complete solution. Their insistence on thinking about learning as a multidimensional rather than simply a cognitive experience, and on assessing in ways that could provide them with individual information on all dimensions of learning, provided us all with insights on how to proceed.

Both teachers were largely successful in resolving their dilemmas (Notman, 2000; Petrick, 2000). In doing so they also discovered age-appropriate ways to reinforce their system’s goal of having students become active and responsible members of a learning community. Notman was able to capitalise on his students’ gregariousness, natural enthusiasm for pop culture, need to demonstrate independence and apparently unlimited capacity to surprise. To do this he instituted personalised portfolios that, when presented, characterised even the most complex workings of their business of choice (e.g., Coca Cola, Nike, The Gap, Disney). The audience for student learning went beyond the teacher and included peers and parents. In fact, a requirement of the course was that parents attend a student-led conference and become familiar with the nature of the course, the variety of expectations related to learning and the ways in which their son or daughter was responding to these challenges.

The outcomes of this new course, in most cases, went well beyond curriculum expectations. This conclusion was not Notman’s alone. Parents who had previously accepted or rejected their opportunity for the semi-annual 15-minute progress report were observed spending 40–90 minutes listening and discussing with their adolescent what had been learnt and what it meant to learn in this class. Parents’ demonstrations of amazement, gratitude and enthusiasm came primarily from the opportunity they had been afforded to talk in a meaningful way with their adolescent. They appreciated being informed about the goals and learning expectations for the course, being included in a critiquing/assessment role for this learning, being asked to identify additional learning they valued in their child and negotiating with their child and the teacher ways they might support continued learning (Wilson, Shulha, & Notman, 1996). Virtually without exception, parents or significant adult surrogates attended these conferences.

Interviews with students also confirmed wide-ranging outcomes. Students offered reactions such as, “I learned how important it is to be able to ask, think, and be prepared to talk about what you know” and “This course is something you do for yourself. I learned how to manage a portfolio and my time.” What was reassuring for Notman was that these comments came from students who were aware that their final grades in the course were “good” but relatively low compared to most of their peers. Notman himself was delighted to observe that an important and not insignificant consequence of his new approach to assessment was that many more students were now able to meet most or all of the formal curriculum expectations, resulting in substantially higher grades (Shulha, 1998).

Petrick’s challenge was also multidimensional. He was already committed to teaching and learning practices that encourage cognitive, social and personal learning and individualised growth. He was looking for assurance that he was “on strong ground” in these practices. It was important to Petrick that all assessments be understandable to both students and parents. He was also wondering how he would connect the intentions underpinning provincial assessments, provincial rubrics and his assessment instruments and how he might resolve any apparent contradictions that might arise from these three sources of information about growth and learning.

Petrick’s efforts also led to a programme of self-assessment, peer assessment, portfolio development and student-led conferencing. An additional shift in his practice, however, was in the way he adapted his approach to continuous daily assessment of academic performance. He had always been more interested in the quality of student products than the completion of them. What he was able to explore during our collaboration was how to identify different levels in students’ cognitive responses to curricular tasks. He would observe whether students’ learning was focused on foundational ideas and rules, or whether they were making connections within the curriculum or between the curriculum and their lived experiences. Those focusing on information alone would be encouraged to make such connections. A more complex operation that he began to observe was his students’ abilities to extend their new understandings outside the context in which the learning had taken place.

This approach to understanding the quality of learner responses to complex tasks is the foundation of the Ideas Connections Extensions (ICE) taxonomy of assessment (Fostaty-Young & Wilson, 2000; Wilson, 1996). This practical guide for helping teachers to see learning in action is anchored by the works of Benner (1994), Biggs (1999), Biggs and Collis (1982), Bruner (1996) and Dreyfus and Dreyfus (1980). In Petrick’s class, students received direct instruction about what it meant to learn and how ICE could help them to monitor their learning. Students practised looking at the results of their efforts in these three ways. Soon, students became quite accurate in their ability to report how far they were progressing in understanding the constructs underpinning instructional units. Celebrations of students who “ICE’d” their work became commonplace (Petrick, 2000).

Another significant consequence of learning in this way was the capacity students developed for teaching their parents about the ICE model. In student-led conferences, the students would describe samples from their portfolios in terms of these levels of complexity. As a result, parents were able to develop a far more in-depth understanding of both what and how their child was learning and, with Petrick’s help, to identify ways to support this learning at home. By integrating this form of assessment into his approach to teaching and learning, Petrick could discuss on an individual basis not only “next steps” in a student’s learning but also the individually appropriate ways in which these steps could be taken.

What Petrick found in ICE was a form of assessment that allowed students who typically do not meet provincial standards of academic performance to understand what it meant to work towards these criteria. He wanted to implement assessment practices that helped students and their parents set reasonable goals while celebrating other dimensions of achievement along the way. In this way he was establishing learning as a lifelong enterprise rather than an episodic process capped off by a grade. Standards set outside of the classroom were seen by Petrick as useful targets, but on a daily basis comparison of student work to these targets took backstage to comparisons made to work done previously by the student. A richer understanding of both of these comparisons resulted when they were mapped onto evidence of each student’s social and personal growth during the reporting period. In this way Petrick, along with the students and their parents, defined what it meant to be a “responsible citizen” of the classroom community and put themselves in a position to make more complete and complex judgements about what it meant for them to achieve.

We were fortunate to be able to work with both Petrick and Notman for two years beyond the initial three-year agreement for collaborative inquiry. Both teachers became recognised leaders in thinking about assessment, leading workshops and collaborating with colleagues in school-based projects. Petrick was given a district award for the quality of his teaching practices; Notman was conscripted by provincial education officers to develop curriculum resources that would help teachers focus on the quality of their students’ learning.

The particular practices of large-scale and classroom assessment

The experiences of Petrick and Notman help to sharpen the contrasts between the assumptions underpinning teachers’ classroom assessment practices and large-scale assessment. Making judgements about achievement using instruments designed to assess only cognitive proficiency as defined by a curriculum, for example, misses many of the goals that various publics, classrooms and schools have for effective education. Creating uses for knowledge, developing expertise in social as well as academic contexts, approaching learning with positive personal values, social conscience and multiple perspectives are sampled poorly if at all by current large-scale assessment programmes.

Furthermore, we observed that, at least in this context, the fervour around gathering data, mainly from students, on the retention and application of a sample of the curriculum could easily distract attention from aspects of schooling that the teachers, students and parents we encountered valued most. These aspects included the documenting and reporting of how students learnt independently, with each other and with their teachers; attending to a variety of educational goals; and age-appropriate ways of reporting growth.

The individual nature of learning is necessarily truncated in large-scale assessments, mainly due to the conditions inherent in the practice. For example, because of the expense involved in mass assessments, a restrictive set of curriculum outcomes is assessed with largely restrictive responses. In addition, the feedback that typically follows from large-scale assessments links academic achievement to the quality of curriculum implementation. Subsequent school improvement plans then deal with renewing resources, changing instruction and promoting teacher professional development. All of these actions assume that (a) the curriculum as it was assessed by the large-scale assessment programme is what is most important for students to achieve and (b) the resulting recommendations, when implemented, will result in improved achievement on the curriculum as a whole. Even if student performance on these types of assessments does improve, whether such improvements are “beneficial” to students can only be answered in light of evidence collected using Scriven’s goal-free evaluative approach and taking into account both Messick’s and Norris’s social utility and significance criteria. Nevertheless, these assessments are deemed to be authoritative, objective and uncontaminated by educators’ biases. Thus they are given a high status for making judgements about the quality of education.

Towards valid assessment

In large-scale assessment, few outside the community of technical experts are asking the assessment specialists about the validity (in both the measurement and evaluative senses) of their practices. In part this is because few stakeholders, including teachers, possess more than naïve notions about which questions need to be asked. As a consequence, it is easy to make judgements and interpretations that are not warranted by the qualities of the information gathered.

Large-scale assessments are best used to show the degree to which core curriculum has been implemented and core cognitive outcomes have been achieved. They cannot be concerned with parts of the curriculum that are difficult to assess, nor with whether the culture and context of a particular school have led to prioritising some curriculum outcomes over others. When the distance between the creation of these items and the students responding to them is large, little information can be collected on how individual students are progressing towards cognitive, social or personal goals. Many of these goals can be more accurately and adequately assessed at the classroom and school level. Often, though, because the more formal assessments used in the large-scale arena are more highly valued, teachers feel compelled to model their assessment practices after them, and have their judgements of student achievement match those obtained through the external model. This discussion has hopefully brought forward some of the dangers in promoting this congruence.

Within the various educational communities there is rarely time or space for reasoned deliberation of questions concerning the role of particular assessment processes and practices and their fit to each other. Questions such as “Should achievement of the academic curriculum be the only outcome of schooling to be taken seriously?” are lost in the development of such issues as test-equating procedures (in large-scale assessment) and in the preparation of tests, exams and report cards (in classrooms and schools). Whether the context is large-scale assessment or classroom assessment, our studies suggest that when the primary goal is to measure achievement of particular curriculum expectations, there is a narrowing of both what is assessed and the tools used for assessment. The goals of education, no matter how holistic in theory, become truncated. The learning that is measured is that which is firmly anchored within the subject disciplines and is resistant to the effects of students engaging in it. In fact, inquiry into the fit of the attained curriculum (as measured by both large-scale and classroom assessments), the published curriculum and the goals of education would be a reasonable place to begin an investigation of the validity of all assessment practices.

Few would dispute that classrooms where the teacher is the ultimate authority, where students are expected to reproduce codified knowledge rather than work with it and where risk taking, tolerance and equitable treatment are not honoured are poor educational environments. This judgement would be the same whether or not the students achieved high marks on teacher-made tests or scored well on large-scale assessments. Fortunately, few educators see themselves as serving only narrow conceptions of achievement. But do we really know what is being served by current assessment practices? There are risks in raising such a question. Questions about the validity of a practice do not take the same form as questions intended to improve that practice. They are less concerned with refining procedures and more with understanding immediate and long-term intended and unintended consequences. We have learnt from the contexts described above that those willing to risk asking such questions are motivated by a strong desire to know that the way they assess actually makes a meaningful contribution to meeting the needs of students. Restricting the discussion to how to conduct present practice more effectively misses the opportunity to widen the view to encompass the full range of goals of the entire enterprise.

References

Alberta Education. (2009a). 2008–2011 business plan. Retrieved 1 February 2009, from http://education.alberta.ca/department/businessplans.aspx

Alberta Education. (2009b). Student achievement and testing. Retrieved 13 June 2009, from http://education.alberta.ca/resources/backtoschool/testing.aspx

Alkin, M. C. (1972). A classification scheme for objectives based evaluation systems. Evaluation Theory Program. Los Angeles: UCLA Graduate School of Education.

Barro, R. J., & Lee, J-W. (2001). International data on educational achievement: Updates and implications. Oxford Economic Papers, 3, 541–563.

Benner, P. (1994). From novice to expert: Excellence and power in clinical nursing practice. Reading, MA: Addison-Wesley.

Biggs, J. B. (1999). What the student does: Teaching for enhanced learning. Higher Education Research and Development, 18(1), 57–75.

Biggs, J. B., & Collis, K. F. (1982). Evaluating the quality of learning: The SOLO taxonomy. New York: Academic Press.

Bruner, J. (1996). Frames for thinking: Ways of making meaning. In D. Olsen & N. Torrance (Eds.), Modes of thought (pp. 93–105). New York: Cambridge University Press.

Cizek, G. J. (Ed.). (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum.

Conference Board of Canada. (2000a). Employability skills 2000+. Retrieved 22 August 2008, from http://www.conferenceboard.ca/topics/education/learning-tools/employability-skills.aspx

Conference Board of Canada. (2000b). Employability skills toolkit for self-managing learners (2000). Retrieved 22 August 2008, from http://www.conferenceboard.ca/topics/education/learning-tools/toolkit.aspx

Cousins, J. B., & Shulha, L. M. (1997). A perspective on the marking process: Results of a focus group interview with teacher markers from the provincial Grade 3 math and language assessment. Toronto: The Federation Cooperative and the Education Quality and Accountability Office.

Dreyfus, S. E., & Dreyfus, H. L. (1980). A five-stage model of the mental activities involved in directed skill acquisition. Storming Media. Retrieved 22 August 2008, from http://www.stormingmedia.us/15/1554/A155480.html

Education Quality and Accountability Office. (2009). Parent resources. Retrieved 12 June 2009, from http://www.eqao.com/Parents/parents.aspx?status=logout&Lang=E

Fostaty-Young, S., & Wilson, R. J. (2000). Assessment and learning: The ICE approach. Winnipeg: Portage & Main.

Freiler, C. (2009). From human capital to human development: Transformation for a knowledge society. Education Canada, 49(2), 15.

Gilmore, A. (2008). Professional learning in assessment. Report to the Ministry of Education for the National Assessment Strategy Review. Wellington: Ministry of Education.

Hanson, H. F. (2005). Choosing evaluation models: A discussion on evaluation design. Evaluation, 11(4), 447–462.

International Baccalaureate. (2008). Mission and strategy. Retrieved 21 November 2008, from http://www.ibo.org/mission

Klinger, D., DeLuca, C., & Miller, T. (2008). The evolving culture of large-scale assessments in Canadian education. Canadian Journal of Educational Administration and Policy, 76.

Limestone District School Board. (2009). Statement of beliefs. Retrieved 11 February 2009, from http://www.limestone.on.ca/Board/S034086FA-03CB3E67

Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15–21.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.

Ministry of Education. (2007). The New Zealand curriculum. Wellington: Learning Media.

Ministry of Education. (2009). AsTTle: Assessment tools for teaching and learning (He pūnaha aromatawai mō te whakaako me te ako). Retrieved 13 June 2009, from http://www.tki.org.nz/r/asttle/index_e.php

Ministry of Education [British Columbia]. (2008, November). Statement of education policy order: Mandate for the school system. School Act, Section 169(3) D92–D98.

Ministry of Education [British Columbia]. (2009). Graduation program requirements. Retrieved 1 February 2009, from http://www.bced.gov.bc.ca/resourcedocs/k12educationplan/mission.htm

Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62(3), 229–258.

National Center for Educational Statistics. (2008). Highlights from TIMSS 2007: Mathematics and science achievement of US fourth- and eighth-grade students in an international context. Retrieved 22 August 2008, from http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2009001

New Brunswick Department of Education. (2009). Mission statement. Retrieved 20 June 2009, from http://www.gnb.ca/0000/about-e.asp

Norris, J. (2005). Validity. In S. Mathison (Ed.), Encyclopedia of evaluation (pp. 439–442). Thousand Oaks, CA: Sage.

Notman, D. (2000, May). Another way of coming down stairs: How portfolio assessment influences issues of ownership and control. Paper presented at the annual meeting of the Canadian Society for Studies in Education (CSSE), Edmonton, Alberta.

Notman, D. (2002, April). Preparing to assess. Presentation to pre-service education candidates, Faculty of Education, Queen’s University, Kingston, Ontario.

Ontario Office of the Premier. (2009). Making student achievement our top priority. Retrieved 29 May 2009, from http://www.premier.gov.on.ca/news/event.php?ItemID=5931&Lang=en

Organization for Economic Co-operation and Development. (2005). OECD publication identifies key competencies for personal, social and economic well-being. Retrieved 12 July 2008, from http://www.oecd.org/document/50/0,3343,en_2649_33723_11446898_1_1_1_1,00.html

Petrick, R. (2000, May). The relationship between learning and assessment: Views of an elementary classroom teacher. Paper presented at the annual meeting of the Canadian Society for Studies in Education (CSSE), Edmonton, Alberta.

Queensland Department of Education, Training and the Arts. (2008). Statement of affairs. Retrieved 29 May 2009, from http://74.125.95.132/search?q=cache:Ffgo7jJOE6gJ:education.qld.gov.au/publication/production/reports/pdfs/statemnent-of-affairs-o8.pdf+Queensland+Department+of+education+and+the+arts.(2008).+Statement+of+affairs&cd=1&hl=en&ct=clnk

Rhode Island Department of Elementary and Secondary Education. (2009). New England Common Assessment Program (NECAP). Retrieved 1 June 2009, from http://www.ride.ri.gov/Assessment/necap.aspx

Royal Commission on Learning. (1994). For the love of learning: Short version. Toronto: Queen’s Printer for Ontario.

Scottish Government. (2009). Curriculum for excellence. Retrieved 29 May 2009, from http://www.scotland.gov.uk/News/Releases/2009/05/14093138

Scriven, M. (1991). Pros and cons about goal free evaluation. Evaluation Practice, 12(1), 55–76.

Scriven, M. (2005). Goal-free evaluation. In S. Mathison (Ed.), Encyclopedia of evaluation (p. 171). Thousand Oaks, CA: Sage.

Shulha, L. (1998, April). Students’ responses to portfolio development and student-led conferencing. Paper presented at Measurement and Evaluation: Current and Future Research Directions for the New Millennium, Banff, Alberta.

Shulha, L., & Wilson, R. (2002). Collaborative mixed-method research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methodology (pp. 639– 701). Thousand Oaks, CA: Sage.

Smith, E. (2005). Raising standards in American schools: The case of No Child Left Behind. Journal of Education Policy, 20(4), 507–524.

United States Government. (2002). No Child Left Behind. Public Law 107–110.

Wilson, R. J. (1996). Assessing students in classrooms and schools. Scarborough, ON: Allyn & Bacon.

Wilson, R. J., Shulha, L. M., & Notman, D. (1996, September). Portfolio assessment and student-led conferences at the secondary level. Paper presented at the annual meeting of the Ontario Educational Researchers’ Council, Royal York Hotel, Toronto, Ontario.

Note

1 For other examples, see International Baccalaureate, 2008; Organization for Economic Co-operation and Development, 2005.

The authors

Lyn Shulha is Professor of Education at Queen’s University, Kingston, Ontario, Canada.

Email: Lyn.Shulha@queensu.ca

Robert Wilson is Professor of Measurement and Evaluation in the Faculty of Education at Queen’s University, Kingston, Ontario, Canada.

Email: wilsonr@queensu.ca