Why Level 2 Assessments Given Immediately After Learning Are Generally Dangerous and Misleading

Note from Will Thalheimer: This is an updated version of a newsletter blurb written a couple of years ago, where I made too strong a point about the dangers of Level 2 Assessments. Specifically, I claimed that Level 2 Assessments should never be used immediately after learning, which may have been pushing the point too far. Upon rethinking the issue a bit, I’ve concluded that Level 2 Assessments immediately after learning are still dangerous, but there may be some benefits to using them if the overall assessment process is designed correctly. You can be the judge by reading below.

Introduction to Levels 1 and 2 Assessments

Before we get to Level 2 evaluations, let’s talk about Level 1 of Kirkpatrick’s 4-level model of assessment. Level 1 is represented by the "smile sheets" that we hand out after training or include at the end of an e-learning course. They typically ask learners to rate the course and to judge how likely they are to use the information they learned. These evaluations are valuable to get learner reactions and opinions, but they provide a very poor gauge of learning and performance. The fact that we rely on these almost exclusively to assess the value of our instruction is unconscionable.

Level 1 Assessments are not always good predictors of learning. Learners may give a course a high score but not remember what they learned. Learners are also famously optimistic about what they will remember. Just because they tell us they’ll remember information and use it in their work doesn’t mean they will. Learners also fill in smile sheets based on whether they like the course or the instructor. Courses that challenge learners may be rated poorly, even though a challenge might be exactly what is needed to push a significant behavior change.

Level 2 Assessments are intended to measure learning and retention. We want to know whether the information learned is retrievable from memory. Ideally, we want to know whether the information is retrieved and used on the job. If we measure actual on-the-job performance, we’re really utilizing a Level 3 Assessment. In comparison, Level 2 Assessments measure the retrievability of information, not its actual use. This is where the problems start.

What is Meant Here by the Word Assessment?

First let me clarify that I am using the word "assessment" to mean a test given for the purposes of evaluating a learning intervention. Assessments can also be used to bolster learning, as when they are used to promote retrieval practice or provide feedback to the learners. It is the first use of assessments that I am concerned with in this article. Specifically I will argue that Level 2 Assessments given for the purpose of evaluating the success or failure of a learning intervention are dangerous if given immediately after the learning. However, this does not mean that assessments used for the purpose of aiding retrieval or providing corrective feedback are not valuable at the end of learning. In fact, they are excellent for that purpose.

The analysis in this article also assumes that Level 2 Assessments are well designed. Specifically, it assumes that the assessments prompt learners to retrieve from memory the same information that they will have to retrieve in their on-the-job situations. It also assumes that the cues that trigger retrieval will be similar to those that will trigger their thinking and performance on the job. It is true that most current Level 2 Assessments don’t meet these criteria, but they should.

The Problems With Immediate Assessments

When we learn a concept, we think about it. When we think about something, it becomes highly retrievable from memory, at least for a short time. Thus, during learning and immediately afterward, our newly learned information is highly retrievable. If we test learners then, they are likely to remember all kinds of stuff they’ll forget in a day.

This problem is compounded because learning is contextualized. If we learn in Room A, we’ll be better able to retrieve the information we learned if we have to retrieve it in Room A as opposed to Room B (by up to 55% or so). Thus, if we test learners in the training room or while they’re still at their desks using an e-learning program, we’re priming them for a level of success they won’t attain when they’re out of that learning situation and back on the job.

Giving someone a test immediately after they learn something is cheating. It provides an inflated measure of their learning. More importantly, it tells us very little—if anything—about how well learners will be able to retrieve information when they get back to their jobs.

On-the-job retrieval depends on both the amount of learning and the amount of forgetting.

Retrieval = Learning – Forgetting

Our instructional designs need to maximize learning and minimize forgetting. If we measure learners immediately after they learn, we’ve accounted for the learning part of the retrieval equation, but we’ve ignored forgetting altogether. Not only are immediately-given Level 2 Assessments poor tools for measuring an individual’s performance, but they also give us poor feedback about whether our instructional designs are any good. In short, they’re double trouble. First, they don’t measure what we want them to measure, and then they don’t hold us accountable for our work.
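As a rough sketch of the retrieval equation, the gap between an immediate score and a delayed one can be modeled with a simple decay curve. The decay rate and scores below are made-up illustrative numbers, and real forgetting curves vary widely:

```python
import math

# Illustrative model of Retrieval = Learning - Forgetting, with
# forgetting expressed as exponential decay over time. The decay
# rate here is a made-up illustrative value, not a research finding.
def retrieval(learning_score, days_elapsed, decay_rate=0.15):
    retained_fraction = math.exp(-decay_rate * days_elapsed)
    return learning_score * retained_fraction

immediate = retrieval(90, days_elapsed=0)   # assessed right after the course
delayed = retrieval(90, days_elapsed=14)    # assessed two weeks later

print(round(immediate))  # 90 -- an inflated, rosy result
print(round(delayed))    # much lower, closer to on-the-job reality
```

The point of the sketch is only that the immediate score reflects the learning term alone, while the delayed score also reflects the forgetting term.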

But What Happens On The Job?

All this is true in most cases, but there are complications when we consider what happens after the learning event. The analysis above is accurate in those situations when learners forget much of what they learn as they move from learning events back to their jobs. Look at the following graph and imagine doing a Level 2 Assessment at the end of the learning—before the forgetting begins. It would show strong results even though later performance would be poor.


But what happens in those all-too-rare situations when learners take the learning and begin to apply what they’ve learned as soon as they get back to the workplace? When they do this, they’re much less likely to forget—and they may even take their competence and learning to a higher level than they achieved in the actual learning event. Check out the graph below as an example.


In this case, if we did a Level 2 Assessment at the end of the initial learning, before the workplace learning begins, the assessment again wouldn’t be accurate. This time it might not adequately assess the ability of the learning intervention to facilitate the workplace learning.

Real learning interventions often generate both types of results. Learners utilize some of what they’ve learned back on the job, which bolsters their memory for it, but the rest of what they learned goes unused and is forgotten. The following graph depicts this dichotomous effect.


So Why Use Level 2 Assessments At All?

It should be clear that Level 2 Assessments delivered immediately after the learning are virtually impossible to interpret on their own. However, it may be useful to pair them with a later Level 2 Assessment to determine what is happening to the learning after the learners get back to the job.

If the learners’ level of retrieval improves, then we can be fairly certain that the learners have made positive use of the learning event. Of course, a better way to draw this conclusion is to use a comparison-group design, but such an investment is normally not feasible.

If the learners’ level of retrieval remains steady, then we can be fairly certain that the course did some good in preventing forgetting. Again, a comparison-group design would be more definitive.

If the learners’ level of retrieval deteriorates, then we can be fairly certain that the learning event did not prevent forgetting and/or was probably not targeted at real work skills. Deteriorating retrieval is the one result we want to make sure our learning designs don’t produce. Because forgetting is central to human learning, preventing it means we’re doing something important. Finally, because forgetting is normal, a comparison-group design is not as critical in ruling out alternative explanations. In other words, if we find forgetting after the learner returns to the job, we can conclude that the learning event wasn’t good enough.
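The three interpretations above can be sketched as a simple decision rule. This is a hypothetical helper, not a validated instrument; the 0-to-100 scale and the tolerance threshold are assumptions:

```python
# Hypothetical classifier for a pair of Level 2 scores: one taken
# immediately after training, one taken later on the job.
def interpret_level2(immediate_score, delayed_score, tolerance=5):
    change = delayed_score - immediate_score
    if change > tolerance:
        return "improved"      # learners likely applied the learning on the job
    if change < -tolerance:
        return "deteriorated"  # the intervention did not prevent forgetting
    return "steady"            # the course appears to have prevented forgetting

print(interpret_level2(80, 92))  # improved
print(interpret_level2(80, 78))  # steady
print(interpret_level2(80, 55))  # deteriorated
```

A comparison-group design remains the stronger way to rule out alternative explanations, as noted above; this rule of thumb only orders the three outcomes.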

To summarize this section, it appears that though Level 2 Assessments given immediately after learning have dubious merit because they’re impossible to interpret, there may be value in using immediate Level 2 Assessments in combination with delayed Level 2 Assessments.

How to Do Level 2 Assessments

Level 2 Assessments should:

1. Be administered at least a week or two after the original learning ends, or be administered twice—immediately after learning and a week or two later.

2. Be composed of authentic questions that require learners to retrieve information from memory in a way that is similar to how they’ll have to retrieve it on the job. Simulation-like questions that provide realistic decisions set in real-world contexts are ideal.

3. Cover a significant portion of the most important performance objectives.

March Madness, three weeks of college basketball tournaments in the month of March, has been estimated to cost U.S. companies $3.8 billion in lost productivity.


There can be upsides to such an energizing event as well, of course. What I wonder is whether any learning or organizational development initiatives might be wrapped around such events. Imagine the following:

  1. Company intranet portal offers March Madness links, and also includes corporate advertising of key business initiatives, strategic messages, or even training opportunities.
  2. Maybe the company even makes the links unavailable except through this central March Madness portal.
  3. Managers initiate business-critical conversations in staff meetings after highlighting the latest results for the office pool.
  4. Managers utilize March Madness frenzy for reward and recognition.

Anyway, those are just some ideas off the top of my head. I’d love to get your more thoughtful ideas in the comments. Better yet, have you seen any real-world implementations? Have they been successful?

The Bird Flu epidemic is coming. Perhaps. If it does come, and if it’s as bad as they say it might be, our lives, our families, our jobs, and our learning will all be disrupted. Here are some of the headlines:

  • The Red Cross says to stock 2 weeks of food.
  • Schools will be shut down for up to 3 months.
  • Employees will stay home from work.
  • Some businesses may lay off their workforces.
  • The food supply might be cut off temporarily.
  • Medical institutions may be overwhelmed.

NPR had a great radio segment on this. Check it out here.

The implications for business are almost incomprehensible, ranging from world economic collapse, to laying off the workforce, to protecting employees, to enabling work from home, to utilizing training and development to limit the repercussions.

Training-Learning-and-Development could be vital in at least two ways: First, it could help mobilize and educate workers. Second, it could ramp up to provide extra learning services during the crisis. If workers can’t work, they can learn. Certainly, businesses will want the workers to work, but if they can’t, perhaps this is an opportunity for strategic, revolutionary organizational change. A time for reflection, learning, bonding, and helping others.

To learn more about the flu, you can check out Elliott Masie’s compilation of sources.

I’ve been reading Richard E. Clark and Fred Estes’ recently released book, Turning research into results: A guide to selecting the right performance solutions. They recounted research that shows that an expert’s knowledge is largely "unconscious and automatic" to them. In other words, experts have retrieved their knowledge from memory so many times that they’ve forgotten how they do this and how the information all fits together—the knowledge just comes into their thoughts when they need it. This is helpful to them as they use their expertise, but it makes it difficult for them to explain to other people what they know. They forget to tell others about important information and fail to describe the links that help it all make sense.

In the learning-and-performance field we often use experts as trainers. Clark and Estes suggest that when experts teach, their courses ought to be pilot tested to work out the kinks. In my experience as a trainer, I’ve found that the first few deliveries always need significant improvements. I learn by seeing blank stares, sensing confusion, receiving questions, and watching exercises veer off in the wrong direction. This has me thinking about synchronous instructor-led web-based training.

If it’s hard to create fluent classes in face-to-face situations, it’s going to be more difficult to do this over the web. We humans are hardwired to look in people’s eyes and read their expressions. Should we avoid having experts teach our courses? Probably not. Only experts have the credibility and knowledge to teach best practices.

What does this insight say about using synchronous delivery for quick information transfer? It means that it may not be effective to hook our resident expert up to a microphone and have them start talking. If they’re talking with other experts, they’ll be fine. But we ought to be skeptical about our experts’ ability to do ad-hoc sessions without practice.

How are our expert trainers going to get the practice and feedback they need to fix the problems their expertise creates? I’m sure you can create your own list of recommendations. Here’s mine:

1. Teach the course in a face-to-face situation first to work out the major bugs. This should not be used as a substitute for realistic online pilot testing. A variation is to use focus-group rooms with the instructor behind a one-way mirror. The instructor will see how the audience reacts, but will have to use the e-learning platform to deliver the learning. If technology allows, perhaps a small number of learners can be the target of webcams that enable the instructor to see their progress.

2. Beta-test the course online (at least once or twice) with a small number of learners who are primed to give feedback. Make sure they are encouraged to register their confusion immediately and allow lots of extra time for the instructor, learners, and observers to write down their confusions, discomforts, and skepticisms.

3. Make sure the learning design provides lots of opportunities for the learners to register their confusion and ask for clarification, especially the first few times the course is run. This can involve some sort of questioning of the learners, but the questions should be meaningful, not just superficial regurgitations.

Mathemagenic Processing

In the mid-1960’s, Rothkopf (1965, 1966), investigating the effects of questions placed into text passages, coined the term mathemagenic, meaning “to give birth to learning.” His intention was to highlight the fact that it is something that learners do in processing (thinking about) learning material that causes learning and long-term retention of the learning material.

When learners are faced with learning materials, their attention to that learning material deteriorates with time. However, as Rothkopf (1982) illustrated, when the learning material is interspersed with questions on the material (even without answers), learners can maintain their attention at a relatively high level for long periods of time. The interspersed questions prompt learners to process the material in a manner that is more likely to give birth to learning.

Although the term mathemagenic was hot in the late 1960’s and 1970’s, it gradually faded from use as researchers lost interest in the study of adjunct questions and as critics complained that the word was too abstract and had little meaning beyond the operations of the research paradigm.

Despite having fallen into disfavor, the term—and the research it generated—have proven invaluable. The adjunct-question research showed us that test-like events are useful in helping learners to bolster memory for the information targeted by the question and to stay attentive to the most important aspects of the learning material. The concept of mathemagenic behavior is very much a central component in the way we think about learning. Who could doubt today that it’s the manner in which learners process the learning material that makes all the difference in learning?


Rothkopf, E. Z. (1965). Some theoretical and experimental approaches to problems in written instruction. In J. D. Krumboltz (Ed.). Learning and the education process (pp. 193-221). Chicago: Rand McNally.

Rothkopf, E. Z. (1966). Learning from written instructive materials: An exploration of the control of inspection behavior by test-like events. American Educational Research Journal, 3, 241-249.

Rothkopf, E. Z. (1982). Adjunct aids and the control of mathemagenic activities during purposeful reading. In W. Otto & S. White (Eds.) Reading expository material. New York: Academic Press.

What prevents people in the learning-and-performance field from utilizing proven instructional-design knowledge?

This is an update to an old newsletter post I wrote in 2002. Most of it is still relevant, but I’ve learned a thing or two in the last few years.

Back in 2002, I spoke with several very experienced learning-and-performance consultants who had each, in their own way, asked the question above. In our discussions, we considered several options, which I’ve flippantly labeled as follows:

  1. They don’t know it. (They don’t know what works to improve instruction.)
  2. They know it, but the market doesn’t care.
  3. They know it, but they’d rather play.
  4. They know it, but don’t have the resources to do it.
  5. They know it, but don’t think it’s important.

Argument 1.
They don’t know it. (They don’t know what works to improve instruction.)
Let me make this concrete. Do people in our field know that meaningful repetitions are probably our most powerful learning mechanism? Do they know that delayed feedback is usually better than immediate feedback? That spacing learning over time facilitates retention? That it’s important to increase learning and decrease forgetting? That interactivity can be either good or bad, depending on what we’re asking learners to retrieve from memory? One of my discussants suggested that "everyone knows this stuff and has known it since Gagne talked about it in the 1970’s."

Argument 2.
They know it, but the market doesn’t care.
The argument: Instructional designers, trainers, performance consultants and others know this stuff, but because the marketplace doesn’t demand it, they don’t implement what they know will really work. This argument has two variants: The learners don’t want it or the clients don’t want it.

Argument 3.
They know it, but they’d rather play.
The argument: Designers and developers know this stuff, but they’re so focused on utilizing the latest technology or creating the snazziest interface, that they forget to implement what they know.

Argument 4.
They know it, but don’t have the resources to use it.
The argument: Everybody knows this stuff, but they don’t have the resources to implement it correctly. Either their clients won’t pay for it or their organizations don’t provide enough resources to do it right.

Argument 5.
They know it, but don’t think it’s important.
The argument: Everybody knows this stuff, but instructional-design knowledge isn’t that important. Organizational, management, and cultural variables are much more important. We can instruct people all we want, but if managers don’t reward the learned behaviors, the instruction doesn’t matter.

My Thoughts In Brief

First, some data. On the Work-Learning Research website we provide a 15-item quiz that presents people with authentic instructional-design decisions. People in the field should be able to answer these questions with at least some level of proficiency. We might expect them to get at least 60 or 70% correct. Although web-based data-gathering is loaded with pitfalls (we don’t really know who is answering the questions, for example), here’s what we’ve found so far: On average, correct responses are running at about 30%. Random guessing would produce 20 to 25% correct. Yes, you’ve read that correctly—people are doing a little bit better than chance. The verdict: People don’t seem to know what works and what doesn’t in the way of instructional design.
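The chance baseline works out as follows, assuming the quiz items are multiple-choice with four or five options each (the option counts are my assumption; the 15-item length comes from the quiz described above):

```python
# Expected percent-correct from blind guessing on a multiple-choice quiz.
def chance_score(num_items, options_per_item):
    expected_correct = num_items / options_per_item  # e.g., 15/4 = 3.75 items
    return 100 * expected_correct / num_items        # as a percentage

print(chance_score(15, 4))  # 25.0
print(chance_score(15, 5))  # 20.0
```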

Some additional data. Our research on learning and performance has revealed that appropriate instructional-design methods can improve learning by up to 220%. Many of the programs out there do not utilize these methods.

Should we now ignore the other arguments presented above? No, there is truth in them. Our learners and clients don’t always know what will work best for them. Developers will always push the envelope and gravitate to new and provocative technologies. Our organizations and our clients will always try to keep costs down. Instruction will never be the only answer. It will never work without organizational supports.

What should we do?

We need to continue our own development and bolster our knowledge of instructional-design. We need to gently educate our learners, clients, and organizations about the benefits of good instructional design and good organizational practices. We need to remind technology’s early adopters to remember our learning-and-performance goals. We need to understand instructional-design tradeoffs so that we can make them intelligently. We need to consider organizational realities in determining whether instruction is the most appropriate intervention. We need to develop instruction that will work where it is implemented. We need to build our profession so that we can have a greater impact. We need to keep an open mind and continue to learn from our learners, colleagues, and clients, and from the research on learning and performance.

New Thoughts in 2006

All the above suggestions are worthy, but I have two new answers as well. First, people like me need to do a much better job of communicating research-based ideas. We need to figure out where the current state of knowledge stands and work the new information into that tapestry in a way that makes sense to our audiences. We also have to avoid heavy-handedness in sharing research-based insights, as we must realize that research is not the only means of moving us toward more effective learning interventions.

Second, I have come to believe that sharing research-based information like this is not enough. If the field doesn’t get better feedback loops into our instructional-design-and-development systems, then nothing much will improve over time, even with the best information presented in the most effective ways.

A new educational research organization is forming as a counterpart to AERA (the American Educational Research Association). I get the impression that the impetus for this is that too much of the current research on education has the flaw that it doesn’t really care about cause and effect.

Since cause-and-effect analysis is critical (it’s the only way to really find out what the important factors are), I’m hoping for great things from this organization.

Its name is: Society for Research on Educational Effectiveness

Let me tell you about a new program I’m offering. It’s called the Researcher in Residence Program.

It’s a 3-4 day program designed for companies or departments who have a bunch of instructional designers, developers, and/or trainers.

I come to visit. It’s part learning opportunity, part consulting engagement. Although the schedule can be customized, it typically moves forward as follows:

Day 1: Workshop on the world’s preeminent learning research, led by me. Interactive. Challenging. Thought-provoking.

Day 2: Review of client programs and projects, with feedback from me. With sharing. With learning for me and for the client.

Day 3: Special topics. Overall analysis of client strengths and blind spots. Client-led analysis and review. Review of key concepts. Action planning. Setting challenge agenda for client.

Day 4: Client leadership discussion. After-engagement planning.

Companies can choose to have me conduct learning audits to augment the experience. Such learning audits enable me to do in-depth reviews of the client’s learning programs before I arrive. This obviously costs more, but it’s helpful because it enables the discussions to go deeper.

I recently finished two Researcher in Residence Programs at two wonderful custom e-learning companies, Allen Interactions and Type A Learning Agency. If all e-learning were developed by folks like these, effectiveness would double or triple.

Although these were the first run-throughs for the Researcher in Residence Programs, they seemed to go very well. The Type A folks were so thrilled they posted a press release about the program and Michael Allen (Allen’s CEO and industry-leading e-learning guru) called the program "a very beneficial exercise" in his monthly ezine column.

If you’d like to explore the possibility of bringing the Researcher-in-Residence Program to your organization, email me or call me directly at 617-718-0067.

Will Thalheimer interviews Anna Belyaev and Gretchen Hartke of Type A Learning Agency in January of 2006.

This Segment is on journalism and the name Type A.

Questions Asked:

Given the miserable state of journalism in our field, what can we do to get better info to our fellow instructional professionals?

The name of your company is Type A Learning Agency. Who are you and what are you trying to say with that name?

MP3 File

NOTE: The full interview is available for download below the individual segments.

Will Thalheimer interviews Anna Belyaev and Gretchen Hartke of Type A Learning Agency in January of 2006.

This Segment provides an introduction and asks the first question on getting "leadership energy" into e-learning design.

Questions Asked:

Leadership gets things done in organizations. Most e-learning is devoid of this leadership factor. How can we design leadership into e-learning?

The name of your company is Type A Learning Agency. Who are you and what are you trying to say with that name?

MP3 File

NOTE: The full interview is available for download below the individual segments.