Dateline: This article will be updated periodically. As presented here, it is in its first iteration. I invite you to share your ideas and comments below.
= Will Thalheimer
I’ve been in the workplace learning field for over 30 years, have made a lot of mistakes myself and have seen other mistakes get made over and over. In the last decade, as I’ve turned my attention more and more to learning evaluation, I see us making a number of critical mistakes. Because the biggest problem with these mistakes is that we continue to make them—often without realizing our errors—I aim to capture a list of common evaluation mistakes here, and update the list from time to time. I welcome your ideas. In the comment section below, please add your thoughts. Thanks!
Common Evaluation Mistakes
Listed in no particular order… and with common themes sometimes repeated across items…
When Measuring Learner Perceptions
- We rely on smile sheets that only tell us about learner satisfaction and course reputation—they don’t tell us enough about learning effectiveness.
- We rely on smile sheets as our only metric.
- We look at our smile sheet results and forget that what we’re seeing is not all that might be seen. That is, we may not realize that our results might be neglecting critical learning results such as learners’ comprehension, their motivation to apply what they’ve learned, their ability to remember, their success in applying what they’ve learned, etc.
- We ask learners about their learning, about their on-the-job performance, and about organizational results; and think we’ve actually measured learning, on-the-job performance, and organizational results—but we only have learners’ subjective opinions about these constructs.
- We ask learners questions they won’t be good at answering. (For example: “What percentage of your learning will you use in your job?” “Did the learning help you achieve the learning objectives?” “Did your instructor help you learn?”).
- We use Likert-like scales and numeric scales, both of which are too fuzzy to support good respondent decision-making, to motivate attention to the questions, or to produce results that are clear and actionable.
- We don’t often use after-training learner surveys to get insights into learning application.
- We use affirmations in our questions, biasing our results toward the positive.
- In using Likert-like scales, we put the positive choices first, biasing responses toward the positive.
- We don’t follow up with learners to let them know what we’ve learned and the design improvements we’ve been able to target based on their feedback.
- We don’t attempt to persuade our learners of the importance of the learner surveys we are asking them to complete.
- We don’t use our survey questions as opportunities to send stealth messages to our key stakeholders about important learning-design imperatives.
Biases in Measuring Learning
- We measure learning in the learning context where learners are artificially triggered by contextual stimuli that help them remember more than they’ll remember when they are in a different context—for example, at their worksite.
- We measure learning near the end of learning, when learners have a relatively easy time remembering—so we fail to measure our learning interventions’ ability to minimize forgetting and support remembering.
- We measure low-importance learning metrics (like knowledge questions) rather than learning as represented in realistic decision-making and task performance.
Failing to Measure Learning Factors
- When we focus on measuring on-the-job performance and/or business results WHILE NEGLECTING to measure learning factors, we create for ourselves an inability to figure out how to improve our learning designs.
- When we don’t measure learning factors on a routine basis, we leave ourselves in the dark, we make it impossible to create a cycle of continuous improvement, and we are essentially abdicating our responsibility as professionals.
Failing to Compare Learning Factors
- We rarely, if ever, compare one learning method with another, as marketers do, for example in A-B testing.
- Even in elearning, where it wouldn’t be too difficult to randomize learners over different methods, we fail to take advantage of the opportunity.
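The mechanics of such a comparison are simple. Below is a minimal sketch in Python (the function names, variant labels, and scoring setup are all hypothetical, not from any particular platform): randomly assign learners to one of two learning-design variants, then compare an outcome measure per variant.

```python
import random

def assign_variant(learner_ids, variants=("method_a", "method_b"), seed=42):
    """Randomly assign each learner to one of the learning-design variants.

    A fixed seed makes the assignment reproducible, so the same roster
    always yields the same split.
    """
    rng = random.Random(seed)
    return {learner: rng.choice(variants) for learner in learner_ids}

def summarize(assignments, scores):
    """Average an outcome score (e.g., a delayed scenario-based test)
    for each variant, so the two designs can be compared."""
    by_variant = {}
    for learner, variant in assignments.items():
        by_variant.setdefault(variant, []).append(scores[learner])
    return {variant: sum(vals) / len(vals) for variant, vals in by_variant.items()}
```

A real evaluation would also want reasonable group sizes and a significance check, but even this bare-bones random split is more informative than the single-method rollouts we usually settle for.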
Not Seeing Behind the Pretty Curtain
- We too often get sucked into gorgeous data visualizations without appreciating that the underlying data might be misleading, worthless, irrelevant, etc.
- Dashboardism is a version of this. If it looks sophisticated, we assume there is intelligence underneath.
- Big data and artificial intelligence may hold promise, but rarely in learning do we have big data. Certainly, in evaluating a single course, there is no big data. Even when we do have lots of data, the data has to be meaningful to be of use. Machine learning doesn’t work well if the most important factors aren’t collected as data. When we measure what is easy to measure rather than what is important to measure, we will discover mere trinkets of meaning.
Measuring On-the-Job Learning
- While it would be great to capture data on people’s efforts in learning on the job, so far it seems we are measuring what’s easy to measure, but perhaps not what is important to measure.
- We have failed to consider that measuring on-the-job learning could have as many negative consequences as positive consequences.
- Even with the promise of xAPI, the big obstacle is how to capture on-the-job learning data without such data-capture being onerous.
- Sometimes we forget that managers have been responsible for their teams’ learning ever since the modern organization was born. A mistake we make is creating another layer of learning infrastructure instead of leveraging managers.
The Biasing Effects of Pretests
- We forget that pretests—even if there is no feedback given—produce learning effects; perhaps activating interest, triggering future knowledge-seeking behavior, creating schemas that support knowledge formation, etc. This is problematic when we take pretested learning programs as representative of non-pretested learning results. For example, when we assume that a course piloted with a pretest-posttest design will produce similar results to the same course without the pretest.
- There is a similar problem with time-series evaluation designs as earlier assessments can affect learning for both good and ill. So, for example, if we see improvements in learning over time, it might be due to the assessment intervention itself rather than the actual learning program.
Not Focusing First on Evaluation Goals
- We too often measure just to measure. We don’t think about what decisions we want to be able to make based on our evaluation work.
- Too often we don’t start with the questions we want answers to and design our evaluations to answer those questions.
- We ask questions on learner surveys that give us information that we cannot act on, or are unlikely to act on even if we can.
Not Using Evaluation as a Golden Opportunity to Educate or Nudge Our Key Stakeholders (Including Ourselves)
- We fail to use the rare opportunity that evaluation provides—the opportunity to have meaningful conversations with key stakeholders—to push specific goals we have for action.
- We fail to use evaluation to promote a brand-like idea of who we are as a learning organization. For example, by asking questions about the support we provide to help learners apply their learning to their work, we could burnish our brand image as a learning department that is also a performance-improvement department.
Not Integrating Evaluation into Our Design and Development Process from the Beginning
- Too often we begin thinking about evaluation after we’ve already designed a learning program.
- Ideally, we would start with a set of evaluation objectives (that is, clear descriptions of the metrics and evaluation methods we will use), so we know in advance—and can negotiate with stakeholders in advance—how we will measure our learning outcomes.
Designing Evaluation Items from Poorly Defined Objectives
- Too often we begin the evaluation process by specifying low-level learning objectives that utilize action verbs (e.g., list, explain, etc.) and then derive our evaluation items from those low-level constructs—causing our evaluations to focus on less-than-meaningful metrics.
- Too often we utilize Bloom’s taxonomy to design our learning-evaluation assessments, distracting us from focusing on more powerful research-inspired considerations like contextually-realistic decisions and tasks.
- Ideally, instead of starting from poorly defined instructional objectives, we should be starting from more performance-focused evaluation objectives.
Measuring Only Obtrusively
- We focus mostly on obtrusive measures of learning (like knowledge checks pasted on at the end of modules) when we could also use unobtrusive measures of learning (challenging tasks incorporated as part of the learning).
- We fail to utilize subscription-learning opportunities (short learning sessions spread over time) to measure learning, where challenges feel like learning to learners but are also used by us to evaluate the strengths and weaknesses of our learning designs.
Failing to Distinguish between Validating Data and Non-Validating Data
- We too often fail to distinguish between data that can validly assess the success of a learning intervention and data that is a poor indicator of success. Some data may be useful to us, but not indicative of the success of learning. For example, the number of people who attend a training tells us nothing about whether the learning was well designed, but it can give us data to ensure that we have a large enough room the next time we run the class.
- We too often give ourselves credit for success when it is unwarranted; for example by capturing and reporting data on attendance, learner attention, learner interest, and learner participation—all of which are non-validating data. They can tell us things, but they cannot provide a valid indication of whether learning was successful.
Failing to Consider the Importance of Remembering
- We fail to see remembering as a critical node on the causal pathway from comprehension to remembering to work performance to results. By ignoring the critical importance of remembering, we leave a big blind spot in our evaluation systems.
- We too often measure learner comprehension and assume the result will be indicative of learners’ later ability to remember what they’ve learned. This is a huge blind spot because people can demonstrate understanding today but forget that understanding tomorrow or a week from now. By fooling ourselves with short-term tests of memory, we enable our learning interventions to continue with designs that fail to support long-term remembering (and fail to minimize forgetting).
- Our reliance on the Kirkpatrick-Katzell Four-Level Model exacerbates this tendency, as the model completely ignores the importance of remembering.
Failing to Evaluate Our Use of Prompting Mechanisms
- While we know we should measure learning, we almost always completely forget to measure the use and value of prompting mechanisms like job aids, performance support, signage, and other devices for directly prompting or guiding performance.
- Similarly, we fail to measure the synergy between training and prompting mechanisms. Certainly there are better ways to mix training and job aids for example, yet we rarely test different ways to use job aids to support training results.
- We also fail to examine grassroots prompting mechanisms—those crafted not through some formal authority, but by people doing the work. By gathering grassroots job aids and evaluating them against the more formal ones we’ve developed, we can make better decisions about which ones to use.
Failing to Measure when Learning Technologies Give Us Obvious Opportunities
- As more and more learning interventions utilize some form of digital technology, we are failing to leverage the data-gathering capabilities of these technologies for learning evaluation. Even the simplest affordances go unused. For example, we could easily track how long it takes a learner to complete a task, diagnose knowledge through relatively simple mechanisms, and provide follow-up assessment items after delays. Yet too few of us use these capabilities, and our authoring tools have not been redesigned to make this functionality intuitive.
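To make two of these affordances concrete, here is a minimal Python sketch (the event names and the two-week delay are illustrative assumptions, not a standard): computing time-on-task from timestamped learning events, and flagging learners who are due for a delayed follow-up assessment.

```python
from datetime import datetime, timedelta

def time_on_task(events):
    """Given (timestamp, event_name) pairs from a learning platform,
    return the seconds between 'task_started' and 'task_completed',
    or None if the learner never completed the task."""
    times = {name: ts for ts, name in events}
    if "task_started" in times and "task_completed" in times:
        return (times["task_completed"] - times["task_started"]).total_seconds()
    return None

def due_for_followup(completion_date, today, delay_days=14):
    """True when a learner completed training long enough ago to receive
    a delayed retention check (an assumed two-week interval here)."""
    return today - completion_date >= timedelta(days=delay_days)
```

Neither function is sophisticated; the point is that the raw timestamps most platforms already record are enough to support delayed, remembering-focused measurement.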
- We too often fail to use the power of technology to enable social evaluation methods. For example, we know from the research that peers often provide better feedback than experts to support learning; surely we can apply this capability in evaluation practice as well.
- We fail to use technology to enable random assignment of learners to treatments—to different learning methods—to give us insights into what works best for our particular learners, content, situation, etc.
Failing to Push Against Poor Evaluation Practices
- Too often we report out evaluation data that are of dubious merit. For example, we highlight the number of learners who completed our programs, their general level of satisfaction, the number of words they used in a discussion forum, or whether they were paying attention during a page-turning elearning program. By reporting these out, we venerate these measures as important when they are not—or when they are not as important as other evaluation metrics.
- Our trade organizations are guilty of this as well, honoring organizations with “best of” awards that highlight the number of people who were trained, and so on.
- The Kirkpatrick-Katzell Four-Level model is silent on poor practices, except that it does rightly cast suspicion on learner reaction data by putting it only at Level 1.
Please Add Ideas or Comment
This is some of what I’ve seen, but I’m sure some of you have seen other mistakes in learning evaluation. Please add them below… Also, feel free to comment on the items in my list, improving them, adding contingencies, attempting to refute that they are mistakes, etc. Thanks for your insights!