Tag Archive for: evaluation

This post originally appeared in 2011. I’ve updated it here.

Updated Article

When companies think of evaluation, they often first think of benchmarking their performance against other companies. There are important reasons to be skeptical of this type of approach, especially as a sole source of direction.

I often add this warning to my workshops on how to create more effective smile sheets: Watch out! There are vendors in the learning field who will attempt to convince you that you need to benchmark your smile sheets against your industry. You will spend (waste) a lot of money with these extra benchmarking efforts!

Two forms of benchmarking are common: (1) idea-generation and (2) comparison. Idea-generation involves looking at other companies’ methodologies and then assessing whether particular methods would work well at our own company. This is a reasonable procedure only to the extent that we can tell whether the other companies’ situations are similar to ours and whether the methodologies have really been successful at those other companies.

Comparison benchmarking for training and development goes further, examining a multitude of learning methods and results and specifically attempting to find a wide range of other companies to benchmark against. This approach requires stringent attempts to create valid comparisons. This type of benchmarking is valuable only to the extent that we can determine whether we are comparing our results to good companies or bad ones, and whether the comparison metrics are important in the first place.

Both types of benchmarking require exhaustive efforts and suffer from validity problems. It is just too easy to latch on to other companies’ phantom results (i.e., results that seem impressive but evaporate upon close examination). Picking the right metrics is difficult (i.e., a business can be judged on its stock price, revenues, profits, market share, etc.). Comparing companies between industries presents the proverbial apples-to-oranges problem. It’s not always clear why one business is better than another (e.g., it is hard to know what really drives Apple’s current success: its brand image, its products, its positioning versus its competitors, its leaders, its financial savvy, its customer service, its manufacturing, its project management, its sourcing, its hiring, or something else). Finally, and most pertinent here, it is extremely difficult to determine which companies are really using best practices (e.g., see Phil Rosenzweig’s highly regarded book The Halo Effect), because companies’ overall results usually cloud and obscure the on-the-job realities of what’s happening.

The difficulty of assessing best practices in general pales in comparison to the difficulty of assessing a company’s training-and-development practices. The problem is that there just aren’t universally accepted, comparable metrics for training and development. Where baseball teams have wins, losses, runs scored, and such, and businesses have revenues, profits, and the like, training and development efforts produce fuzzier numbers, and certainly not numbers that are comparable from company to company. Reviews of the research literature on training evaluation have found very low levels of correlation (usually below .20) between different types of learning assessments (e.g., Alliger, Tannenbaum, Bennett, Traver, & Shotland, 1997; Sitzmann, Brown, Casper, Ely, & Zimmerman, 2008).

Of course, we shouldn’t dismiss all benchmarking efforts. Rigorous benchmarking efforts that are understood with a clear perspective can have value. Idea-generation brainstorming is probably more viable than a focus on comparison. By looking to other companies’ practices, we can gain insights and consider new ideas. Of course, we will want to be careful not to move toward the mediocre average instead of looking to excel.

The bottom line on benchmarking from other companies is: be careful, be willing to spend lots of time and money, and don’t rely on cross-company comparisons as your only indicator.

Finally, any results generated by brainstorming with other companies should be carefully considered and pilot-tested before too much investment is made.


Smile Sheet Issues

Both of the meta-analyses cited above found that smile sheets correlated with learning at only about r = 0.09, which is virtually no correlation at all. I have detailed smile-sheet design problems in my book, Performance-Focused Smile Sheets: A Radical Rethinking of a Dangerous Art Form. In short, most smile sheets focus on learner satisfaction and fail to focus on factors related to actual learning effectiveness. Most smile sheets utilize Likert-like scales or numeric scales that offer learners very little granularity between answer choices, opening up responding to bias, fatigue, and disinterest. Finally, most learners have fundamental misunderstandings about their own learning (Brown, Roediger, & McDaniel, 2014; Kirschner & van Merriënboer, 2013), so asking them general questions about their perceptions is too often a dubious undertaking.
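To put those correlation figures in perspective, squaring a correlation gives the proportion of variance two measures share. A minimal sketch (the only inputs are the r values cited above):

```python
# Convert reported correlations into shared variance (r-squared),
# which shows how little smile-sheet ratings tell us about learning.

def shared_variance_pct(r: float) -> float:
    """Proportion of variance explained, as a percentage (r^2 * 100)."""
    return r * r * 100

# r = 0.09: smile sheets vs. learning, per the meta-analyses cited above
# r = 0.20: rough upper bound among assessment types, per the reviews above
for r in (0.09, 0.20):
    print(f"r = {r:.2f} -> shares {shared_variance_pct(r):.2f}% of variance")
# r = 0.09 -> shares 0.81% of variance
# r = 0.20 -> shares 4.00% of variance
```

Even the .20 upper bound explains only 4% of the variance, which underscores why benchmarking one company’s smile-sheet numbers against another’s carries so little signal about learning effectiveness.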

The bottom line is that traditional smile sheets are providing almost everyone with meaningless data in terms of learning effectiveness. When we benchmark our smile sheets against other companies’ smile sheets we compound our problems.


Wisdom from Earlier Comments

Ryan Watkins, researcher and industry guru, wrote:

I would add to this argument that other companies are no more static than our own — thus if we implement in September 2011 what they are doing in March 2011 from our benchmarking study, then we are still behind the competition. They are continually changing and benchmarking will rarely help you get ahead. Just think of all the companies that tried to benchmark the iPod, only to later learn that Apple had moved on to the iPhone while the others were trying to “benchmark” what they were doing with the iPod. The competition may have made some money, but Apple continues to win the major market share.

Mike Kunkle, sales training and performance expert, wrote:

Having used benchmarking (carefully and prudently) with good success, I can’t agree with avoiding it, as your title suggests, but do agree with the majority of your cautions and your perspectives later in the post.

Nuance and context matter greatly, as do picking the right metrics to compare, and culture, which is harder to assess. 70/20/10 performance management somehow worked at GE under Welch’s leadership. I’ve seen it fail miserably at other companies and wouldn’t recommend it as a general approach to good people or performance management.

In the sales performance arena, at least, benchmarking against similar companies or competitors does provide real benefit, especially in decision-making about which solutions might yield the best improvement. Comparing your metrics to world-class competitors and calculating what it would mean to you to move in that direction, allows for focus and prioritization, in a sea of choices.

It becomes even more interesting when you can benchmark internally, though. I’ve always loved this series of examples by Sales Benchmark Index:



Alliger, G. M., Tannenbaum, S. I., Bennett, W., Jr., Traver, H., & Shotland, A. (1997). A meta-analysis of the relations among training criteria. Personnel Psychology, 50, 341-357.

Brown, P. C., Roediger, H. L., III, & McDaniel, M. A. (2014). Make It Stick: The Science of Successful Learning. Cambridge, MA: Belknap Press of Harvard University Press.

Kirschner, P. A., & van Merriënboer, J. J. G. (2013). Do learners really know best? Urban legends in education. Educational Psychologist, 48(3), 169–183.

Sitzmann, T., Brown, K. G., Casper, W. J., Ely, K., & Zimmerman, R. D. (2008). A review and meta-analysis of the nomological network of trainee reactions. Journal of Applied Psychology, 93, 280-295.

OMG! The best deal ever for a full-day workshop on how to radically improve your smile-sheet designs! Sponsored by the Hampton Roads Chapter of ISPI. Free book and subscription-learning thread too!


Friday, June 10, 2016

Reed Integration

7007 Harbour View Blvd #117

Suffolk, VA


Click here to register now…


Performance Objectives:

By completing this workshop and the after-course subscription-learning thread, you will know how to:

  1. Avoid the three most troublesome biases in measuring learning.

  2. Persuade your stakeholders to improve your organization’s smile sheets.

  3. Create more effective smile sheet questions.

  4. Create evaluation standards for each question to avoid bias.

  5. Envision learning measurement as a bulwark for improved learning design.


Recommended Audience:

The content of this workshop will be suitable to those who have at least some background and experience in the training field. It will be especially valuable to those who are responsible for learning evaluation or who manage the learning function.



This is a full-day workshop. Participants are encouraged to bring laptops if they prefer to use a computer to write their questions.  


Bonus Take-Away:

Each participant will receive a copy of Dr. Thalheimer’s book, Performance-Focused Smile Sheets: A Radical Rethinking of a Dangerous Art Form.

In a recent research article, Tobias Wolbring and Patrick Riordan report the results of a study on the effects of instructor “beauty” on college course evaluations. What they found might surprise you, or worry you, depending on your views on the vagaries of fairness in life.

Before I reveal the results, let me say that this is one study (two experiments), and that the findings were very weak in the sense that the effects were small.

Their first study used a large data set involving university students. Given that the data was previously collected through routine evaluation procedures, the researchers could not be sure of the quality of the actual teaching, nor the true “beauty” of the instructors (they had to rely on online images).

The second study was a laboratory study where they could precisely vary the level of beauty of the instructor and their gender, while keeping the actual instructional materials consistent. Unfortunately, “the instruction” consisted of an 11-minute audio lecture taught by relatively young instructors (young adults), so it’s not clear whether their results would generalize to more realistic instructional situations.

In both studies, beauty was operationalized as facial beauty. While previous research shows that facial beauty is the primary way we judge each other’s attractiveness, body beauty has also been found to have effects.

Their most compelling results:


They found that ratings of attractiveness are very consistent across raters. People seem to agree on who is attractive and who is not. This confirms the findings of many studies.


More attractive instructors get better smile-sheet ratings, although the effect was very small in both experiments. This confirmed what many other studies have found, though the results here were generally weaker than in previous studies, probably due to the better controls utilized.


They found that better-looking instructors engender less absenteeism. That is, students were more likely to show up for class when their instructor was attractive.


They found that the genders of the raters and instructors did not make a difference. It was hypothesized that female raters might respond differently to male and female instructors, and vice versa, but this was not found. Previous studies have shown mixed results.


In the second experiment, where they actually gave learners a test of what they’d learned, attractive instructors engendered higher scores on a difficult test, but not an easy test. The researchers hypothesize that learners engage more fully when their instructors are attractive.


In the second experiment, they asked learners either to (a) take a test first and then evaluate the course, or (b) do the evaluation first and then take the test. Did it matter? Yes! The researchers hypothesized that highly attractive instructors would be penalized more than their unattractive colleagues for giving a hard test. This prediction was confirmed. When the difficult test came before the evaluation, better-looking instructors were rated more poorly than less attractive instructors. Not much difference was found for the easy test.

Ramifications for Learning Professionals

First, let me caveat these thoughts with the reminder that this is just one study! Second, the study’s effects were relatively weak. Third, their results — even if valid — might not be relevant to your learners, your instructors, your organization, your situation, et cetera!

  1. If you’re a trainer, instructor, teacher, professor — get beautiful! Obviously, you can’t change your bone structure or symmetry, but you can do some things to make yourself more attractive. I drink raw spinach smoothies and climb telephone poles with my bare hands to strengthen my shoulders and give me that upside-down triangle attractiveness, while wearing the most expensive suits I can afford — $199 at Men’s Wearhouse; all with the purpose of pushing myself above the threshold of … I can’t even say the word. You’ll have to find what works for you.
  2. If you refuse to sell your soul or put in time at the gym, you can always become a behind-the-scenes instructional designer or a research translator. As Clint said, “A man’s got to know his limitations.”
  3. Okay, I’ll be serious. We shouldn’t discount attractiveness entirely. It may make a small difference. On the other hand, we have more important, more leverageable actions we can take. I like the research-based finding that we all get judged primarily on two dimensions: warmth/trust and competence. Be personable, authentically trustworthy, and work hard to do good work.
  4. The finding from the second experiment that better looking instructors might prompt more engagement and more learning — that I find intriguing. It may suggest, more generally, that the likability/attractiveness of our instructors or elearning narrators may be important in keeping our learners engaged. The research isn’t a slam dunk, but it may be suggestive.
  5. In terms of learning measurement, the results may suggest that evaluations should come before difficult performance tests. I don’t know, though, how this relates to adults in workplace learning. They might be more thankful for instructional rigor if it helps them perform better in their jobs.
  6. More research is needed!

Research Reviewed

Wolbring, T., & Riordan, P. (2016). How beauty works. Theoretical mechanisms and two empirical applications on students’ evaluation of teaching. Social Science Research, 57, 253-272.

As a youth soccer coach for many years I have struggled to evaluate my own players and have seen how my soccer league evaluates players to place them on teams. As a professional learning-and-performance consultant who has focused extensively on measurement and evaluation, I think we can all do better, me included. To this end, I have spent the last two years creating a series of evaluation tools for use by coaches and youth soccer leagues. I’m sure these forms are not perfect, but I’m absolutely positive that they will be a huge improvement over the typical forms utilized by most youth soccer organizations. I have developed the forms so that they can be modified and they are made available for free to anyone who coaches youth soccer.

At the bottom of this post, I’ll include a list of the most common mistakes that are made in youth-soccer evaluation. For my regular blog readers–those who come to me for research-based recommendations on workplace learning-and-performance–you’ll see relevance to your own work in this list of evaluation mistakes.

I have developed four separate forms for evaluation. That may seem like a lot until you see how they will help you as a coach (and as a soccer league) meet varied goals you have. I will provide each form as a PDF (so you can see what the form is supposed to look like regardless of your computer configuration) and as a Word Document (so you can make changes if you like).

I’ve also provided a short set of instructions.


Note from Will in November 2017:

Although my work is in the workplace learning field, this blog post–from 2012–is one of the most popular posts on my blog, often receiving over 2000 unique visitors per year.


The Forms

1. Player Ranking Form: This form evaluates players on 26 soccer competencies and 4 player-comparison items, giving each player a numerical score based on these items AND an overall rating. It is intended to provide leagues with ranking information so that they can better place players on teams for the upcoming season.

2. Player Development Form: This form evaluates players on the 26 soccer competencies. It is intended for use by coaches to help support their players’ development. Indeed, it can be shared with players and parents to help players focus on their development needs.

3. Team Evaluation Form: This form helps coaches use practices and games to evaluate their players on the 26 key competencies. Specifically, it enables them to use one two-page form to evaluate every player on their team.

4. Field Evaluation Form: This form enables skilled evaluators to judge the performance of players during small-group scrimmages. Like the Player Ranking Form, it provides player-comparison information to leagues (or to soccer clubs).

The Most Common Mistakes in Youth-Soccer Evaluation

  1. When skills evaluated are not clear to evaluators. So for example, having players rated on their “agility” will not provide good data because “agility” will likely mean different things to different people.
  2. When skills are evaluated along too many dimensions. So for example, evaluating a player on their “ball-handling skills, speed, and stamina” covers too many dimensions at once—a player could have excellent ball-handling skills but have terrible stamina.
  3. When the rating scales that evaluators are asked to use make it hard to select between different levels of competence. So for example, while “ball-handling” might reasonably be evaluated, it may be hard for an evaluator to determine whether a player is excellent, very good, average, fair, or poor in ball-handling. Generally, it is better to have clear criteria and ask whether or not a player meets those criteria. Four- or five-point scales are not recommended.
  4. When evaluators can’t assess skills because of the speed of action, the large number of players involved, or the difficulty of noticing the skills targeted. For example, evaluations of scrimmages that involve more than four players on a side make it extremely difficult for the evaluators to notice the contributions of each player.
  5. When bias affects evaluators’ judgments. Because the human mind is always working subconsciously, biases can be easily introduced. So for example, it is bad practice to give evaluators the coaches’ ratings of players before those players take part in a scrimmage-based evaluation.
  6. When bias leads to a generalized positive or negative evaluation. Because evaluation is difficult and is largely a subconscious process, a first impression can skew an evaluation away from what is valid. For example, when a player is seen as getting outplayed in the first few minutes of a scrimmage, his/her later excellent play may be ignored or downplayed. Similarly, when a player is intimidated early in the season, a coach may not fully notice his/her gritty determination later in the year.
  7. When bias comes from too few observations. Because evaluation is an inexact process, evaluation results are likely to be more valid if the evaluation utilizes (a) more observations (b) by more evaluators (c) focusing on more varied soccer situations. Coaches who see their players over time and in many soccer situations are less likely to suffer from bias, although they too have to watch out that their first impressions don’t cloud their judgments. And of course, it is helpful to get assessments beyond one or two coaches.
  8. When players are either paired with, or are playing against, players who are unrepresentative of realistic competition. For example, players who are paired against really weak players may look strong in comparison. Players who are paired as teammates with really good players may look strong because of their teammates’ strong play. Finally, players who only have experience playing weaker players may not play well when being evaluated against stronger players even though they might be expected to improve by moving up and gaining experience with those same players.
  9. When the wrong things are evaluated. Obviously, it’s critical to evaluate the right soccer skills. So for example, evaluating a player on how well he/she can pass to a stationary player is not as valid as seeing whether good passes are made in realistic game-like situations when players are moving around. The more game-like the situations, the better the evaluation.
  10. When evaluations are done by remembering, not observing. Many coaches fill out their evaluation forms back home late at night instead of evaluating their players while observing them. The problem with this memory-based approach is that it introduces huge biases into the process. First, memory is not perfect, so evaluators may not remember correctly. Second, memory is selective. We remember some things and forget others. Players must be evaluated primarily through observation, not memory.
  11. Encouraging players to compare themselves to others. As coaches, one of our main goals is to help our players learn to develop their skills as players, as teammates, as people, and as thinkers. Unfortunately, when players focus on how well they are doing in comparison to others, they are less likely to focus on their own skill development. It is generally a mistake to use evaluations to encourage players to compare themselves to others. While players may be inclined to compare themselves to others, coaches can limit the negative effects of this by having each player focus on their own key competencies to improve.
  12. Encouraging players to focus on how good they are overall, instead of having them focus on what they are good at and what they still have to work on. For our players to get better, they have to put effort into getting better. If they believe their skills are fixed and not easily changed, they will have no motivation to put any effort into their own improvement. Evaluations should be designed NOT to put kids in categories (except when absolutely necessary for team assignments and the like), but rather to show them what they need to work on to get better. As coaches, we should teach the importance of giving effort to deliberate practice, encouraging our players to refine and speed their best skills and improve on their weakest skills.
  13. Encouraging players to focus on too many improvements at once. To help our players (a) avoid frustration, (b) avoid thinking of themselves as poor players, and (c) avoid overwhelming their ability to focus, we ought to have them only focus on a few major self-improvement goals at one time.
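Mistake 7 above rests on a basic statistical fact: averaging independent observations shrinks random error roughly in proportion to 1/sqrt(n). The simulation below is purely illustrative (the player, the rating scale, and the noise level are invented for the sketch), not part of any evaluation form:

```python
# Illustrative simulation: averaging more independent observations of the
# same player drives the evaluation closer to the player's true skill.
import random
import statistics

random.seed(42)
TRUE_SKILL = 7.0  # hypothetical player's true rating on a 1-10 scale

def noisy_rating() -> float:
    """One evaluator's single observation, off by random noise (sd = 1.5)."""
    return TRUE_SKILL + random.gauss(0, 1.5)

for n in (1, 4, 16):
    # mean absolute error of the averaged rating, across many simulated runs
    errors = [
        abs(statistics.mean(noisy_rating() for _ in range(n)) - TRUE_SKILL)
        for _ in range(2000)
    ]
    print(f"n = {n:2d} observations -> mean abs error {statistics.mean(errors):.2f}")
```

Running this shows the error falling steadily as n grows, which is the arithmetic behind combining more observations, more evaluators, and more varied soccer situations.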


Social media is hot, but it is not clear how well we are measuring it.

A couple of years ago I wrote an article for the eLearning Guild about measuring social media. But it's not clear that we've got this nailed yet.

With this worry in mind, I've created a research survey to begin exploring how social media (of the kind we might use to bolster workplace learning-and-performance) can best be measured.

Here's the survey link. Please take the survey yourself. You don't have to be an expert to take it.

Here's my thinking so far on this. Please send wisdom if I've missed something.

  1. We can think about measuring social media the same way we measure any learning intervention.
  2. We can also create a list of all the proposed benefits for social media, and the proposed costs, and all the proposed harms, and we can see how people are measuring these now. The survey will help us with this second approach.

Note: Survey results will be made available for free. If you take the survey, you'll get early releases of the survey results and recommendations.

Also, this is not the kind of survey that needs good representative sampling, so feel free to share this far and wide.

Here is the direct link to the survey:   http://tinyurl.com/4tlslol

Here is the direct link to this blog post:   http://tinyurl.com/465ekpa

Last year I wrote at length about my efforts to improve my own smile sheets. It turns out that this is an evolving effort as I continue to learn from my learners and my experience.

Check out my new 2009 version.

You may remember that one of the major improvements in my smile sheet was to ask learners about the value and newness of EACH CONCEPT TAUGHT (or at least each MAJOR concept). This is beneficial because people respond more accurately to specifics than to generalities; they respond better to concrete learning points than to the vague semblance of a full learning experience.

What I forgot in my previous version was the importance of getting specific feedback on how well I taught each concept. Doh!

My latest version adds a column for how well each concept was taught. There is absolutely no more room to add any columns (I didn't think I could fit this latest one in), so I suppose we've reached diminishing returns on further improvements.

Check it out and let me know what you think.

Update May 2014

Sarah Boehle wrote an article that included Neil Rackham's famous story on the dangers of measuring training only with smile sheets. The story used to be available from Training Magazine directly, but after some earlier disruptions and recoveries at Training Magazine, their digital archive was reconstituted and currently only goes back to 2007.

Fortunately, you can read the article here.

Donald Kirkpatrick Answers Questions.