When Machine Learning Uses Data Unrepresentative of the Domain

, , ,

I read a brilliantly clear article today by Karen Hao from the MIT Technology Review. It explains what machine learning is and provides a very clear diagram, which I really like.

Now, I am not a machine learning expert, but I have a hypothesis that has a ton of face validity when I look in the mirror. My hypothesis is this:

Machine learning will return meaningful results to the extent that the data it uses is representative of the domain of interest.

A simple thought experiment will demonstrate my point. If a learning machine is given data about professional baseball in the United States from 1890 to 2000, it would learn all kinds of things, including the benefits of pulling the ball as a batter. Pulling the ball occurs when a right-handed batter hits the ball to left field or a left-handed batter hits the ball to right field. In the long history of baseball, many hitters benefited by trying to pull the ball because it produces a more natural swing and one that generates more power. Starting in the 2000s, with the advent of advanced analytics that show where each player is likely to hit the ball, a maneuver called “the shift” has been used more and more, and pulling the ball consistently has become a disadvantage. In the shift, players in the field migrate to positions where the batter is most likely to hit the ball, thus negating the power benefits of pulling the ball. Our learning machine would not know about the decreased benefits of pulling the ball because it would never have seen that data (the data from 2000 to now).

Machine Learning about Learning

I raise this point because of the creeping danger in the world of learning and education. My concern is relevant to all domains where it is difficult to collect data on the most meaningful factors and outcomes, but where it is easy to collect data on less meaningful factors and outcomes. In such cases, our learning machines will only have access to the data that is easy to collect and will not have access to the data that is difficult or impossible to collect. People using machine learning on inadequate data sets will certainly find some interesting relationships in the data, but they will have no way of knowing what they’re missing. The worst part is that they’ll report out some fanciful finding, we’ll all jump up and down in excitement and then make bad decisions based on the bad learning caused by the incomplete data.

In the learning field—where trainers, instructional designers, elearning developers, and teachers reside—we have learned a great deal about research-based methods of improving learning results, but we don’t know everything. And, many of the factors which we know work are not tracked in most big data sets. Do we track the spacing effect, the number of concepts repeated with attention-grabbing variation, the alignment between context cues present in learning materials compared with the cues that will be present in our learners’ future performance situations? Ha! Our large data sets certainly miss many of these causal factors.

Our large data sets also fail to capture the most important outcomes metrics. Indeed, as I have been regularly recounting for years now, typical learning measurements are often biased by measuring immediately at the end of learning (before memories fade), by measuring in the learning context (where contextual cues offer inauthentic hints or subconscious triggering of recall targets), and by measuring with tests of low-level knowledge (compared to more relevant skill-focused decision-making or task performances). We also overwhelmingly rely on learner feedback surveys, both in workplace learning and in higher education. Learner surveys—at least traditional ones—have been found virtually uncorrelated with learning results. To use these meaningless metrics as a primary dependent variable (or just a variable) in a machine-learning data set is complete malpractice.

So if our machine learning data sets have a poor handle on both the inputs and outputs to learning, how can we see machine learning interpretations of learning data as anything but a shiny new alchemy?


Measurement Illuminates Some Things But Leaves Others Hidden

In my learning-evaluation workshops, I often show this image.

The theme expressed in the picture is relevant to all types of evaluation, but it is especially relevant for machine learning.

When we review our smile-sheet data, we should not fool ourselves into thinking that we have learned the truth about the success of our learning. When we see a beautiful data-visualized dashboard, we should not deceive ourselves and our organizations that what we see is all there is to see.

So it is with machine learning, especially in domains where the data is not all the data, where the data flawed, and where the boundaries on the full population of domain data are not known.


With Apologies to Karen Hao

I don’t know Karen, but I do love her diagram. It’s clear and makes some very cogent points—as does her accompanying article.

Here is her diagram, which you can see in the original at this URL.

Like measurement itself, I think the diagram illuminates some aspects of machine learning but fails to illuminate the danger of incomplete or unrepresentative data sets. So, I made a modification in the flow chart.

And yes, that seven-letter provocation is a new machine-learning term that arises from the data as I see it.

Corrective Feedback Welcome

As I said to start this invective, my hypothesis about machine learning and data is just that—a semi-educated hypothesis that deserves a review from people more knowledgeable than me about machine learning. So, what do you think machine learning gurus?


Karen Hao Responds

I’m so delighted! One day after I posted this, Karen Hao responded: