The Busuu Placement Test, Part II: how do you measure task difficulty and learner ability at the same time?

Stanisław Pstrokoński
Published in Busuu Tech · 12 min read · Jun 5, 2023


The Busuu Placement Test uses Computerised Adaptive Testing (as explained in a previous article) based on a machine learning model known as Item Response Theory (IRT). Our improved Placement Test led to a better user experience and an increase in conversion in our apps.

In this article, we will dig into some deep issues in educational measurement and psychometrics, through the medium of skateboarding (yes, I am down with the kids), in order to understand how IRT helps us solve them.

(Note that, from here onwards, I will be using the term item synonymously with question, exercise, or task, following the standard jargon of the field.)

Measuring task difficulty

One of the basic ideas of Item Response Theory is that an item’s difficulty is not equivalent to its pass rate.

Suppose you wanted to measure how difficult an item (question, exercise, task) is. How would you do this? The most intuitive answer might be to use its pass rate — if item A has a 90% pass rate and item B has an 80% pass rate, then item A is easier than item B.

But now consider an example. Suppose you are assessing skateboarding skills. Check out these two lads rocking out on their skateboards.

Source — Getty Images.
Source — Getty Images.

At a skate park, you might see people like the one in the upper picture trying to kick-flip down a staircase, and managing to land without falling 80% of the time. And with an absolute beginner, you might see them doing something like the lower picture: just trying to skate in a straight line on a flat surface, but nevertheless only managing to avoid falling off 70% of the time.

Does this mean that kick-flips (80% “pass rate”) are easier than skating on a flat surface (70% “pass rate”)? Of course not! Clearly, the real reason for these numbers is that the kick-flippers are better skateboarders, and the weaker skateboarders have effectively self-selected away from doing that task. So it’s a difference in the test takers that is fooling us about the nature of the test items.

The table below is another illustration of this issue. (The % pass rates are just for argument’s sake.) Notice how the “hard question” and “easy question” can end up with the same pass rate, depending on who answers them.
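To make the self-selection effect concrete, here is a toy simulation in Python. The success probabilities are invented for illustration and have nothing to do with real Busuu (or skate park) data; the point is only that the observed pass rate reflects who attempts the task as much as the task itself.

```python
import random

random.seed(0)

# Invented success probabilities: P(success | task, skater group).
# Only experts attempt kick-flips; only beginners do flat-ground runs.
success_prob = {
    ("kick-flip", "expert"): 0.80,
    ("flat ground", "beginner"): 0.70,
}

def observed_pass_rate(task, group, attempts=10_000):
    """Pass rate we would actually record, given who self-selects into the task."""
    p = success_prob[(task, group)]
    return sum(random.random() < p for _ in range(attempts)) / attempts

print(observed_pass_rate("kick-flip", "expert"))      # ~0.80
print(observed_pass_rate("flat ground", "beginner"))  # ~0.70
# The harder task shows the higher pass rate, because the populations differ.
```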

Workaround 1: random sampling?

Another way of framing the issue described above is that we don’t have a balanced sample of users to make a fair measurement of the activities’ difficulties.

In principle, if we kept giving learners items at random, then over time we should eventually get a balanced sample of learners answering each item.

However, there are two issues that immediately jump out of that statement:

…if we kept giving learners items at random…

This would defeat the entire object of personalisation, which is to tailor the choice of question to the individual, i.e. the questions delivered to each person would most certainly not be random. In skateboarding terms, such randomness (lack of personalisation) would mean making all the beginners try to do kick-flips right away, as well as boring the elite skaters with lots of straight-line flat-ground skating, just like the absolute beginners.

over time we should eventually get a balanced sample…

It would be a shame to have to wait until we had a lot of data to do something with it. Thankfully, IRT gives us the possibility of making the most of our data from the very start.

Workaround 2: stratified sampling within our dataset?

Another possibility: why don’t we just use a subset of our data which represents a good cross-section of users? There are some issues with this also.

  1. First of all, how would we be able to tell ahead of time what the ability level is of different users? You can’t tell who’s a beginner or advanced skateboarder (/ reader / language learner / maths student etc. etc.) based on just looking up their ID in a database. But that also means that it’s hard to stratify the population.
  2. Secondly, for some questions we may never get answers from certain parts of the population, since e.g. very easy questions are unlikely to be something that we’d like to ask advanced learners. This is the point I made above about absolute beginners not practising kick-flips.
  3. Finally, it’s a shame to not use all of our data, and it would be nice to use it no matter how much or little we have, or who we have it from.

Detecting “faulty” questions

Another important issue that IRT helps us with is in detecting “faulty” items — cases where something is wrong with a question, and we can detect it through the statistical analysis that IRT implicitly provides.

What would you make of questions that have the following distributions of correct answers?

Item C is the most sensible-looking of the three. Advanced learners are getting a significantly higher pass rate than beginners, as expected.

Item A might seem impossible — how is it that a beginner could get a higher pass rate than an advanced learner? Nevertheless, it does happen. The most likely reason is that an incorrect answer has been marked as correct due to human error (or a bug in content autogeneration, such as an LLM “hallucination”) while the content was being written. In this specific case, it could be a True/False question, with beginners tending to choose the wrong option slightly more often but being marked correct, whereas 80% of advanced learners choose the correct answer but are told that they got it wrong.

A true-false question from Busuu’s Spanish course. The correct answer should be “True” here, but what if, by human error, the database says that the correct answer is “False”?

Item B seems highly suspicious also — how could there be such a small difference between beginners and advanced learners? There are two possibilities here. One is that the question might inadvertently test something else. For example, a question might mainly rely on knowledge of geography of the country where the language is spoken, but it’s possible to have beginner language learners who know geography well and advanced learners who don’t. Another option is that the question has ambiguous wording, leading some learners to misinterpret the task.

An example of this that we discovered was a question asking “how many rooms does this apartment have?”. The meaning of “room” in this context is culture-specific: a “three-room apartment” in some cultures means an apartment with three bedrooms, in others one with three rooms excluding the kitchen and bathroom, and in others one with three rooms including the kitchen and bathroom.
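As a rough illustration (not the approach Busuu actually uses, and with invented pass rates), you could imagine a crude pre-IRT sanity check that compares pass rates between learners you already believe to be beginners or advanced — though, as noted in the stratified-sampling discussion above, knowing those ability bands up front is precisely the hard part:

```python
# Crude pre-IRT sanity check: compare pass rates of (roughly known) beginner
# vs advanced learners per item. All numbers and thresholds below are invented.
pass_rates = {
    # item: (beginner pass rate, advanced pass rate)
    "Item A": (0.80, 0.20),  # beginners outperform advanced learners: suspicious
    "Item B": (0.55, 0.60),  # barely any difference: suspicious
    "Item C": (0.30, 0.85),  # advanced learners do much better: looks healthy
}

for item, (beginner, advanced) in pass_rates.items():
    gap = advanced - beginner
    if gap < 0:
        verdict = "likely mis-keyed (wrong answer marked as correct?)"
    elif gap < 0.15:
        verdict = "low discrimination (testing something else, or ambiguous wording?)"
    else:
        verdict = "looks OK"
    print(f"{item}: beginner {beginner:.0%}, advanced {advanced:.0%} -> {verdict}")
```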

Thankfully, IRT has ways of handling cases like Items A and B above, as we shall see.

The Fundamental Duality of Educational Materials

As you might see from the above, the reason why IRT is a useful tool for learning engineering is based on a rather deep, central issue with any educational material, which I call the Fundamental Duality of Educational Materials, because I like to sound clever.

(I’ve literally never heard any name for this idea in the literature, so I made one up. Nevertheless, insiders seem to refer to it implicitly all the time.)

The issue is that any time somebody is learning from educational materials, it is fundamentally an interaction, which means there are multiple possible sources of causation for anything: it could be something about the materials, something about the user, or some peculiarity of the way the user and the materials interact. Physics enthusiasts might be reminded of the peculiar nature of quantum physics, where the observer actually becomes an important feature of the experiment itself.

When you think about it, this is actually true of everything in science. When a physicist studies light, they study how light interacts with matter of different kinds. Even seeing light with your eyes is a type of interaction of the matter in your eye with the light. The reason that it’s foregrounded in this situation in particular is because normally in science we can control our variables — we make sure only to vary one thing at a time, to study how e.g. the wavelength of light affects refraction angle; but in our situation, we have multiple variables varying at once in a way we can’t control, so we need to come up with a way in which to draw inferences from this potentially confusing mess.

Another way of thinking about this is that fundamentally we are engaging in psychometrics, but our content is our “measuring tape” that we’re using to measure people’s latent psychological features (like language ability). The trouble is, what if you don’t know how long your measuring tape is, or if it has no markings? That is exactly the issue we are facing in creating these sorts of assessments — you need to “measure” your measuring tape at the same time as you’re measuring your target.

We see a man being angry on the phone. Is he having to deal with something that would enrage anyone? Or is he just a particularly angry person? It’s the same dilemma. Image source.

So when we see the outcome of a user trying an exercise, there are many causal factors that could be behind that performance, and it’s tricky to figure out which ones are responsible…

Trying to figure out which one of these is the true reason is a real head-scratcher.

The power of IRT — a brief introduction

Item Response Theory (IRT) started in the 1960s as an improvement on Classical Test Theory (CTT). While CTT focused on the test and its properties, IRT focused on the item (i.e. the question, task, or exercise). It’s now one of the most developed areas of psychometrics, as attested by the three-volume Handbook.

I never leave the house without it.

In a future post, I will go more into the mathematics of and training procedures for IRT models for all you data scientist nerds out there. But here I will limit myself to the high-level ideas which are driving IRT and its usefulness.

In IRT, each item has four properties, which combine into a single curve per item (see the sketch in code after this list):

  1. difficulty — how hard this item is to answer/perform correctly
  2. discrimination — a measure of how well the item distinguishes between learners of different ability levels
  3. guessing rate — the probability that a user who doesn’t know the answer will guess it correctly
  4. slipping rate — the probability that a user who knows the answer will make a mistake or slip up, and answer the item incorrectly
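A common way to combine these four parameters is the four-parameter logistic (4PL) model. The sketch below is a generic textbook form of that function, not necessarily the exact parameterisation used in Busuu’s model; the sections that follow unpack what each parameter does to the curve.

```python
import math

def item_response_probability(ability, difficulty, discrimination,
                              guessing=0.0, slipping=0.0):
    """Generic 4PL item response function: P(correct | ability, item parameters).

    The guessing rate sets the curve's floor (lucky guesses at very low ability);
    the slipping rate lowers its ceiling (mistakes even at very high ability).
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - slipping - guessing) * logistic
```

When ability equals difficulty, the logistic term is 0.5, so the probability sits exactly halfway between the guessing floor and the (1 - slipping) ceiling.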

Difficulty

With these four parameters, each item gets its own graph, called an item response curve. The y-axis represents probability of answering a question correctly, and the x-axis represents the ability level of a learner trying to answer it.

y-axis: probability of correct answer. x-axis: user ability.

As you move from left to right on this graph, you see how likely a person would be to answer the question correctly. So e.g. a user with an ability score…:

  • of 1.0 would have a 50% chance of getting this question right;
  • of 1.1 would have an almost 70% chance;
  • of 0.5 would have a roughly 3% chance; and
  • of 2 would have an over 99% chance of getting it right.

Note that this means that the pass rate of a test question is not really defined until you know who is answering it. That’s because the pass rate is equivalent to the y-values, but you don’t know the y-value until you pass in the x-value (of user ability).
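For illustration, the curve described above is roughly reproduced by a simple logistic (2PL-style) item with a difficulty around 1.0 and a fairly steep discrimination of about 7; these values are eyeballed from the plot, not Busuu’s actual fitted parameters.

```python
import math

def p_correct(ability, difficulty=1.0, discrimination=7.0):
    """Logistic curve with parameters eyeballed from the plot above (illustrative only)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

for ability in (1.0, 1.1, 0.5, 2.0):
    print(f"ability {ability}: {p_correct(ability):.1%} chance of a correct answer")
# ability 1.0: 50.0%, ability 1.1: ~66.8%, ability 0.5: ~2.9%, ability 2.0: ~99.9%
```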

Now take a look at these two curves, each representing a different test question:

A user with an ability score of 1.5 would have a ~97% chance of answering the red item correctly, but a ~3% chance of answering the black item correctly. This means that the black item is more difficult.

Now imagine that we have a set of users who are at ability level 2.0 who answer the black item, and a set at 1.5 who answer the red item. Even though the black item is harder, the pass rate of the black item will be higher because the people answering the question are of higher ability.

So the power of this approach is that difficulty and pass rate have been separated into two different things. Pass rate is equivalent to the y-axis of these curves; but exactly where you should read the pass rate from depends on the ability of the learner (which is the x-axis).
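Here is a quick back-of-the-envelope version of that scenario, with item parameters invented so the curves roughly match the description above (again, not real fitted values):

```python
import math

def p_correct(ability, difficulty, discrimination):
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

red   = dict(difficulty=1.0, discrimination=7)    # easier item
black = dict(difficulty=1.7, discrimination=18)   # harder item (steeper, further right)

# Expected pass rates when different crowds answer different items:
print(f"red item, ability-1.5 users:   {p_correct(1.5, **red):.1%}")    # ~97%
print(f"black item, ability-2.0 users: {p_correct(2.0, **black):.1%}")  # ~99.5%
# The harder (black) item ends up with the higher observed pass rate.
```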

Discrimination

Check out these two super weird exercises.

The red one is backwards. What? What does that even mean?

If you follow the probability values, you’ll see that as a user gets higher in ability, they have a lower chance of answering this red item correctly.

What about the black one? Even though it’s the right way round, it grows so slowly. This exercise has only a very slight increase in pass rate as one increases in ability.

Now take another look at our table from before:

Remember me? ;)

Can you see the connection? That’s right, Item A, the “broken” one, is represented by the red curve; and Item B, the suspicious-looking one, is the black curve.

IRT encodes this shape with its discrimination parameter, which is a kind of slope property. The red item has a negative discrimination and the black item has a very low discrimination — something that is easy to read from the data after you train the IRT model, to feed back to your content team to help them catch and fix these sorts of issues.
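The same logistic sketch makes the two odd shapes easy to reproduce; the parameters below are invented purely to mimic the curves described above.

```python
import math

def p_correct(ability, difficulty, discrimination):
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Invented parameters mimicking the two curves:
red_item   = dict(difficulty=0.0, discrimination=-5.0)  # negative slope: "broken"
black_item = dict(difficulty=0.0, discrimination=0.3)   # near-flat slope: suspicious

for ability in (-2, 0, 2):
    print(f"ability {ability:+}: red {p_correct(ability, **red_item):.2f}, "
          f"black {p_correct(ability, **black_item):.2f}")
# As ability rises, the red item's pass probability falls,
# while the black item's barely moves.
```

In practice, once the model is trained, items with negative or very low fitted discrimination can be flagged automatically for the content team to review.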

Guessing and Slipping

What about if your exercises look like this?

The red item doesn’t quite reach a y-value of 1, and the black item starts from 0.6 instead of from 0. What does that mean?

This means that even users at a very low ability level are still getting the black item right 60% of the time. This is called a 60% guessing rate. On the other hand, even very high ability users only get the red item correct 90% of the time. This is called a 10% slipping rate.

The red item is understandable, because it’s quite common for people to have a mouse slip or not be fully paying attention when doing something on an app. It’s actually an improvement in the model if you can include that.

The black item’s shape is odd, though. You might expect a 50% guessing rate for a True/False question, or a 33% guessing rate for a multiple choice question with two distractors, but a 60% guessing rate? Something fishy is clearly going on there.
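Extending the same sketch with the guessing and slipping parameters reproduces the two asymptotes; as before, the numbers are invented to match the curves described, not real fitted values.

```python
import math

def p_correct(ability, difficulty, discrimination, guessing=0.0, slipping=0.0):
    # 4PL curve: guessing sets the floor, slipping lowers the ceiling.
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - slipping - guessing) * logistic

red_item   = dict(difficulty=0.0, discrimination=3.0, slipping=0.10)  # tops out at ~90%
black_item = dict(difficulty=0.0, discrimination=3.0, guessing=0.60)  # floor of ~60%

print(f"very high ability, red item:  {p_correct(4.0, **red_item):.0%}")    # ~90%
print(f"very low ability, black item: {p_correct(-4.0, **black_item):.0%}")  # ~60%
# A 60% floor is hard to justify for a 2- or 3-option question: flag it for review.
```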

So this is another way in which IRT can help us understand the nature of the content we’ve written and how our users are interacting with it, and allows us to take action to fix any exercises that seem to be performing strangely.

So, why IRT?

The key point is that IRT allows us to measure both learner and item properties even with “unbalanced” samples (which is almost always what we have). You can learn about your learners from your materials, and about your materials from your learners.

Another key advantage is that IRT helps you detect and measure which questions are performing poorly so that your content team can step in and make changes.

It also allows us to prepare adaptive tests — more on this in a later post.

Next time…

We’ll get down into some more technical details, and dig into how IRT models are trained. Until then!
