The Busuu Placement Test, Part I: Computerised Adaptive Testing (CAT)

Stanisław Pstrokoński
Published in Busuu Tech
May 16, 2023

Imagine that you had to assess somebody’s language level in as short a test as possible. How would you do it?

At Busuu, we’ve developed a new computerised adaptive test based on Item Response Theory (IRT), which lets us assess a language level in as few as three questions. We managed to improve placement accuracy, reduce the number of questions needed, and increase conversions!

In this blog post, I’ll cover these questions:

  1. Why does Busuu need a Placement Test?
  2. What would a good Placement Test be like?
  3. How did we use data science to make the Placement Test better?

Why does Busuu need a Placement Test?

Not all newly registered users of Busuu are absolute beginners in their target language. A Placement Test is designed to decide where in the course the users should start learning so they don’t have to trudge through material they already know, but equally don’t start from somewhere too difficult for them.

We know from user interviews that some people use the Placement Test as a general language level test rather than specifically for placement. We don’t recommend this, as the test is designed to trade off accuracy against speed, but the test should give a fair assessment of somebody’s overall language level even so.

From a business point of view, we know that a large proportion of registered users (around 50% in the middle of 2022) use the Placement Test, and many convert from free to paid users soon after this experience. Clearly, it’s essential to make this a satisfying feature.

What would a good Placement Test be like?

We aim to make a test that produces an accurate picture of somebody’s language proficiency as quickly as possible.

Let’s unpack that.

What is a test, actually?

You may want to read our upcoming blog post: What’s the difference between an assessment, a measurement, and a test?

The TL;DR is that a test systematically elicits behaviour through set tasks. This sample of behaviour is used to estimate latent properties of the testee (in our case, their language knowledge) by assigning numbers or categories to those properties (i.e. tests are quantitative in their results). The tasks should relate to the latent properties in sensible, theoretically sound ways.

Speed and accuracy

These two aims are in tension, as by shortening a test, we generally make it less accurate. We solve this issue by specifying a ‘minimum acceptable’ level of accuracy, and a ‘maximum acceptable’ length, so that we stop the test when we have either reached acceptable accuracy or exceeded the maximum acceptable length.

However, one way of speeding up the test without sacrificing accuracy is by making it adaptive. You can think about how this would work by imagining yourself testing somebody in an interview format. If you asked basic questions and they didn’t know what you were talking about, you would realise they probably don’t know very much and would move on to simpler questions. Conversely, if they answered the first few questions swimmingly, you would press on with more challenging questions, realising they’re at a higher level than you assumed.

Just like Goldilocks and her porridge, in adaptive testing we aim to match the most suitable question to each person.

The benefit of adaptive testing is that you don’t waste time asking beginners hard questions or asking advanced learners easy questions. Those questions aren’t very informative, as beginners will almost certainly get hard questions wrong, and advanced learners will almost always get easy questions right. It doesn’t take much to get a ballpark impression of somebody’s level (e.g. after three incorrect answers at the start, you know they’re not advanced). Then you can focus on asking more informative questions, skipping other questions and saving time.

What is language proficiency?

We can characterise somebody’s level of language proficiency in several ways. We could aim to know exactly which words and grammar rules they know, so the output is, for example, a collection of vectors representing their knowledge state relating to this language. Or we could simply bundle what we know about them into a single label like ‘beginner’ or ‘upper intermediate’. We could think of the former as a ‘granular’ or ‘reductionist’ view of their language knowledge and the latter as a ‘holistic’ view.

While a detailed representation of a learner’s knowledge could be invaluable for personalisation, it is technically difficult to achieve, not necessary for this use case, and clashes with the aim of finishing the assessment quickly. For this reason, we should be satisfied with a holistic approach that assigns a label to a learner’s proficiency level.

Introducing Computerised Adaptive Testing (CAT)

Not this kind of testing.

The theory of computerised adaptive testing (CAT) gives us an approach to maximise the speed and accuracy of a test through adaptivity. It’s the approach that we’ve taken in making the Placement Test.

There’s a basic loop behind any computerised adaptive test. This is pretty much the definition of how any algorithm would execute such a test.

Generic CAT algorithm.

The key steps we can influence in this process are Set up / update user representation, Ready to stop the test?, and Select next task for user. That’s because, as data scientists, we get to choose both the user representation and the task representation, thanks to the fundamental duality of educational materials (see Part II of this article series for more details).
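To make that loop concrete, below is a minimal, generic sketch of it in Python. It is deliberately abstract: the callbacks passed in correspond to the boxes in the flowchart, and all of the names are illustrative placeholders rather than Busuu’s actual implementation.

def run_adaptive_test(prior, select_next, grade, update, ready_to_stop, report):
    """Generic CAT loop: keep selecting, grading and updating until we can stop."""
    user_representation = prior  # set up the user representation
    asked = []                   # tasks the user has already seen
    while True:
        task = select_next(user_representation, asked)  # select next task for user
        asked.append(task)
        answered_correctly = grade(task)                # present the task and grade it
        user_representation = update(user_representation, task, answered_correctly)
        if ready_to_stop(user_representation, len(asked)):
            return report(user_representation)          # e.g. the most likely level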

Below is a more detailed picture of Busuu’s specific solution to this problem. This is the current version of the Placement Test — for brevity, we won’t go into the previous version’s algorithm here. In the rest of this article, I will describe the current algorithm in more detail.

Busuu Placement Test’s CAT algorithm

Item Response Theory (IRT) — the model behind the algorithm 💻

The flowchart above shows the CAT algorithm. Some parts of it are quite straightforward: Test too short? and Test too long?, for example, are simply comparisons against the minimum and maximum allowable test lengths. However, the algorithm depends on a model behind it that tells us about the properties of items and learners, answering questions like:

  • How hard is this item?
  • Is this a good test item? (As in, does it provide us with much information about people’s language level?)
  • Given that user #123 answered questions 1 and 2 correctly but question 3 incorrectly, what should we estimate their level to be? How confident are we in that estimate?

The model that helps us do this is called Item Response Theory, or IRT for short. If you want to read more details about IRT, you can look at the next blog article in this series. The rest of this article will refer to the IRT model’s role in the test without detailing much about how it works.

Steps of the CAT 🐾

This section will elaborate on the diagram of Busuu’s Placement Test algorithm.

1. Set user’s ability distribution to empirical prior

The test can place a user into any one of six levels. Since we are using a Bayesian approach, we assign a probability to the user belonging to each of these six levels. But the question remains: how should we set our prior distribution (i.e. the user representation before they start the test)?

We can do this through an empirical Bayes approach: by the time we were developing this enhanced version of the Placement Test, we already had a lot of user data. We used IRT and user history to figure out how previous test-takers were distributed across the levels.

Below is an example historical distribution of user abilities when taking the Placement Test. We use this to create a Bayesian prior for the new Placement Test.

historical_distribution = {
    "a1.1": 0.2,   # start of a1 - absolute beginner
    "a1.2": 0.3,   # halfway through a1 - false beginner
    "a2": 0.25,    # upper beginner
    "b1": 0.12,    # lower intermediate
    "b2": 0.08,    # upper intermediate
    "c1": 0.05,    # advanced
}
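For illustration, a prior like this can be produced simply by counting historical placements per level and normalising. The counts and variable names below are made up for the example rather than taken from production data.

# Hypothetical historical placement counts per level (illustrative numbers only).
historical_level_counts = {
    "a1.1": 20_000,
    "a1.2": 30_000,
    "a2": 25_000,
    "b1": 12_000,
    "b2": 8_000,
    "c1": 5_000,
}

total = sum(historical_level_counts.values())
# Normalise the counts into a probability distribution to use as the empirical prior.
empirical_prior = {level: count / total for level, count in historical_level_counts.items()}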

2. Give a question batch to the user

When the user starts the test, we give everyone the same three questions. At this stage, our estimate of everyone’s level is the same (the prior distribution), so there is nothing to personalise on, which is why everyone gets the same questions.

We decided to operate the algorithm in batches of three questions, as this would reduce the potential network latency in calling our API, meaning at least two-thirds of the questions would be lag-free. There’s a trade-off here between latency reduction and granularity of adaptivity, and we felt three questions were a good balance.

3. Grade question batch

This is simply a record of True or False for whether the user got each question right.

4. Update Bayesian distribution of user ability score based on IRT

Item response theory gives us:

  1. a difficulty estimate of the questions in our question bank, and
  2. a formula for calculating the probability that a person of a given ability level will answer a question of a given difficulty correctly (illustrated below).
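For illustration, the simplest such formula is the one-parameter (Rasch) logistic model shown below. Which IRT variant Busuu actually uses is covered in Part II, so treat the exact form as an assumption for the sketches that follow.

import math

def p_correct(ability, difficulty):
    """Probability of a correct answer under a one-parameter (Rasch) IRT model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# An easy item (low difficulty) is very likely to be answered correctly by a
# mid-ability learner, and a hard item is unlikely to be:
# p_correct(ability=0.0, difficulty=-2.0)  # ~0.88
# p_correct(ability=0.0, difficulty=2.0)   # ~0.12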

We can use the difficulty estimates and this formula to update the distribution representing what level we think the user has: by calculating the likelihood of the user’s answers at each of the levels, we obtain the posterior distribution. The result might look something like the following.

prior_dist = {
    "a1.1": 0.2,
    "a1.2": 0.3,
    "a2": 0.25,
    "b1": 0.12,
    "b2": 0.08,
    "c1": 0.05,
}
got_questions_right_posterior_dist = {
    "a1.1": 0.05,  # decreased
    "a1.2": 0.3,   # unchanged
    "a2": 0.30,    # increased
    "b1": 0.16,    # increased
    "b2": 0.11,    # increased
    "c1": 0.08,    # increased
}
got_questions_wrong_posterior_dist = {
    "a1.1": 0.71,   # increased
    "a1.2": 0.15,   # decreased
    "a2": 0.1,      # decreased
    "b1": 0.03,     # decreased
    "b2": 0.009,    # decreased
    "c1": 0.001,    # decreased
}

The code snippet above shows two examples: one where the user got all three questions right and one where they got them all wrong. Notice that when they did well, their probability of being an absolute beginner goes down and their probability of being more advanced goes up; when they did badly, the opposite happens. If the user gets two out of three questions right, the exact outcome depends on the difficulty of the questions they got right and wrong, but most likely their probability of being at a higher level would still increase, just not as much as when they got all three right.
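To make the update concrete, here is a rough sketch of how a posterior like the ones above could be computed, reusing the p_correct function and the prior_dist dictionary from earlier. The per-level ability values and the item difficulties are made-up placeholders, not Busuu’s calibrated parameters.

# Illustrative ability values on the IRT scale for each course level (assumed).
level_abilities = {"a1.1": -3.0, "a1.2": -2.0, "a2": -1.0, "b1": 0.0, "b2": 1.0, "c1": 2.0}

def update_distribution(level_dist, item_difficulty, answered_correctly):
    """One Bayesian update of the level distribution after a graded answer."""
    posterior = {}
    for level, p_level in level_dist.items():
        p_right = p_correct(level_abilities[level], item_difficulty)
        likelihood = p_right if answered_correctly else 1.0 - p_right
        posterior[level] = p_level * likelihood  # prior x likelihood
    total = sum(posterior.values())
    return {level: p / total for level, p in posterior.items()}  # normalise

# Apply the update once per graded question in the batch (difficulties made up).
posterior = prior_dist
for difficulty, correct in [(-2.5, True), (-1.5, True), (-0.5, True)]:
    posterior = update_distribution(posterior, difficulty, correct)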

5. Ready to stop test?

We break this stage down into three questions; a sketch of the resulting decision logic follows the list.

  • Any of the ability score p’s > 0.9? This would mean that we are highly confident that somebody belongs to a particular category. We are ready to tell the user what level they are.
  • If no → Test too long? If we still aren’t over 90% sure that somebody belongs to a particular level, we might still want to end the test if it has been going on for a long time. It would be a bad user experience to be stuck in a test, plus it prevents a situation where the algorithm doesn’t progress as expected.
  • If yes → Test too short? We added this because we felt it would be a bad user experience to be ‘consigned’ to absolute beginner after a mere three questions, even if the algorithm was highly confident of their level. We felt the user would prefer more of a chance to prove themselves.
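Putting those three checks together, a minimal sketch of the stop decision might look like this. The 0.9 threshold comes from the first check above; MIN_QUESTIONS and MAX_QUESTIONS are illustrative placeholders rather than the production values.

CONFIDENCE_THRESHOLD = 0.9  # from the first check above
MIN_QUESTIONS = 6           # illustrative placeholder, not the production value
MAX_QUESTIONS = 15          # illustrative placeholder, not the production value

def ready_to_stop(level_dist, n_questions_asked):
    """Decide whether to end the test, mirroring the three checks above."""
    confident = max(level_dist.values()) > CONFIDENCE_THRESHOLD
    if confident:
        # Confident, but don't end the test if it would feel uncomfortably short.
        return n_questions_asked >= MIN_QUESTIONS
    # Not confident yet: only stop if the test has gone on too long.
    return n_questions_asked >= MAX_QUESTIONS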

6a. Select next task for user

This occurs if we are not ready to end the test yet, and is made up of two sub-processes.

  1. Calculate information score of remaining questions. Item Response Theory has a method of calculating how informative different questions would be about a user, given our current estimate of their level. By ‘remaining’ questions, we mean only those the user has not seen yet.
  2. Select questions with the highest information score. This is the essence of adaptivity — we are selecting the questions best for that user given our current estimate of their language level (see the sketch after this step).

Once this is done, we continue from step two.
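As an illustration of what these two sub-processes could look like in code: under a Rasch model, the Fisher information of an item for a learner of ability θ is p(1 − p), where p is the probability of a correct answer. The sketch below reuses p_correct and level_abilities from earlier and averages the information over our current belief about the learner’s level; the exact information measure and selection strategy used in production are not detailed here, so treat this as an assumption.

def item_information(item_difficulty, level_dist):
    """Expected Fisher information of an item under our current belief about
    the learner's level (Rasch item information is p * (1 - p))."""
    info = 0.0
    for level, p_level in level_dist.items():
        p = p_correct(level_abilities[level], item_difficulty)
        info += p_level * p * (1.0 - p)
    return info

def select_next_questions(level_dist, question_difficulties, seen_ids, batch_size=3):
    """Pick the unseen questions with the highest expected information."""
    remaining = {qid: d for qid, d in question_difficulties.items() if qid not in seen_ids}
    ranked = sorted(remaining, key=lambda qid: item_information(remaining[qid], level_dist),
                    reverse=True)
    return ranked[:batch_size]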

6b. Return level with highest probability

This occurs if we are ready to end the test. We find the level that seems most likely according to their current posterior and return it to the app, which informs the user and starts them at the corresponding point in the course.
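In code, this final step is simply picking the level with the highest posterior probability, for example:

def final_level(level_dist):
    """Return the level with the highest posterior probability."""
    return max(level_dist, key=level_dist.get)

# e.g. final_level(got_questions_wrong_posterior_dist) -> "a1.1"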

Conclusion: what happened when we made these changes?

The new Placement Test was released in the first half of 2022.

Improving the Placement Test using CAT and IRT had several benefits.

First of all, and most strikingly, it increased conversions from free to paid subscriptions significantly for users who took the test. This is concrete evidence that a data science intervention made a difference to the business, and it’s the sort of thing the higher-ups like to hear most… 💸

Secondly, it allowed us to statistically measure content properties in a new way. Previously some at Busuu had been using pass rates to measure question difficulty and drop-off rates to detect faulty questions. Now, we have a new way to do this that can help our learning designers understand their content better and improve it in future. 📈

Finally, it opened the door to further user modelling and personalisation. This was one of our first major attempts to create an adaptive experience, as well as a quantitative model of user knowledge and user-content interaction. 🎓

Next time…

In Part II, we’ll look deeper into how Item Response Theory works. See you there!
