The World Bank just released a paper claiming that “AI-powered tutoring” had a “transformative impact” on students in Nigeria. But if you read the paper, there are some big, glaring issues with the experiment that undermine the splashy result in obvious ways.
What They Did
The researchers created an after-school program in Benin City, Nigeria. The program met twice a week for 30-minute sessions. It took place over six weeks during summer 2024, at the end of the Nigerian school year. (Their 180-day academic calendar stretches from September through July.)
You can get a sense of how the after-school program worked from a video shared on World Bank Africa’s Facebook page. There’s a teacher in the room. There are about 30 kids, paired up, sitting in front of computers. They have been given English grammar assignments that require interacting with Microsoft Copilot.
The activities began with the students typing a lengthy prompt into Copilot, explaining that they wanted the model to act as a “well-seasoned English grammar tutor.” They told the model that the current lesson was about “adjectives” or “clauses” or whatever else it happened to be about. Then the kids were told to constructively interact with the model—for example, by asking it to create a sentence with an independent clause that describes Benin Kingdom’s history.
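To make that setup concrete, here is a rough sketch of the kind of role-setting prompt being described. The phrasing and the lesson topic below are my own invention, not the study’s actual script; only the quoted role description and tone instruction echo what the paper reports.

```python
# Rough sketch of the kind of role-setting prompt described above.
# The wording and the `lesson_topic` value are illustrative, not the study's
# actual script; the quoted role and tone phrases come from the paper's description.
lesson_topic = "independent and dependent clauses"  # hypothetical topic of the day

role_prompt = (
    "You are a well-seasoned English grammar tutor. "
    f"Today's lesson is about {lesson_topic}. "
    "Reply to my questions in a motivational and engaging tone, "
    "ask me questions or propose exercises on the topic, "
    "and give me hints or corrective feedback when I need them."
)

print(role_prompt)
```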

Now, could learning be happening here? Absolutely. Students are reviewing grammar taught in their normal English class. A teacher is there to supervise and help. Kids are working with classmates. They’re practicing writing, reading, and speaking English, which is not their first language.
This all seems more than fine, a worthy teaching experiment. The issues I see aren’t with the program but with the experiment attempting to capture its impact.
“Randomized Controlled Experiment”
The after-school program took place in nine schools, but most students didn’t take part. “All first-year senior secondary school students in these schools were informed about the program,” the researchers write. Kids had ten days to sign up. Then, randomly, 657 of those kids were admitted into the program and 671 went about their business as usual.
Well, to be honest, this ruins everything.
I actually think the researchers are pretty convincing that their after-school program was good for the kids who took it. The researchers got access to the kids’ third-term exam scores, and the treatment group did 0.206 standard deviations better than the control, even when performance on the second-term exam is controlled for. The program wasn’t a waste of their time.
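For readers who want the mechanics of “controlled for”: the standard move is to regress the later exam score on a treatment indicator plus the earlier score. Here is a minimal sketch with simulated data; the statsmodels setup and the assumed ~0.2 SD effect are mine, not the paper’s.

```python
# Minimal sketch of estimating a treatment effect while controlling for a
# baseline score. The data are simulated and the numbers are illustrative;
# this is not a reanalysis of the paper.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1328                                  # roughly the number of randomized students
treated = rng.integers(0, 2, n)           # 1 = admitted to the after-school program
second_term = rng.normal(0, 1, n)         # standardized baseline exam score
# Assume a ~0.2 SD true effect purely for illustration.
third_term = 0.5 * second_term + 0.2 * treated + rng.normal(0, 1, n)

df = pd.DataFrame({"third_term": third_term,
                   "second_term": second_term,
                   "treated": treated})

model = smf.ols("third_term ~ treated + second_term", data=df).fit()
print(model.params["treated"])            # estimated treatment effect on the simulated score
```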
But that’s all this study is capable of showing—that the program wasn’t literally a waste of time. Because the control group, academically speaking, didn’t do anything. I mean presumably they did lots of things—played soccer, hung out with friends, cooked food, whatever. But fundamentally the after-school group studied more and the control group did not.
It didn’t have to be this way. The control group could have been assigned sections of a text to read with partners. Or they could have watched instructional videos from YouTube on their computers, putting ChatGPT head to head with older forms of digital media. Listen, they could have had the teachers just lecture those 30 kids for the same cost. As is, they compared their after-school program to nothing. They therefore only proved that their program is better than nothing.
That’s the big problem, but there are others, all of which are common with brief edu interventions:
To be eligible for the study, kids had to volunteer. So all the kids in both the treatment and the control groups were highly motivated to study with ChatGPT.1
They also report effects from their intervention on assessments the researchers themselves designed and administered. The effect sizes are larger for the researcher-designed assessment, and the big ridiculous claims (“equivalent to 1.5 to 2 years of business-as-usual schooling”) rely on those results; a back-of-the-envelope sketch of how that “years of schooling” conversion typically works appears below. Hilariously, the researcher-designed assessment includes “AI literacy,” and they report that their intervention had a big impact on kids’ AI knowledge. Well yeah, I sure hope it did!
Dropout. Only 422 students from the treatment group completed the final assessment. It was even harder to get the control group to take the final test—only 337 control kids completed it. This is better than if kids were dropping out of the treatment group alone, but selective attrition is a classic way that effect sizes get inflated in edu studies: the most successful kids from an intervention are the ones who sit the final exams. You really want to reassure readers that this isn’t what is happening, but the authors don’t do a thorough job of that. (The toy simulation below shows how this kind of inflation can arise.)
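On the “years of schooling” framing: these equivalences are usually produced by dividing an effect size by an assumed typical annual learning gain, which is itself a contested benchmark. A back-of-the-envelope sketch, with numbers that are purely my own assumptions rather than figures from the paper:

```python
# Back-of-the-envelope sketch of how "years of schooling" equivalents are
# typically computed: effect size divided by an assumed annual learning gain.
# Both numbers are illustrative assumptions, not figures taken from the paper.
effect_size_sd = 0.3      # hypothetical effect on a researcher-designed test, in SD
annual_gain_sd = 0.2      # assumed "business-as-usual" gain per school year, in SD

years_equivalent = effect_size_sd / annual_gain_sd
print(f"Claimed equivalent: {years_equivalent:.1f} years of schooling")
```

Notice how sensitive the headline is to the assumed annual gain: halve it and the “years of schooling” figure doubles.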
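And here is a toy simulation, entirely my own and not based on the paper’s data, of why selective attrition matters: both arms have identical true score distributions, but if weaker students in the treatment arm are likelier to skip the final test, an apparent “effect” shows up anyway.

```python
# Toy simulation of how selective attrition can inflate an effect size.
# All numbers are made up; this is not a reanalysis of the paper's data.
import numpy as np

rng = np.random.default_rng(1)
n = 657                                    # students per arm, roughly matching the study

true_treat = rng.normal(0, 1, n)           # true standardized scores, identical distributions,
true_ctrl = rng.normal(0, 1, n)            # i.e. the intervention has zero real effect here

# Suppose completion of the final test rises with ability in the treatment arm only.
p_complete_treat = 1 / (1 + np.exp(-(0.3 + 0.8 * true_treat)))
completed_treat = true_treat[rng.random(n) < p_complete_treat]
completed_ctrl = true_ctrl[rng.random(n) < 0.5]   # control completion unrelated to ability

effect = completed_treat.mean() - completed_ctrl.mean()
print(f"Apparent effect despite zero true effect: {effect:.2f} SD")
```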
Each of these issues strips away our ability to generalize from this experiment. Is sitting with Copilot a good use of time? Would it work with typical students? How large of an impact did it have? Is it better than discussing a text with a classmate? Does AI tutoring work?
The answer: we don’t know, we don’t know, we don’t know, we don’t know, we don’t know.
“AI tutor”
There’s another thing bothering me about all this. The researchers told kids to prompt the model to act as a “well-seasoned English grammar tutor.” They instructed the model to “reply to my questions in a motivational and engaging tone.” The model was told to ask questions or propose exercises on the grammar topic of the day, to provide hints or corrective feedback when needed.
OK, fine. But is this tutoring?
I mean…not really? It’s just interacting with an LLM, which is a really cool dynamic text generator, but isn’t acting as a teacher. ChatGPT didn’t decide what students were ready to learn. It didn’t create the lesson. It didn’t even do all the instruction: “At the end of each session, the students were encouraged to reflect and discuss lessons learned and challenges encountered during session to facilitate knowledge sharing among the group,” the researchers write. That sounds like teaching! The human is doing it.
This is not school with a robot teacher. It’s more like when I tell my own students to take out Chromebooks and use a math practice app like Deltamath—except that in this case it’s an LLM, which is functioning like a magic, dynamic exercise book.
The world seems set on calling this sort of thing a “tutor” but I think that’s absurd. This is not a tutor. If you insist, call it AI-driven practice software.
Now, what’s cool is that Microsoft Copilot wasn’t designed to be practice software. It was designed to write emails and to let kids cheat on essays. That it helps highly-motivated Nigerian kids learn grammar is a cool bonus. I think that’s awesome. I’m optimistic that this means LLMs could be used to create more kinds of practice software for kids. Don’t ask me if that’s possible, I’m just a guy who gets paid to watch children calculate. But at least it’s a technological vision I can get behind—practice software for anything you want to learn.
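To spell out what “practice software” might mean here, a minimal sketch of the loop, assuming you have some chat model to call; llm_complete below is a stand-in placeholder, not a real Copilot or ChatGPT API.

```python
# Minimal sketch of an LLM-as-practice-software loop: generate an exercise,
# collect an answer, ask the model for feedback. `llm_complete` is a
# placeholder; swap in whatever model access you actually have.

def llm_complete(prompt: str) -> str:
    """Stand-in for a real chat-model call (Copilot, ChatGPT, or anything else)."""
    return "(model response would appear here)"

topic = "independent clauses"  # hypothetical grammar topic

for round_number in range(3):
    exercise = llm_complete(
        f"You are an English grammar tutor. Give me one short exercise about {topic}."
    )
    print(f"Exercise {round_number + 1}: {exercise}")
    student_answer = input("Your answer: ")
    feedback = llm_complete(
        f"The exercise was: {exercise}\n"
        f"The student answered: {student_answer}\n"
        "Give brief, encouraging corrective feedback."
    )
    print(feedback)
```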
But that’s not enough, is it? I guess every incentive is aligned to pretend that our sci-fi dreams and nightmares are coming true, right now, and that you’d better get on board or be left behind. It’s how you take a lovely little teaching experiment and present it as a “randomized controlled study” delivering “big results” on the “huge impact” of “AI tutoring.” In reality, it’s none of that.
Listen, the world is weird enough as it is. Please, can everyone try not to inflate this stuff further? I’m begging you.
1. They report that nearly 80% of participants across the study were female. I don’t know if that’s because some of the schools were all-girls or because girls volunteered at a higher rate. Either way, the researchers report that girls were especially helped by the program.
Comments

Agreed: to (seemingly) not control or adjust for study duration seems quite an omission by the authors. Time spent studying is one of the best correlates of grades achieved, I believe (?).
Even as someone with a (very, very) slight bias in favour of AI as a tool: if anything, the fact that the improvement was seemingly fairly modest, despite the control arm (as you say) probably just going home and playing football, and despite the students who *did* use the AI having opted in and thus possessing some degree of intrinsic motivation, is even more of a potential red flag.
However, we should probably also (?) consider the possibility that the kids who did opt in were under-performers, while the kids who opted out were not. I haven't been through the paper, so perhaps they control for this, but it doesn't seem outside the realm of possibility if not.
Re what constitutes 'tutoring', I think I'd disagree slightly? Based on the description, it doesn't seem (to me) incorrect to call this tutoring.
I 100% agree that these things are absolutely not replacing teachers anytime soon (anyone building an education tool knows this or is lying), but it seems to me that your reasons for why it's not 'tutoring' (such as "the LLM didn't create the lesson", etc.) *are* all within current LLMs' abilities (level depending), so the fact that the authors (for whatever reason) chose not to include this should potentially count against the authors and not against the technology, in my opinion.
Let's say the model used *were* to create some short lessons for the kids, ask questions, gauge their answers, evaluate, identify weak spots, and outline the next lesson (very, very loosely part of what shaeda.io will, hopefully, do). That would be about as close to tutoring as tutoring can get, no?
I wrote an entire post on some potential issues and blind spots when it comes to using AI in classrooms, so I always enjoy reading similar blogs. The classic fire analogy (powerful if used right, disastrous if used wrong) is very fitting.
The lack of RCT knowledge in this post is definitely concerning. Basic understanding of potential outcomes would be great for the author, plus maybe some tips on selection bias and internal validity. Go off! Give us nothing, King!