What Happens When AI is Used to Set Grades?

In 2020, with high school exams canceled in many countries, the International Baccalaureate Organization (IBO) deployed an AI to determine final grades based on current and historical data. When the results came in, many scores did not match the grades that teachers had predicted, as they generally had in previous years, prompting many people to appeal their grades. Unfortunately, the appeals system had not been changed from previous years, when it was assumed that students would write examination papers. Since university place offers in many countries are contingent on students achieving predicted grades, many students have been denied places at their universities of choice, which has resulted in a great deal of anger. This experience highlights the risks of delegating life-altering decisions to AI without considering how apparently anomalous decisions can be appealed and, if necessary, changed.

How would you feel if an algorithm determined where your child went to college?

This year Covid-19 locked down millions of high school seniors and governments around the world canceled year-end graduation exams, forcing examining boards everywhere to consider other ways of setting the final grades that would largely determine the future of the class of 2020. One of these boards, the International Baccalaureate Organization (IBO), opted to use artificial intelligence (AI) to help set overall scores for high-school graduates based on students’ past work and other historical data. (We use the term AI broadly to mean a computer program that uses data to execute a task that humans typically perform, in this case processing student scores.)

The experiment was not a success, and thousands of unhappy students and parents have since launched a furious protest campaign. So, what went wrong and what does the experience tell us about the challenges that come with AI-enabled solutions?

What is the International Baccalaureate?

The IB is a rigorous and prestigious high-school certificate and diploma program taught by some of the world’s best schools. It opens doors to the world’s leading universities for talented and hard-working students in over 150 countries.

In a normal year, final grades are determined by coursework produced by the students and a final examination administered and corrected by the IBO directly. The coursework counts for some 20-30% of the overall final grade and the exam accounts for the remainder. Prior to the exam, teachers provide “predicted” grades, which allow universities to offer places conditional on the candidates’ final grades meeting the predictions. The IBO also arranges independent grading of samples of each student’s coursework in order to discourage grade inflation by schools.

The process is generally regarded as a rigorous assessment protocol. The IBO has collected a substantial amount of data about each subject and school: hundreds of thousands of data points, in some cases going back over 50 years. Significantly, the relationship between predicted and final grades has been tight. At leading IB schools, over 90% of grades have been equal to the predicted grades, and over 95% of total scores have been within one point of the predicted score (total scores are set on a scale of one to 45).
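To make the two agreement figures concrete, here is a minimal sketch of how such rates could be computed from pairs of predicted and final total scores. The `ScorePair` type, the `agreement_rates` function, and the sample numbers are all illustrative assumptions; they are not IBO data or the IBO's method.

```python
from dataclasses import dataclass


@dataclass
class ScorePair:
    predicted: int  # teacher-predicted total score (scale of 1 to 45)
    final: int      # awarded total score (scale of 1 to 45)


def agreement_rates(pairs: list[ScorePair]) -> tuple[float, float]:
    """Return (share of scores equal to predicted, share within one point)."""
    if not pairs:
        raise ValueError("need at least one score pair")
    exact = sum(p.predicted == p.final for p in pairs)
    within_one = sum(abs(p.predicted - p.final) <= 1 for p in pairs)
    return exact / len(pairs), within_one / len(pairs)


# Made-up example data for illustration only, not real IB results.
sample = [ScorePair(40, 40), ScorePair(38, 37), ScorePair(42, 42), ScorePair(35, 31)]
exact_share, close_share = agreement_rates(sample)
print(f"equal to predicted: {exact_share:.0%}, within one point: {close_share:.0%}")
```

In a year where these shares fall far below their historical levels, the gap itself is a signal that the scoring process, rather than the students, may have changed.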

And then came Covid-19.

In the spring of 2020, IBO had to decide whether to allow the exams to proceed or cancel them and award grades some other way. Allowing exams risked the safety of students and teachers, and could create fairness issues — if, for instance, students in some countries were allowed to write the exams at home, while in others they had to sit exams at school.

Canceling the exams raised the question of how to assign grades, and that’s when the IBO turned to AI. Using its trove of historical data about students’ coursework and predicted grades, as well as data on the actual grades obtained in exams in previous years, the IBO decided to build a model to calculate an overall score for each student, in a sense predicting what the 2020 students would have scored in the exams. The model-building was outsourced to a subcontractor whose identity had not been disclosed at the time this article was published.

A crisis erupted when the results came out in early July 2020. Tens of thousands of students all over the world received grades that not only deviated substantially from their predicted grades but did so in unexplainable ways. Some 24,000, or more than 15% of all 2020 IB diploma recipients, have since signed the protest. The IBO’s social media pages are flooded with furious comments. Several governments have also launched formal investigations, and numerous lawsuits are in preparation, some for data abuse under the EU’s GDPR. What’s more, schools, students, and families involved in other high school programs that have also adopted AI solutions are raising very similar concerns, notably in the UK, where A-level results are due out on August 13th, 2020.

Limited scope for appeal

As the outrage has spread, one critical and very practical question has been consistently raised by frustrated students and parents: How can they appeal the grades?

In normal years, the appeals process was well-defined and consisted of several levels, from the re-marking of an individual student’s exam to a review of marks for course work by subject at a given school. The former means having another look at a student’s work – a natural first step when the grades were based on such work. The latter refers to an adjustment that IBO may apply to a school’s grading of course work should a sample of work independently assessed by the IBO produce substantially different grades, on average, from those awarded by the school. The appeal process was well-understood and produced consistent results, but was not used frequently, largely because, as noted, there were few surprises when the final grades came out.

This year, the IB schools initially treated appeals as requests for re-marks of student work. But this poses a fundamental challenge: the graded papers were not in dispute — it was the AI assessment that was called into question. The AI did not actually correct any papers; it only produced final grades based on the data it was fed, which included teacher-corrected coursework and the predicted grades. Since the specifics of the program are not disclosed, all people can see are the results, many of which were highly anomalous, with final scores in some cases well below the marks of the teacher-graded coursework of the students involved. Unsurprisingly, the IBO’s appeals approach has not met with success — it is in no way aligned with the way in which the AI created the grades.

What can we learn?

The main lesson coming out of this experience is that any organization that decides to use an AI to produce an outcome as critical and sensitive as a high-school grade marking 12 years of a student’s work needs to be very clear about how the outcomes are produced and how they can be appealed in the event that they appear anomalous or unexpected. From the outside, it looks as though the IBO may have simply plugged the AI into the IB system to replace the exams and then assumed that the rest of the system, in particular the appeals process, could work as before.

So what sort of appeals process should the IBO have designed? First of all, the overall process of scoring and, more important, appealing the decision should be easy to explain, so that people understand what each next step will be. Note that this is not about explaining the AI “black box,” as current regulators do when arguing about the need for “explainable AI.” That would be almost impossible in many cases, since understanding the programming used in an AI generally requires a high level of technical sophistication. Rather, it is about making sure that people understand what information is used in assessing grades and what the steps are in the appeal process itself. So what the IBO could have done instead was offer appellants the right to a human-led re-evaluation of anomalous grades, specify what input data the appeal committee would focus on in reanalyzing the case, and explain how the problem would be fixed.

How the problem would be fixed would depend on whether the problem turned out to be student specific, school specific, or subject specific; a single student’s appeal might well affect other students depending on what components of the AI the appeal may relate to.

If, for example, a problem with an individual student’s grade seems to be driven by the school-level data (possibly a number of students studying in that same school have had final grades that differed markedly from their predicted grades), then the appeal process would look at the grades of all students in that school. If needed, the AI algorithm itself would be adjusted for the school in question. The adjustment should leave the scores of every other school unchanged while keeping the new scores consistent across all schools. In contrast, if the problem is linked to factors specific to the student, then the analysis would focus on identifying why the AI produced an anomalous outcome for that student and, if needed, re-scoring that student and any other student whose grades were affected in the same way.
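The triage described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the IBO's procedure: the function name `classify_appeal` and both thresholds (`max_gap` and `school_share`) are hypothetical, and a real appeals committee would set and justify its own criteria.

```python
def classify_appeal(
    school_scores: list[tuple[int, int]],
    max_gap: int = 2,           # hypothetical: a gap above this counts as anomalous
    school_share: float = 0.3,  # hypothetical: share of anomalies that suggests a school-level cause
) -> str:
    """Decide whether an appeal looks school-level or student-specific.

    school_scores holds (predicted, final) total scores for every student
    in the appellant's school, including the appellant.
    """
    if not school_scores:
        raise ValueError("need scores for at least one student")
    anomalous = [abs(predicted - final) > max_gap for predicted, final in school_scores]
    share = sum(anomalous) / len(anomalous)
    # Many anomalies in one school point to the school-level inputs;
    # an isolated anomaly points to something specific to the student.
    return "school" if share >= school_share else "student"


# Made-up data: three of four students deviate sharply, so review the whole school.
widespread = classify_appeal([(40, 34), (38, 33), (42, 41), (36, 30)])
# Made-up data: only one of five students deviates, so review that student's inputs.
isolated = classify_appeal([(40, 34), (38, 38), (42, 41), (36, 36), (39, 39)])
print(widespread, isolated)
```

The point of the sketch is the routing decision: the same individual complaint leads to a different scope of review depending on what the rest of the school's results look like.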

Of course, much of this would be true of any grading process — one student’s anomaly might signal a more systematic failing in any grading process whether or not an AI is engaged. But the way in which the appeal process is designed needs to reflect the different ways in which humans and machines make decisions and the specific design of the AI used as well as how the decisions can be corrected.

For example, because AI awards grades on the basis of its model of relationships between various input data, there should generally be no need to look at the actual work of the students concerned, and corrections could be made to all affected students (those with similar input data characteristics) all at once. In fact, in many ways appealing an AI grade could be an easier process than appealing a traditional exam-based grade.
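As a rough illustration of correcting "all affected students at once", the sketch below finds students whose input data resemble the appellant's. Everything here is an assumption made for the example: the `StudentInputs` fields, the `similarly_affected` function, the choice to match on school and subject, and the `tolerance` parameter. The inputs the actual model used have not been disclosed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudentInputs:
    student_id: str
    school: str
    subject: str
    coursework: int  # teacher-graded coursework mark (hypothetical scale)
    predicted: int   # teacher-predicted grade (hypothetical scale)


def similarly_affected(
    appellant: StudentInputs,
    cohort: list[StudentInputs],
    tolerance: int = 0,
) -> list[StudentInputs]:
    """Return other students with the same school and subject whose
    coursework and predicted marks are within `tolerance` of the appellant's."""
    return [
        s
        for s in cohort
        if s.student_id != appellant.student_id
        and s.school == appellant.school
        and s.subject == appellant.subject
        and abs(s.coursework - appellant.coursework) <= tolerance
        and abs(s.predicted - appellant.predicted) <= tolerance
    ]


# Made-up cohort for illustration only.
cohort = [
    StudentInputs("a1", "School A", "Physics", 6, 7),
    StudentInputs("a2", "School A", "Physics", 6, 7),
    StudentInputs("a3", "School A", "Physics", 4, 5),
    StudentInputs("b1", "School B", "Physics", 6, 7),
]
matches = similarly_affected(cohort[0], cohort)
print([s.student_id for s in matches])
```

If the appeal for the first student succeeds, the same correction could then be applied to every student in `matches` without re-reading any of their work.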

What’s more, with an AI system, an appeals process along the lines described would enable continuous improvement to the AI. Had the IBO put such a system in place, the results of the appeals would have produced feedback data that could have updated the model for future uses — in the event, say, that examinations are again cancelled next year.

***

The IBO’s experience obviously has lessons for deploying AI in many contexts, from credit approval to job searches and policing. Decisions in all these cases can, as with the IB, have life-altering consequences for the people involved. It is inevitable that disputes over the outcomes will occur, given the stakes involved. Including AI in the decision-making process without carefully thinking through an appeals process and linking the appeals process to the algorithm design itself will likely end not only with new crises but potentially with a rejection of AI-enabled solutions in general. And that deprives us all of the potential for AI, when combined with humans, to dramatically improve the quality of decision-making.

Disclosure: One of the authors of this article is the parent of a student completing the IB program this year.
