DeepMind claims its AI performs better than International Mathematical Olympiad gold medalists
An AI system developed by Google DeepMind, Google’s leading AI research lab, appears to have surpassed the average gold medalist in solving geometry problems in an international mathematics competition.
The system, called AlphaGeometry2, is an improved version of a system, AlphaGeometry, that DeepMind released last January. In a newly published study, the DeepMind researchers behind AlphaGeometry2 claim their AI can solve 84% of all geometry problems over the last 25 years in the International Mathematical Olympiad (IMO), a math contest for high school students.
Why does DeepMind care about a high-school-level math competition? Well, the lab thinks the key to more capable AI might lie in discovering new ways to solve challenging geometry problems — specifically Euclidean geometry problems.
Proving mathematical theorems, or logically explaining why a theorem (e.g. the Pythagorean theorem) is true, requires both reasoning and the ability to choose from a range of possible steps toward a solution. These problem-solving skills could — if DeepMind’s right — turn out to be a useful component of future general-purpose AI models.
Indeed, this past summer, DeepMind demoed a system that combined AlphaGeometry2 with AlphaProof, an AI model for formal math reasoning, to solve four out of six problems from the 2024 IMO. In addition to geometry problems, approaches like these could be extended to other areas of math and science — for example, to aid with complex engineering calculations.
AlphaGeometry2 has several core elements, including a language model from Google’s Gemini family of AI models and a “symbolic engine.” The Gemini model helps the symbolic engine, which uses mathematical rules to infer solutions to problems, arrive at feasible proofs for a given geometry theorem.
Olympiad geometry problems are based on diagrams that need “constructs” to be added before they can be solved, such as points, lines, or circles. AlphaGeometry2’s Gemini model predicts which constructs might be useful to add to a diagram, which the engine references to make deductions.
Basically, AlphaGeometry2’s Gemini model suggests steps and constructions in a formal mathematical language to the engine, which — following specific rules — checks these steps for logical consistency. A search algorithm allows AlphaGeometry2 to conduct multiple searches for solutions in parallel and store possibly useful findings in a common knowledge base.
AlphaGeometry2 considers a problem to be “solved” when it arrives at a proof that combines the Gemini model’s suggestions with the symbolic engine’s known principles.
Owing to the complexities of translating proofs into a format AI can understand, there’s a dearth of usable geometry training data. So DeepMind created its own synthetic data to train AlphaGeometry2’s language model, generating over 300 million theorems and proofs of varying complexity.
The DeepMind team selected 45 geometry problems from IMO competitions over the past 25 years (from 2000 to 2024), including linear equations and equations that require moving geometric objects around a plane. They then “translated” these into a larger set of 50 problems. (For technical reasons, some problems had to be split into two.)
According to the paper, AlphaGeometry2 solved 42 out of the 50 problems, clearing the average gold medalist score of 40.9.
Granted, there are limitations. A technical quirk prevents AlphaGeometry2 from solving problems with a variable number of points, nonlinear equations, and inequalities. And AlphaGeometry2 isn’t technically the first AI system to reach gold-medal-level performance in geometry, although it’s the first to achieve it with a problem set of this size.
AlphaGeometry2 also did worse on another set of harder IMO problems. For an added challenge, the DeepMind team selected problems — 29 in total — that had been nominated for IMO exams by math experts, but that haven’t yet appeared in a competition. AlphaGeometry2 could only solve 20 of these.
Still, the study results are likely to fuel the debate over whether AI systems should be built on symbol manipulation — that is, manipulating symbols that represent knowledge using rules — or the ostensibly more brain-like neural networks.
AlphaGeometry2 adopts a hybrid approach: Its Gemini model has a neural network architecture, while its symbolic engine is rules-based.
Proponents of neural network techniques argue that intelligent behavior, from speech recognition to image generation, can emerge from nothing more than massive amounts of data and computing. Opposed to symbolic systems, which solve tasks by defining sets of symbol-manipulating rules dedicated to particular jobs, like editing a line in word processor software, neural networks try to solve tasks through statistical approximation and learning from examples.
Neural networks are the cornerstone of powerful AI systems like OpenAI’s o1 “reasoning” model. But, claim supporters of symbolic AI, they’re not the end-all-be-all; symbolic AI might be better positioned to efficiently encode the world’s knowledge, reason their way through complex scenarios, and “explain” how they arrived at an answer, these supporters argue.
“It is striking to see the contrast between continuing, spectacular progress on these kinds of benchmarks, and meanwhile, language models, including more recent ones with ‘reasoning,’ continuing to struggle with some simple commonsense problems,” Vince Conitzer, a Carnegie Mellon University computer science professor specializing in AI, told TechCrunch. “I don’t think it’s all smoke and mirrors, but it illustrates that we still don’t really know what behavior to expect from the next system. These systems are likely to be very impactful, so we urgently need to understand them and the risks they pose much better.”
AlphaGeometry2 perhaps demonstrates that the two approaches — symbol manipulation and neural networks — combined are a promising path forward in the search for generalizable AI. Indeed, according to the DeepMind paper, o1, which also has a neural network architecture, couldn’t solve any of the IMO problems that AlphaGeometry2 was able to answer.
This may not be the case forever. In the paper, the DeepMind team said it found preliminary evidence that AlphaGeometry2’s language model was capable of generating partial solutions to problems without the help of the symbolic engine.
“[The] results support ideas that large language models can be self-sufficient without depending on external tools [like symbolic engines],” the DeepMind team wrote in the paper, “but until [model] speed is improved and hallucinations are completely resolved, the tools will stay essential for math applications.”