AlphaGeometry and National Data Libraries
What one tiny database in Google DeepMind’s latest breakthrough tells us about how new scientific infrastructure can support UK AI opportunities
It is probably fair to say that the vibes on AI are reaching the “trough of disillusionment”. And that's OK. Technology follows a predictable pattern of experimentation, adaptation, and adoption. The initial hype often gives way to a period of scepticism, but this phase is essential for refining and understanding the true capabilities of new technologies. Besides, Twitter is acting as a great archive for the AI nimbys dropping many a cynical take. Let them enjoy their retweet endorphins while the real progress continues.
One area where the trough of disillusionment is certainly not being felt is “AI for Science”. In this domain, remarkable strides are being made, from research projects like Coscientist (a tool that effectively performs many of the analytical functions of a chemistry lab researcher) to machine-led discoveries in quantum entanglement.
The UK is very lucky to be home to the world’s leading AI for science company in Google DeepMind. Its latest breakthrough, AlphaGeometry, represents the next leap forward in the AI for science agenda.
For the uninitiated, the London-based lab has produced a program that, working alongside a companion system called AlphaProof, achieved a silver-medal performance at this year's International Mathematical Olympiad. Tim Gowers, a Fields Medallist, has a good thread looking at the significance of this, with lots of useful caveats.
DeepMind published a paper on AlphaGeometry in Nature with all the details you could want. But rather than focus on the technologies being used, I wanted to look at something else.
Within the AlphaGeometry paper lies a nugget of gold, offering valuable lessons for those in the UK government working on both the AI opportunities review and the national data library project.
Lean and Bop
The nugget for discussion here is a mathematical proof assistant and database known as Lean. Lean was created in 2013 by Leonardo de Moura, then a researcher at Microsoft Research.
Initially an experimental proof assistant, Lean evolved significantly with the release of Lean 3 in 2017, capturing the attention of mathematicians worldwide. The development of "mathlib", a vast community-driven library, further propelled Lean's popularity, formalising over 127,000 theorems and 70,000 definitions by 2023. And Lean 4, the current version, is a fully fledged programming language in its own right, able to generate efficient computer code.
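To give a flavour of what "formalising a theorem" means in practice, here is a minimal sketch of a mathlib-style entry (the theorem name and proof are my own illustration, not taken from mathlib itself):

```lean
import Mathlib

-- A minimal, illustrative mathlib-style entry. The statement is written
-- in Lean's logic, and the proof either type-checks or is rejected, so
-- every library entry is machine-verified rather than merely peer-reviewed.
theorem square_nonneg (x : ℝ) : 0 ≤ x ^ 2 := by
  positivity
```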
So how was this used by DeepMind?
“AlphaProof (which was used alongside AlphaGeometry) is a system that trains itself to prove mathematical statements in the formal language Lean… When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean.”
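To make "searching over possible proof steps" concrete, here is a toy sketch (mine, not DeepMind's): the same Lean goal can often be closed by different tactic sequences, and a search-based prover explores such alternatives until one of them type-checks.

```lean
import Mathlib

-- Two different proof paths to the same goal. A search-based prover
-- explores candidate steps like these until Lean accepts one of them.
example (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  exact Even.add ha hb    -- path 1: apply a known library lemma

example (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  obtain ⟨x, hx⟩ := ha    -- path 2: unfold the definition...
  obtain ⟨y, hy⟩ := hb
  exact ⟨x + y, by omega⟩ -- ...and rebuild the witness directly
```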
DeepMind is not the only AI company to leverage this resource: both Meta and OpenAI have used Lean in projects in recent years. In the last few weeks, an LLM-based co-pilot (LeanDojo) that leverages Lean to automate proofs has been released.
What was interesting about Lean in the context of the DeepMind example is that Lean is a relatively small data source. Despite its modest size, the Lean datasets are highly valuable because they contain verified mathematical ground truths that can be leveraged for significant scientific advances.
Every now and then, a few bits of infrastructure come along that lay the foundations for R&D in a field of AI. In biology, we have the Protein Data Bank, the best friend of many a computational biologist. Lean has the potential to play a similar role for AI-powered mathematical proofs.
National Data Libraries for Science
However, many scientific disciplines lack equivalents of such valuable datasets. This is one of the key areas where the national data library should focus its efforts. By creating and maintaining Lean-like datasets for different fields, the national data library can serve as a catalyst for growth in machine learning-based science.
Two equivalent datasets that could be built include:
Lean for physics proofs: formalising physical concepts, laws, and theories. This approach could bring greater mathematical rigour to theoretical physics, allowing for formal verification of complex derivations, consistency checks of theories, and potentially uncovering new insights through precise formulations of physical principles (a toy sketch of what an entry might look like follows this list).
GitLab for chemistry: repositories of chemistry information, including molecular structures and spectral data, alongside educational modules on the applications of different structures (a sketch of a possible record format also follows below).
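On the physics side, here is a minimal Lean sketch of what such an entry might look like. The definition and theorem names are invented for illustration, and a real library would need far more care with units and physical assumptions:

```lean
import Mathlib

-- Hypothetical "Lean for physics" entry (names invented for illustration).

/-- Kinetic energy of a point mass, in consistent units. -/
noncomputable def kineticEnergy (m v : ℝ) : ℝ := (1 / 2) * m * v ^ 2

/-- A machine-checked physical fact: doubling the speed
    quadruples the kinetic energy. -/
theorem kineticEnergy_double_speed (m v : ℝ) :
    kineticEnergy m (2 * v) = 4 * kineticEnergy m v := by
  unfold kineticEnergy
  ring
```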
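And on the chemistry side, a hypothetical Python sketch of what a single record in such a repository might contain. The field names and values are illustrative, not an existing standard:

```python
from dataclasses import dataclass

# Hypothetical record format for a "GitLab for chemistry" repository.
# Field names are invented for illustration, not an existing standard.
@dataclass
class MoleculeRecord:
    name: str
    smiles: str                      # machine-readable structure (SMILES)
    spectra: dict[str, list[float]]  # e.g. {"IR": [...peak positions...]}
    provenance: str                  # lab, instrument, or publication of origin
    license: str = "CC-BY-4.0"       # open licensing is what makes reuse possible

# Example entry (peak positions are illustrative only).
entry = MoleculeRecord(
    name="ethanol",
    smiles="CCO",
    spectra={"IR": [3350.0, 2970.0, 1050.0]},
    provenance="example dataset",
)
```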
Government isn’t going to be able to compete with industry to produce leading tools. What it can do to add value is coordinate different players to produce new kinds of assets that benefit the entire ecosystem. New Lean-like databases for areas such as physics and chemistry should be a priority.
New science governance organisations are playing an important role in developing this kind of infrastructure. A Focused Research Organisation (a body that concentrates its resources and expertise on investigating a specific, well-defined problem or area of study), the Lean FRO, is currently responsible for the management of Lean. The Sunak government recently committed to testing out new FROs, and this is a domain where it would make sense to establish some new ones.
As AI progresses through its cycle of disillusionment, it’s important to recognise the areas where genuine progress is being made. AI for Science is one such area, with initiatives like AlphaGeometry paving the way for future advancements. By focusing on creating valuable datasets for various scientific disciplines, the national data library can play a crucial role in fostering growth and innovation.
Let Gary Marcus be Gary Marcus; the real breakthroughs are just beginning.