axiom-explorer: an honest experiment in LLM-assisted mathematical discovery

The first preprint of a small experiment of mine is now on Zenodo. The preprint describes a tool I built —not a theorem I proved— and the workflow it implements. I want to tell you about it without the usual rhetoric about artificial intelligence solving mathematics, because the experiment is more modest than that and its honesty is precisely the part I think is worth sharing.

The question

There is a long-standing tension in mathematics between the speed at which new foundational ideas are produced and the speed at which we, as a community, can explore their consequences across branches. When a fresh axiomatic framework appears —condensed mathematics in 2019, the synthetic Stone duality programme in 2024— it takes years for the cross-pollination with other branches to crystallise into citable results. Most of that time is taken up not by hard proofs, but by the simpler work of finding out where two communities are quietly talking about the same thing in different vocabularies.

The question is whether a careful, governed workflow can do some of that finding-out work and present the results as falsifiable invitations to expert review.

The instrument

axiom-explorer is the instrument. It is a Python harness with a small set of external tools (arXiv, Mathlib4, Z3, SymPy, Lean 4, latexmk) and a large language model as the orchestrator. The orchestrator is the part most likely to make people nervous —reasonably so— and the design of the tool is mostly about constraining what the orchestrator is allowed to do.

The workflow runs in five phases:

Bibliometric mapping. The harness picks a small set of “modern productive” axiomatic seeds from distinct branches —in this run: the univalence axiom from homotopy type theory, condensed mathematics, perfectoid spaces, and synthetic Ricci curvature— and queries arXiv across their pairwise combinations using five different search strategies. The output is a per-pair density score.
Formal-state inventory. For each seed, the harness scans the Mathlib4 library and adjacent Lean and Agda projects, counting files, definitions, explicit TODOs.
Per-pair dossiers. For the pairs that survive Phase 0, it builds a dossier integrating literature, bridging functors, and small-scale numerical or symbolic experiments.
Deep dives. Surviving candidates from the dossiers are promoted to their own dossier with precise statement, evidence, and an explicit confidence level on a four-step ladder.
Synthesis and reporting. The surviving candidates and the methodology meta-analysis become a draft preprint that is then iterated against multiple rounds of AI peer review before any human reviewer is approached.

The confidence ladder

Every claim the harness surfaces gets a label:

L0 — verified mechanically. Z3 found no counterexample. Lean type-checks the axioms.
L1 — strong. A standard, named result with a citation.
L2 — plausible. Argued by the workflow, not directly verified.
L3 — speculative. A guess emerging from the cross-search, recorded as a question, never as an assertion.

This is the single most important downstream-protective device. It is the difference between a preprint that promises folklore-style certainty and one that hands the reader exactly the level of trust the workflow earned.

The stop rules

The orchestrator has four hard stop rules that the workflow refuses to bypass:

It does not publish to Zenodo or anywhere else without the human author’s explicit pressing of the publish button.
It does not send emails or any external communication; drafts are produced, the human author dispatches.
It does not claim novelty beyond a very narrow form: “this formulation does not appear as such in any single source we found.”
It does not impersonate mathematical authority. The author identity and the LLM-assisted production method are declared on the title page of every preprint output.

These are checked at code-review boundaries. The harness refuses to mark a candidate as “done” until the confidence ladder is filled in and the disclosure is in place.

The peer-review loop

This is the part I found most surprising during the experiment. The Phase 5 draft is submitted to several distinct LLM-based reviewers (different model families, different prompts) before any human review. Each round produces a structured critique that the orchestrator must address point by point, with explicit retraction of any earlier claim the review proved wrong.

In the case study we ran, four AI peer-review rounds (v1→v1.1, v1.1→v1.2, v1.2→v1.2.1, v1.2.1→v1.2.2) caught three substantive mathematical errors that the original draft contained:

An invariance failure in a key size invariant.
A cardinality bound on free profinite groups stated in a form inconsistent with the geometric attestations being cited.
A model-theoretic example that I had classified as trivial when in fact it is the absolute Galois group of the rationals.

A fifth round caught a citation error (an arXiv identifier pointing to a paper on turbulence rather than the intended work on Lascar groups). None of these would have been caught by a single LLM generation. All of them were caught before any expert human reviewer was approached.

The loop is not a substitute for expert review. It is, however, a remarkably cheap and effective first line of defence.

The case study, briefly

I ran the workflow on one quadruple of seeds. What came out is the subject of a separate, second preprint —the companion conjecture paper, forthcoming— that I will deposit under its own DOI in the next few days. The short summary is that the harness routed me to two recent and striking papers I had not read carefully (the Haine et al. work on the condensed homotopy type of a scheme, and Haine’s own 2026 paper unifying the proétale fundamental group of a scheme with the Lascar group of a complete first-order theory), and that cross-reading them suggested a candidate cardinality envelope that, as far as I could check, does not appear written down in any single source.

That candidate is presented as a falsifiable conjecture, with explicit confidence labels, and the companion preprint is an open invitation to specialists in condensed mathematics, model theory, and set theory to do exactly what specialists do: confirm it as folklore, find a counterexample, or refine it. Any of the three is an excellent outcome.

What I think the experiment is worth

I am not a doctorate-holding mathematician. I work at the boundary of software engineering and formal methods, and I have no business writing about $\infty$ -topoi without explicit help. The first thing the workflow does is make that obvious in every artefact it produces.

What I do think this experiment is worth, taken at the right scope, is:

It is honestly publishable as a registered observation, with a DOI, exactly because the framing is conservative and the stop rules are enforced.
It is cheaper than I expected to produce a preprint that survives an initial sweep of LLM peer review. The marginal cost of the next case study, with a different quadruple of seeds, is small.
It is a small piece of evidence that the current generation of language models, used carefully and under enforced stop rules, can act as a reasonable cross-search assistant for non-specialists, without overclaiming and without polluting the citation graph.

The risks are real and I name them explicitly in §6.2 of the paper: reputation laundering, citation-graph pollution, asymmetric burden on human reviewers. The defence is structural —the four stop rules and the confidence ladder— and I do not claim the defence is sufficient.

Why I am putting my name on this

I would not be doing this if the worst case were “I look silly”. The worst case for me is that an expert finds an embarrassing error and the preprint is corrected in v1.1. That is fine. The worst case for the community, if the experiment is right at all, is that workflows like this one start producing preprints faster than humans can review them. The methodology paper takes that risk seriously and addresses it directly.

I would rather contribute carefully and risk being told I am missing something obvious than not contribute at all. The thing I have is energy and a software engineer’s patience; the thing the experiment provides is a way to convert that patience into something a specialist can quickly accept, refute, or contextualise.

What is next

The case study preprint —the companion paper to this methodology one— will be deposited under its own DOI in the next few days, with the explicit invitation to specialists to engage with it. I will then write a short follow-up post here about its contents in the same public-facing style.

After that, if the methodology survives expert contact at all, I plan to try a second case study with a deliberately different seed quadruple, to test whether the workflow surfaces a different kind of finding or only the same shape of result.

If you read either preprint and have a counterexample, a known reference, or a sharpening, I would be genuinely grateful. The whole point of registered conjectural observations is to invite that.

The methodology paper is open access (CC-BY-4.0) under DOI 10.5281/zenodo.20184068. The repository is at github.com/Dredok/axiom-explorer.