3 Comments
Vadim

Omggggg, thanks for this new piece of wisdom! Figuring out how to make something like this happen was totally on my wish list, and it's essential for theorem marketplaces to work (if they're ever going to work at all).

Thoughts:

1. I totally think we ("we" in general, humans) should create things like these even when we expect they won't actually become additional infrastructure for mathematical progress or whatever. When we come up with novel and interesting things, I think it's worth building them even when it's not obvious how they're going to be useful, because we can't possibly think of all the uses people might find; indeed, people might not even realize they need a tool until they see it exists and go, "hmm, what about my use case?". "Do good and throw it into the sea", as the saying goes: implement things and leave them for people to find, because these things will find interesting uses on their own anyway.

One example: I implemented the theorem marketplace that had many problems, including the one described here, and it (at least partly?) inspired you to write this post. And that means it was totally worth the effort!

I would go ahead and just implement this, except I could barely stay on top of the theorem marketplace project when it was actively being developed, and I think I would need either people helping me build it open-source style or an actual salary or grant for it (I can't yet afford to just retire and pull whatever epistemology-creating shit I want). You've fired my imagination, though; I don't know what I'm going to do now.

2. Not sure we need to decide on one factorization to fund; perhaps we could just distribute funds according to how promising the factorizations are? (Maybe with some cutoff value? There's a rough sketch of this at the end of this comment.)

3. What about frontrunners? E.g., you submit a factorization, someone submits an improved version of the same factorization, and they get paid both for their improvement and for your initial work, instead of doing the recursive factorization thing?

4. You mentioned markets about the length of proofs, and also that futarchy would decide which factorization to fund. This reminds me: Lean Copilot has an integration with the LeanProgress (https://leandojo.org/leanprogress.html) model, which, for a given unfinished proof, predicts how many steps remain to finish it. Neither the dataset nor the model is open-source. For my project of using AlphaEvolve for theorem proving (https://ratandtiger.substack.com/p/alphaevolve-lean-theorem-prover), I started assembling a dataset to train a corresponding model (intending to make both open-source), but got sidetracked. If I finished it, it could be used for use cases like this one (to inform traders), too.

(Also, damn, I owe everyone updates about my projects...)
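Re point 2, here's a minimal sketch of what "distribute funds in proportion to promise, with a cutoff" could look like; the promise scores are assumed to come from the prediction markets, and all names and numbers here are made up for illustration:

```python
def allocate_funding(budget, promise_scores, cutoff=0.1):
    """Split a budget across factorizations in proportion to their
    market-estimated promise, dropping anything below the cutoff.

    promise_scores: factorization id -> score in [0, 1], e.g. a
    market-implied probability that this factorization pans out.
    """
    # Keep only factorizations whose promise clears the cutoff.
    eligible = {fid: s for fid, s in promise_scores.items() if s >= cutoff}
    total = sum(eligible.values())
    if total == 0:
        return {}  # nothing clears the bar; fund nothing
    # Proportional split of the budget among the survivors.
    return {fid: budget * s / total for fid, s in eligible.items()}


# Three competing factorizations of the same target theorem:
print(allocate_funding(
    budget=10_000,
    promise_scores={"fact_A": 0.6, "fact_B": 0.35, "fact_C": 0.05},
))
# fact_A and fact_B split the pot roughly 63/37; fact_C falls below the cutoff.
```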

Bolton

1. Yes, this is a valid point; blue-sky research has value.

2. Yes. Related to my and others' previous posts on the causality of prediction markets, I think there is room for things like randomly choosing some factorizations to attempt regardless of their promise, to check that the system is calibrated (there's a toy sketch of this at the end of this comment), or markets that analyze the overlap of different approaches so that we can have flexible backup plans, or markets about particular chokepoint theorems that will be critical no matter which path is taken. Lots of ideas you could have here.

3. This is a tricky problem with futarchy in general; I am reminded of this Hanson post: https://www.overcomingbias.com/p/futarchy-and-the-transfer-problem

4. Seems like great work. One thing I have noticed is that many Lean-focused LLMs, as well as many general-purpose LLMs set up to use Lean, produce proofs that are longer and less well-factored than they could be. I am interested in seeing more publicly accessible work on models that simply refactor working Lean code; see more of my thoughts here: https://github.com/BoltonBailey/mathlib4/issues/28
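To make the calibration idea from point 2 a bit more concrete, here is a toy sketch (all names, rates, and numbers are invented for illustration, not a worked-out mechanism): reserve a fraction of funding slots for factorizations drawn at random, so that market promise scores can later be checked against outcomes on an unbiased sample.

```python
import random

def pick_factorizations_to_fund(promise_scores, n_fund, audit_rate=0.2, rng=random):
    """Fund mostly the most promising factorizations, but reserve some slots
    for uniformly random picks. Outcomes on the random picks give an unbiased
    sample for checking whether the markets' promise scores are calibrated."""
    ranked = sorted(promise_scores, key=promise_scores.get, reverse=True)
    n_audit = max(1, int(n_fund * audit_rate))
    top_picks = ranked[: n_fund - n_audit]
    remaining = [f for f in ranked if f not in top_picks]
    audit_picks = rng.sample(remaining, min(n_audit, len(remaining)))
    return top_picks, audit_picks

top, audit = pick_factorizations_to_fund(
    {"fact_A": 0.7, "fact_B": 0.5, "fact_C": 0.2, "fact_D": 0.1},
    n_fund=3,
)
# Comparing how the audit picks actually fare against their market scores
# tells you whether the markets systematically misprice unpopular approaches.
```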

Vadim

2 & 3. Having something like chokepoint-theorem detection actually seems quite related to the frontrunner problem: both are variants of trying to predict the intersection between several approaches. We could, say, make a market that resolves to the percentage of overlap between different approaches (and use *that* to discourage frontrunners somehow?..), except some approaches will either not get funded or not succeed, so it's unclear how the market would resolve in that case.

(Now I'm thinking that the problem that you're describing in the original post is still haunting us; it's a winner-takes-all situation moved from theorem bounties to factorization funding...)

4. It's harder to get nice generative models than nice classification/regression models (which is what LeanProgress is). (Also, I don't happen to have crazy amounts of compute, and RLHFing things is generally difficult, I think?).

For starters, we could apply the same approach: an imperfect verifier / estimator telling us how "clean" or messy a given piece of code is. This could then be used for:

· RLHFing LLMs into producing proofs that are not only working but also not terribly structured,

· actually using AlphaEvolve or an open-source analogue to restructure input code (and actually being able to take bigger projects / wider dependencies into account, if our estimator is good enough!)

The problem is, I have no idea how to collect such data aside from hiring Lean programmers as data annotators and asking them questions along the lines of "how much do you hate this code?". Maybe we could parse the code itself, traverse the ASTs, and devise some measure of terribleness / factoredness. Or — wait for it — ask LLMs to do the data annotation. They may suck at producing well-factored Lean code, but they may be better at estimating code stink, given the right prompting and a wisdom-of-the-crowds kind of approach.
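For the cheap end of that spectrum, here's a minimal sketch of what such an estimator's inputs could look like: purely textual heuristics over the Lean source (a real version would traverse the elaborated AST instead, e.g. via Lean metaprogramming or LeanDojo), plus an averaged LLM-judge score as a noisy label. Every name and metric below is made up for illustration, not a validated measure of "factoredness":

```python
import re
from statistics import mean

def messiness_features(lean_source: str) -> dict:
    """Crude textual proxies for how badly factored a piece of Lean code is.
    These are stand-ins; real features should come from the AST / proof terms."""
    lines = [l for l in lean_source.splitlines() if l.strip()]
    decls = len(re.findall(r"\b(?:theorem|lemma|def)\b", lean_source))
    return {
        "lines_per_decl": len(lines) / max(decls, 1),  # giant monolithic proofs
        "max_line_len": max((len(l) for l in lines), default=0),
        "sorry_count": lean_source.count("sorry"),     # unfinished holes
        "max_indent": max((len(l) - len(l.lstrip()) for l in lines), default=0),
    }

def crowd_stink_score(lean_source: str, judges) -> float:
    """Wisdom-of-the-crowds label: average several LLM judges' numeric answers
    to 'how much do you hate this code?'. `judges` is a list of callables,
    i.e. hypothetical wrappers around whatever models are available."""
    return mean(judge(lean_source) for judge in judges)
```

Features like these could be the input side of the estimator, with the averaged LLM scores as noisy training labels.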

Is this another period in my life to get sucked into a passion project like this? Might actually be useful to someone...

UPD: Sorry, I only now see that you have a bunch of heuristics in your issue; they could totally be used as input / part of the features for the estimator!