-
Notifications
You must be signed in to change notification settings - Fork 629
permit up to 99% assume() failures #4643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
permit up to 99% assume() failures #4643
Conversation
0b28145 to
76d3207
Compare
Liam-DeVoe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ajdavis!
I'd love to see two tests:
- If we unconditionally
assume(False), Hypothesis runs exactlyINVALID_THRESHOLD_BASEtest cases before erroring. - If we
assume(False)for all butntest cases, Hypothesis runs exactlyINVALID_THRESHOLD_BASE + n * INVALID_PER_VALIDtest cases before erroring.
| # Statistical thresholds for assumption satisfaction rate. | ||
| # We want to stop when we're 99% confident the true valid rate is below 1%. | ||
| # | ||
| # With k valid examples, we need n invalid examples such that: | ||
| # P(seeing <=k valid in n+k trials | true rate = 1%) <= 1% | ||
| # | ||
| # For k=0: (0.99)^n <= 0.01 → n >= ln(0.01)/ln(0.99) ~= 459 | ||
| # Each additional valid example adds ~153 to the threshold (solving the | ||
| # cumulative binomial for subsequent k values). | ||
| # | ||
| # Formula: stop when invalid_examples > INVALID_THRESHOLD_BASE + INVALID_PER_VALID * valid_examples | ||
| INVALID_THRESHOLD_BASE = 459 | ||
| INVALID_PER_VALID = 153 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems cheap enough that we should define INVALID_THRESHOLD_BASE, INVALID_PER_VALID = _calcuate_thresholds() and compute it at runtime, in case we ever want to change the threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I generated _calculate_thresholds() with Claude, of course, and I don't understand it. But it does produce 459 and 153, and I did a quick experiment to prove to myself that those numbers are reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw @Zac-HD I sat down with this formula and I'm pretty skeptical we can get a reasonable closed-form expression for INVALID_PER_VALID: https://claude.ai/share/a7c1e28e-8f88-48d2-a9ff-d8bb89461cb4.
I think INVALID_THRESHOLD_BASE is correct, but that an estimation of increment based on vibes (and pegged to a particular confidence rate) isn't particularly useful for future us. I'd advocate for keeping the bayesian closed form of INVALID_THRESHOLD_BASE, but stating the increment as 1 / min_valid_rate rather than a bayesian confidence estimate.
This will be systematically more aggressive about exiting than our stated 99% confidence interval. I think this is fine; we currently had/have no confidence interval on master.
(bonus / longer chat that advocates for sqrt as a better approximation. I remain skeptical and do not stand by this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read both chats and I mostly followed them, though I think you're a bit more knowledgable about stats than me. I gather you don't want to add a SciPy dependency, nor do expensive calculations after each trial to determine whether to continue, but on the other hand we're not sure about any of the suggested approximations. I'm not passionately attached to this PR. I'll leave the final decision to you two.
Zac-HD
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving on the basis that the approximation seems fine to me, and is at minimum a huge improvement over the status quo - in addition to adjusting the actual rate.
Thanks again @ajdavis!
Closes #4623.