@ajdavis
Contributor

@ajdavis ajdavis commented Jan 18, 2026

Closes #4623.

@ajdavis ajdavis force-pushed the issue-4623-filter-condition branch from 0b28145 to 76d3207 Compare January 18, 2026 22:19
@ajdavis ajdavis changed the title permit up to 99% assume() failures #4623 permit up to 99% assume() failures Jan 19, 2026
Member

@Liam-DeVoe Liam-DeVoe left a comment

Thanks @ajdavis!

I'd love to see two tests:

  • If we unconditionally assume(False), Hypothesis runs exactly INVALID_THRESHOLD_BASE test cases before erroring.
  • If we assume(False) for all but n test cases, Hypothesis runs exactly INVALID_THRESHOLD_BASE + n * INVALID_PER_VALID test cases before erroring.
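
The two counts above can be sketched with a small simulation. This is a minimal sketch, not Hypothesis's implementation: `cases_run_before_stop` is a hypothetical helper, the constants are the values quoted in the diff, and the strict `>` comparison is taken from the formula in the diff comment. With that comparison the rule fires one case after the threshold is reached, so whether the real tests should expect "exactly" the threshold or the threshold plus one depends on how the actual check is written.

```python
import itertools

# Values quoted in the diff under review.
INVALID_THRESHOLD_BASE = 459
INVALID_PER_VALID = 153


def cases_run_before_stop(outcomes):
    """Count test cases executed before the stopping rule fires.

    `outcomes` is an iterable of booleans: True means the test case
    satisfied its assumptions, False means assume() rejected it.
    Hypothetical helper for illustration only.
    """
    valid = invalid = 0
    for ok in outcomes:
        if ok:
            valid += 1
        else:
            invalid += 1
        # Stopping rule as written in the diff comment (strict inequality).
        if invalid > INVALID_THRESHOLD_BASE + INVALID_PER_VALID * valid:
            break
    return valid + invalid


# Unconditional assume(False): every case is invalid.
all_invalid = cases_run_before_stop(itertools.repeat(False))

# n valid cases first, then assume(False) forever.
n = 3
mostly_invalid = cases_run_before_stop(
    itertools.chain([True] * n, itertools.repeat(False))
)
```

Here `all_invalid` counts only invalid cases, while `mostly_invalid` also includes the `n` valid ones, so a real test would need to decide which of the two it asserts on.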

Comment on lines 159 to 171
# Statistical thresholds for assumption satisfaction rate.
# We want to stop when we're 99% confident the true valid rate is below 1%.
#
# With k valid examples, we need n invalid examples such that:
# P(seeing <=k valid in n+k trials | true rate = 1%) <= 1%
#
# For k=0: (0.99)^n <= 0.01 → n >= ln(0.01)/ln(0.99) ~= 459
# Each additional valid example adds ~153 to the threshold (solving the
# cumulative binomial for subsequent k values).
#
# Formula: stop when invalid_examples > INVALID_THRESHOLD_BASE + INVALID_PER_VALID * valid_examples
INVALID_THRESHOLD_BASE = 459
INVALID_PER_VALID = 153
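
The derivation in the comment above can be checked numerically. The following is a rough sketch under stated assumptions: `base_threshold`, `binom_cdf`, and `exact_threshold` are hypothetical names, not functions from the PR, and the exact search simply brute-forces the cumulative binomial condition stated in the comment.

```python
import math


def base_threshold(min_valid_rate=0.01, confidence=0.99):
    """Smallest n with (1 - min_valid_rate)**n <= 1 - confidence (the k=0 case)."""
    alpha = 1 - confidence
    return math.ceil(math.log(alpha) / math.log(1 - min_valid_rate))


def binom_cdf(k, trials, p):
    """P(X <= k) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(k + 1)
    )


def exact_threshold(k, min_valid_rate=0.01, confidence=0.99):
    """Smallest n with P(at most k valid in n + k trials) <= 1 - confidence."""
    alpha = 1 - confidence
    n = 0
    # The CDF shrinks as n grows, so a linear search terminates.
    while binom_cdf(k, n + k, min_valid_rate) > alpha:
        n += 1
    return n


base = base_threshold()  # 459 for a 1% valid rate at 99% confidence
```

Comparing `exact_threshold(k)` for a few values of `k` against `459 + 153 * k` is one way to gauge how closely the linear rule tracks the exact condition.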
Member

This seems cheap enough that we should define INVALID_THRESHOLD_BASE, INVALID_PER_VALID = _calculate_thresholds() and compute it at runtime, in case we ever want to change the threshold?

Contributor Author

Done. I generated _calculate_thresholds() with Claude, of course, and I don't understand it. But it does produce 459 and 153, and I did a quick experiment to prove to myself that those numbers are reasonable.

Member

@Liam-DeVoe Liam-DeVoe Jan 27, 2026

fwiw @Zac-HD I sat down with this formula and I'm pretty skeptical we can get a reasonable closed-form expression for INVALID_PER_VALID: https://claude.ai/share/a7c1e28e-8f88-48d2-a9ff-d8bb89461cb4.

I think INVALID_THRESHOLD_BASE is correct, but that an estimate of the increment based on vibes (and pegged to a particular confidence rate) isn't particularly useful for future us. I'd advocate for keeping the Bayesian closed form of INVALID_THRESHOLD_BASE, but stating the increment as 1 / min_valid_rate rather than as a Bayesian confidence estimate.

This will be systematically more aggressive about exiting than our stated 99% confidence interval. I think this is fine; we currently had/have no confidence interval on master.


(bonus / longer chat that advocates for sqrt as a better approximation. I remain skeptical and do not stand by this.)
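
The suggestion above (closed-form base, increment of 1 / min_valid_rate) might look roughly like this. It is a sketch only: `suggested_thresholds` and its parameter names are assumptions made for illustration, not the code that was merged.

```python
import math


def suggested_thresholds(min_valid_rate=0.01, confidence=0.99):
    """Return (base, per_valid) under the simplified proposal.

    base: closed form for zero valid examples, as in the diff comment.
    per_valid: 1 / min_valid_rate, with no confidence term.
    Hypothetical helper for illustration only.
    """
    base = math.ceil(math.log(1 - confidence) / math.log(1 - min_valid_rate))
    per_valid = round(1 / min_valid_rate)
    return base, per_valid


base, per_valid = suggested_thresholds()

# Invalid-example budget after 10 valid examples under each increment.
budget_simplified = base + per_valid * 10
budget_original = 459 + 153 * 10
```

With a 1% minimum valid rate the simplified increment is 100 rather than 153, so the budget of invalid examples is smaller for any nonzero number of valid examples, which matches the remark about exiting more aggressively.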

Contributor Author

I read both chats and I mostly followed them, though I think you're a bit more knowledgeable about stats than I am. I gather you don't want to add a SciPy dependency, nor do expensive calculations after each trial to determine whether to continue, but on the other hand we're not sure about any of the suggested approximations. I'm not passionately attached to this PR. I'll leave the final decision to you two.

Member

@Zac-HD Zac-HD left a comment

Approving on the basis that the approximation seems fine to me, and is at minimum a huge improvement over the status quo - in addition to adjusting the actual rate.

Thanks again @ajdavis!

@Zac-HD Zac-HD merged commit 42126d6 into HypothesisWorks:master Jan 28, 2026
78 checks passed
@Liam-DeVoe
Member

Zac got to this in the middle of my doing a Bayesian deep dive 😄: #4650. Thanks a lot for contributing here @ajdavis!

Development

Successfully merging this pull request may close these issues.

Option to keep generating examples even if < 10% satisfy assumptions

3 participants