Benchmarks
Last updated: May 7, 2026
Why we publish benchmarks
Ovaela is a wellness platform. Wellness products that touch health data should show their work. This page is the public record of how we measure Ovaela's output against published reference standards, what we pre-register before each run, and what we deliberately defer to a future external evaluation.
Ovaela is a general wellness product. It is not a medical device, it does not diagnose, treat, or prevent any condition, and it does not claim clinical-grade accuracy. The numbers below describe how Ovaela's outputs compare to expert-annotated reference labels in third-party datasets. They do not describe whether Ovaela is safe or effective for any specific clinical use.
Headline result
In our internal benchmark on HealthBench Consensus (a 34-dimensional conversational-rubric set authored by 262 physicians across 60 countries and released by OpenAI; Arora et al., 2025), Ovaela scored a mean of 92.3% (Wilson 95% confidence interval 88.2% to 95.8%) on n=100 dev-split cases. This is an internal development benchmark, not a blind, independent, or production-performance result.
Ovaela's outputs were evaluated by an LLM judge against the Consensus rubric labels. The reference labels were produced by the dataset's physician panel and cross-validated among them; reported physician-physician agreement is 55 to 75% (an explicit Cohen’s κ was not published by the authors). Ovaela's outputs were not independently re-annotated by experts. Evaluation-side expert adjudication is planned for Q3 2026.
Source citation
- Arora, R. K. et al. HealthBench: Evaluating Large Language Models Toward Improved Human Health. arXiv preprint arXiv:2505.08775 (2025).
- License: MIT. Dataset published by OpenAI.
- Subset evaluated: HealthBench Consensus (n=100 cases drawn from the cross-validated Consensus subset).
- Reliability statistic cited from authors: 55 to 75% physician-physician agreement (Section 4.2 of the source paper).
Methodology
Three-way split
Every benchmark dataset Ovaela uses is partitioned into three splits, enforced at the database layer:
- Dev (1,000 to 5,000 cases): used for iterating on rules, prompts, and agents. Per-case inspection allowed.
- Validation (200 cases): aggregate accuracy only. Per-case inspection is disabled in the runner so we cannot tune the system to specific failures.
- Blind (500 cases): pre-registered, sealed, and read once per calendar quarter. A database trigger enforces a unique constraint per quarter so a blind run cannot be repeated within the quarter.
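The once-per-quarter rule for blind runs can be illustrated with a small sketch. It is not Ovaela's schema: the table name `blind_runs`, its columns, and the choice to scope uniqueness per dataset are assumptions made for the example, and it uses a plain SQLite UNIQUE constraint rather than a trigger.

```python
import sqlite3


def open_run_log() -> sqlite3.Connection:
    """Create an in-memory run log (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE blind_runs (
            dataset    TEXT NOT NULL,
            quarter    TEXT NOT NULL,   -- e.g. '2026-Q2'
            prereg_sha TEXT NOT NULL,
            UNIQUE (dataset, quarter)   -- at most one blind run per quarter
        )
        """
    )
    return conn


def record_blind_run(conn: sqlite3.Connection, dataset: str,
                     quarter: str, prereg_sha: str) -> bool:
    """Return True if the run was recorded, False if the quarter is already used."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO blind_runs (dataset, quarter, prereg_sha) "
                "VALUES (?, ?, ?)",
                (dataset, quarter, prereg_sha),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a second run in the same quarter.
        return False
```

Putting the check in the database rather than in the runner means a second attempt fails even if the runner code is bypassed or modified.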
Pre-registration
Before any blind run, a pre-registration file is committed to the repository specifying the dataset, exact n, the SHA-256 of the sorted case identifiers, the scoring rubric, the model name and version, the grader version SHA, the expected confidence-interval width at the anticipated accuracy, the budget cap, and the ablation pipelines that will be run on the same cases. The result file references the pre-registration SHA so any post-hoc edit is detectable in git history.
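As a minimal sketch of the case-identifier digest named above: the function name `case_set_digest` and the serialization (UTF-8, one identifier per line) are assumptions for illustration, since the canonical byte format is not specified on this page.

```python
import hashlib
from typing import Iterable


def case_set_digest(case_ids: Iterable[str]) -> str:
    """SHA-256 hex digest over the sorted case identifiers.

    Sorting first makes the digest independent of the order in which
    the identifiers were collected. Joining with newlines and encoding
    as UTF-8 is an assumed convention, not a documented one.
    """
    canonical = "\n".join(sorted(case_ids)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

Anyone holding the same list of case identifiers can recompute the digest and compare it with the pre-registered value; adding, dropping, or swapping a single case changes it.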
The most recent run's full results, methodology, and integrity disclosures are documented internally and summarized on this page.
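The expected confidence-interval width listed in the pre-registration can be estimated before a run. The sketch below uses the standard Wilson score interval for a binomial proportion; treating each case as a single pass/fail outcome is an assumption of the example, and the function names are illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: float, n: int,
                    z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z defaults to 95% two-sided)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def expected_width(anticipated_accuracy: float, n: int) -> float:
    """Width of the Wilson interval if the run lands on the anticipated accuracy."""
    low, high = wilson_interval(anticipated_accuracy * n, n)
    return high - low
```

Comparing `expected_width(p, n)` across candidate values of n shows how many cases a run needs before the interval is narrow enough to support the intended claim.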
Annotation κ vs evaluation κ
Two reliability statistics often get conflated in health-AI marketing. Ovaela separates them strictly:
- Annotation κ measures how well the reference-label producers agreed among themselves on the ground-truth answers. We cite the annotation reliability that source datasets publish (or note when authors did not publish an explicit Cohen’s κ).
- Evaluation κ would measure how well credentialed experts agree that Ovaela's outputs match the reference labels. Ovaela has not produced this yet. Until a credentialed advisory board is seated (target Q3 2026), Ovaela reports evaluation_reliability_method: llm_judge_only.
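For readers unfamiliar with the statistic, Cohen's κ corrects raw agreement between two raters for the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below is a generic two-rater implementation, not Ovaela's grader; the function name and label values are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)
```

Raw percent agreement and κ are different quantities. For example, two raters who agree on 3 of 4 items (75% agreement) with labels `["met", "met", "met", "not_met"]` and `["met", "met", "not_met", "not_met"]` have κ = 0.5, which is why a published agreement percentage cannot be restated as a κ value.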
Ablation reporting
Each blind run reports the same cases through multiple pipelines (rules-only baseline, single-agent, multi-agent, +memory, +consensus). Components that degrade accuracy versus the prior baseline are not shipped to production, regardless of in-development performance.
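The shipping rule above can be sketched as a simple gate. This is an illustration under one stated assumption: "prior baseline" is read as the most recent pipeline stage that was itself accepted. The function name and input shape are hypothetical.

```python
from typing import List, Sequence, Tuple


def components_to_hold_back(stages: Sequence[Tuple[str, float]]) -> List[str]:
    """Return the stages whose accuracy fell below the last accepted stage.

    `stages` is an ordered list of (pipeline_name, accuracy) pairs measured
    on the same cases, starting with the baseline.
    """
    if not stages:
        return []
    held_back: List[str] = []
    _, baseline_accuracy = stages[0]
    for name, accuracy in stages[1:]:
        if accuracy < baseline_accuracy:
            # Degrades versus the prior accepted baseline: do not ship.
            held_back.append(name)
        else:
            # Accepted: this stage becomes the baseline for the next one.
            baseline_accuracy = accuracy
    return held_back
```

Because every stage runs on the same cases, a drop between stages is attributable to the added component rather than to a different sample.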
Multi-dataset sanity check
Any advertised number is paired with a second number from a different dataset on the same pipeline in the same quarter. If the gap exceeds 10 percentage points, the discrepancy is resolved before publication.
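Expressed as a check, the rule is a one-line comparison. The function name and the use of percentage values (0 to 100) as inputs are assumptions for this sketch.

```python
def gap_needs_resolution(primary_pct: float, secondary_pct: float,
                         max_gap_points: float = 10.0) -> bool:
    """True when two same-pipeline scores differ by more than the allowed gap.

    Both scores are percentages from different datasets in the same quarter.
    The comparison is strict: a gap of exactly 10 points does not trigger it.
    """
    return abs(primary_pct - secondary_pct) > max_gap_points
```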
Required disclosures with every published number
For each benchmark run, Ovaela records the following metadata. The headline result above lists the key figures (n, confidence interval, dataset, split, annotation authority, and evaluation method); the complete set is documented internally with the run and available on request:
- Exact n
- Wilson 95% confidence interval (lower and upper)
- Dataset name, split, and cycle
- Model name, version, and release date
- Benchmark date and pre-registration file SHA
- Grader and rules git SHAs
- Ablation breakdown
- Annotation authority (third-party physician panel, government curation, single-editor curation, internal, or not applicable)
- Annotation physician count (when applicable) and the source-published reliability statistic verbatim
- Evaluation reliability method (llm_judge_only, llm_judge_with_physician_kappa, physician_only, or rules_only)
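One way to keep this list from drifting is to make it a typed record that a run cannot be saved without. The sketch below is hypothetical: the class and field names are invented for illustration, and only the four evaluation-method values are taken from this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class EvaluationReliabilityMethod(str, Enum):
    LLM_JUDGE_ONLY = "llm_judge_only"
    LLM_JUDGE_WITH_PHYSICIAN_KAPPA = "llm_judge_with_physician_kappa"
    PHYSICIAN_ONLY = "physician_only"
    RULES_ONLY = "rules_only"


@dataclass(frozen=True)
class RunDisclosure:
    """Metadata recorded with each benchmark run (field names are assumptions)."""
    n: int
    ci_lower: float                      # Wilson 95% lower bound, as a fraction
    ci_upper: float                      # Wilson 95% upper bound, as a fraction
    dataset: str
    split: str
    cycle: str
    model_name: str
    model_version: str
    model_release_date: str
    benchmark_date: str
    prereg_sha: str
    grader_sha: str
    rules_sha: str
    ablation: Dict[str, float]           # pipeline name -> accuracy
    annotation_authority: str
    annotation_physician_count: Optional[int]
    source_reliability_statistic: str    # quoted verbatim from the source
    evaluation_reliability_method: EvaluationReliabilityMethod

    def __post_init__(self) -> None:
        if self.n <= 0:
            raise ValueError("n must be positive")
        if not 0.0 <= self.ci_lower <= self.ci_upper <= 1.0:
            raise ValueError("confidence bounds must satisfy 0 <= lower <= upper <= 1")
        # Reject free-text values for the evaluation method.
        if not isinstance(self.evaluation_reliability_method,
                          EvaluationReliabilityMethod):
            raise ValueError("evaluation_reliability_method must be an enum member")
```

Using an enum for the evaluation method means a run cannot be labelled with anything other than the four documented values.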
Open questions and what comes next
A credentialed advisory board is the prerequisite for evaluation-side adjudication. Once it is seated (target Q3 2026), the same n=100 HealthBench Consensus subset will be re-graded by experts independent of Ovaela, and the resulting Cohen’s κ will be published.
Ovaela is also preparing a separate public benchmark repository (MIT license) with the harness, configurations, and result artifacts so the methodology is reproducible outside Ovaela's codebase.
Questions or peer-review feedback are welcome at admin@ovaela.ai.
Ovaela provides wellness information, not medical advice. It does not diagnose, treat, or prevent any condition, and it is not a substitute for professional medical care. The benchmarks above describe how Ovaela's outputs compare to third-party reference labels; they do not describe whether Ovaela is safe or effective for any specific clinical use. Always consult a qualified healthcare provider before making health decisions. Powered by AI, not a licensed healthcare professional.