Benchmarks

Production-locked numbers. Full corpora. No cherry-picked seeds.

Every metric on this page comes from a production training run on the full data corpus: no inflated smoke runs, no sub-sampled subsets, no best-of-N seed picks. Test AUROC, AUPRC, ECE, coverage, and bias values are quoted directly from timestamped log files. Raw logs are available to design partners under NDA.

MIMIC-IV v3.1 clinical prediction

Eight W-lanes locked on the full 546K-admit cohort.

Evaluation uses five-fold cross-validation plus a temporal test split (training data pre-2020, test data 2020 onward) on the full mimic4_full ingestion. The calibrator is picked automatically between Beta+Bayes and isotonic using validation ECE; the choice is made per workload, not per dataset family.
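A temporal split of the kind described above can be sketched in a few lines. This is a minimal illustration, not the production ingestion code: the record layout and the `admit_date` field are hypothetical, and because MIMIC-IV shifts dates for de-identification, a real pipeline would have to derive the calendar period from the dataset's anchor-year fields instead of reading raw dates.

```python
from datetime import date


def temporal_split(records, cutoff_year=2020):
    """Split records into (train, test) by admission year.

    Records admitted before `cutoff_year` go to train; the rest go to test.
    The `admit_date` key is a hypothetical field name used for illustration.
    """
    train = [r for r in records if r["admit_date"].year < cutoff_year]
    test = [r for r in records if r["admit_date"].year >= cutoff_year]
    return train, test


# Hypothetical toy records, not real admissions.
records = [
    {"id": 1, "admit_date": date(2017, 5, 2)},
    {"id": 2, "admit_date": date(2019, 12, 31)},
    {"id": 3, "admit_date": date(2020, 1, 1)},
    {"id": 4, "admit_date": date(2022, 8, 14)},
]

train, test = temporal_split(records)
print([r["id"] for r in train], [r["id"] for r in test])
```

Splitting on time instead of at random keeps future admissions out of training, so the test score reflects performance on data the model could not have seen.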

Clinical prediction lanes (full corpus, 5-fold CV + temporal test split)

Lane | Outcome | Cohort | Test AUROC / c-index | Calibration / secondary metric | Published reference
W1 | In-hospital mortality | n=546K admits | 0.9476 | Beta-Bayes ECE 0.0024 | Tomasev 2019: 0.92
W2 | 30-day all-cause readmission | n=534K excl. deaths | 0.7034 | Isotonic ECE 0.0040 | Rajkomar 2018: 0.75–0.76 (different cohort)
W3 | Sepsis-3 onset (first 48h ICU) | n=74,829 ICU stays | 0.8908 | AUPRC 0.8710 | Saria 0.83–0.85, Komorowski 0.85, Hyland 0.82–0.86
W5 | KDIGO AKI Stage 1+ (first 48h) | n=75K ICU stays | 0.8371 | AUPRC 0.6804 | Tomasev 2019: 0.82, Koyner 0.82
W11 (Cox) | Mortality time-to-event | n=546K admits | c-index 0.8019 | IPCW c 0.6468 · IBS 0.0841 | Cheng 2019 LSTM: 0.81–0.85
W11 (AFT) | Mortality time-to-event | n=546K admits | c-index 0.7707 | IPCW c 0.6375 | dual-comparator survival

All five lanes are locked under the per-workload auto-picked calibrator. The choice between Beta+Bayes and isotonic is driven by validation-set ECE, not preset.
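The selection rule above can be sketched as follows, assuming each candidate calibrator has already produced calibrated probabilities on the validation set. The equal-width binning, the bin count, and the toy probabilities are illustrative assumptions; the fitting of the Beta+Bayes and isotonic calibrators themselves is not shown.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: bin-weighted mean of |observed rate - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(observed - mean_conf)
    return ece


def pick_calibrator(candidates, val_labels, n_bins=10):
    """Return (best_name, scores) where best_name has the lowest validation ECE.

    `candidates` maps a calibrator name to its calibrated validation probabilities.
    """
    scores = {
        name: expected_calibration_error(probs, val_labels, n_bins)
        for name, probs in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical validation outputs for one workload.
val_labels = [0, 0, 1, 1]
candidates = {
    "beta_bayes": [0.10, 0.10, 0.90, 0.90],
    "isotonic": [0.35, 0.35, 0.65, 0.65],
}
best, scores = pick_calibrator(candidates, val_labels)
print(best, scores)
```

Running the selection once per workload is what allows different lanes to end up with different calibrators, as in the table.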

W12 anticoagulation sensitivity pentagon (heparin vs LMWH, MIMIC-IV ICU cohort, n=153K admits)

Estimator | Y_bleed ATT [95% CI] | Y_vte ATT [95% CI] | Direction
AIPW v3 (base) | +0.0253 [+0.0037, +0.0468] | +0.0766 [+0.0498, +0.1056] | Both positive (opposite of RCT)
AIPW v4 (severity-augmented) | +0.0271 | +0.0794 | Severity isn't the confounder
AIPW v5 (trajectory) | +0.0301 | +0.0802 | Trajectory isn't either
IV-LATE (2SLS, F=67,250) | +0.0625 [+0.048, +0.079] | +0.0102 [-0.002, +0.021] ✓ (CI includes 0) | IV recovers RCT non-inferiority on Y_vte
Neural counterfactual | +0.0255 (AIPW-equivalent) | deferred (M5 Pro 64GB) | Confirms unobservable confounder
Rosenbaum Γ-bound | Γ_zero = 1.06 | Γ_zero = 1.17 | Very sensitive (both)

Five methods, same cohort. None recovers the RCT direction on Y_bleed (Greinacher/Geerts RCTs); IV-LATE partially recovers it on Y_vte. Γ_zero = 1.06 means that an unmeasured confounder shifting the odds of treatment by as little as 6% is enough to overturn the bleed estimate. This quantitatively confirms the confounding-by-indication signal, which is the open-confounder benchmark contribution.
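To give a feel for how small Γ = 1.06 is, the sketch below uses the textbook matched-pair form of Rosenbaum's sensitivity model, in which the probability that a given unit in a pair is the treated one is bounded by 1/(1+Γ) and Γ/(1+Γ). This is an illustration of the Γ scale only; it is an assumption that the same model underlies the Γ_zero values above, and the function name is hypothetical.

```python
def treatment_probability_bounds(gamma):
    """Bounds on P(unit is the treated one) within a matched pair under Rosenbaum's model.

    gamma = 1 means no hidden bias (probability exactly 0.5);
    larger gamma allows the treatment odds of two matched units to differ by up to gamma.
    """
    if gamma < 1:
        raise ValueError("gamma must be >= 1")
    return 1 / (1 + gamma), gamma / (1 + gamma)


for gamma in (1.0, 1.06, 1.17, 2.0):
    lower, upper = treatment_probability_bounds(gamma)
    print(f"gamma={gamma:.2f}: treated-unit probability in [{lower:.4f}, {upper:.4f}]")
```

At Γ = 1.06 the bounds are roughly 0.485 and 0.515, so a hidden bias that moves treatment assignment only slightly away from a coin flip is already enough to matter.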

W13 anticoagulation positive control (DOAC vs warfarin, DR-ATT, AFib+CKD subgroup)

Outcome | ATT [95% CI] | Direction | RCT consistency
Y_stroke | +0.0084 [+0.0012, +0.0168] | DOAC marginally worse | RE-LY / ROCKET-AF / ARISTOTLE / ENGAGE-AF AFib+CKD subgroups
Y_bleed | -0.0396 [-0.0521, -0.0277] | DOAC ~27% relative reduction | RCT-consistent

Cohort n=8,990 → 4,220 after Crump-2009 trimming (53% dropped). First observational W-lane to recover the RCT direction on both outcomes. The result is robust to the trim threshold: α = 0.05/0.10/0.15 gives Δ ATT < 0.004.
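Propensity trimming in the style of Crump et al. (2009) can be sketched as keeping only units whose estimated propensity score lies in [α, 1 − α]. The fixed-threshold rule and the toy scores below are illustrative assumptions; the propensity model used for the W13 cohort is not shown here.

```python
def crump_trim(propensities, alpha=0.10):
    """Return indices of units whose propensity score lies in [alpha, 1 - alpha]."""
    return [i for i, e in enumerate(propensities) if alpha <= e <= 1 - alpha]


# Hypothetical propensity scores for seven units.
scores = [0.02, 0.08, 0.12, 0.50, 0.88, 0.93, 0.97]
for alpha in (0.05, 0.10, 0.15):
    kept = crump_trim(scores, alpha)
    print(f"alpha={alpha:.2f}: kept {len(kept)} of {len(scores)} units")

# Fraction dropped for the cohort sizes quoted above.
dropped_fraction = (8990 - 4220) / 8990
print(f"dropped: {dropped_fraction:.1%}")
```

Repeating the ATT estimate at each α and comparing the results is one way to run the trim-sensitivity check described above.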

FAERS pharmacovigilance

Pipeline B severity classifier — full 20M-row corpus.

Severity classification (temporal split: pre-2020 train, 2020–2025 test)

Split | Cohort | AUROC | AUPRC | ECE | Brier
Train | n=420,294 (pre-2020) | 0.9562 | 0.8616 | 0.0946 | 0.0919
Validation | n=170,716 | 0.8858 | 0.7836 | 0.1751 | 0.1692
Test (2020–2025) | n=158,732 | 0.8872 | 0.7680 | 0.1744 | 0.1669

Test AUROC (0.8872) is roughly 11 pp above the demographics-only Bate 2019 baseline (~0.78). ECE can be recovered via the platform's built-in capability-conditional abstain on the same test split; the raw ECE is reported here without that post-processing for transparency.
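For readers less familiar with the Brier column in the table above, the score is the mean squared difference between predicted probability and the 0/1 outcome. The sketch below uses hypothetical predictions and is not taken from the FAERS pipeline.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary outcomes."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Hypothetical predictions for four reports (1 = severe outcome).
probs = [0.9, 0.2, 0.7, 0.1]
labels = [1, 0, 0, 0]
print(f"Brier score: {brier_score(probs, labels):.4f}")
```

Lower is better; a model that always predicts 0.5 scores 0.25, which gives a rough reference point for reading the table.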

ACIC22 causal-inference challenge

V3 lock on the full 3,400-cohort canonical.

Track-2 substantial-equivalence claim (full canonical, V3 lock)

Metric | Value | Notes
Bias | +19.26 | per-cohort average
RMSE | 28.80 | full-cohort canonical
Coverage | 77.53% | 95% CI coverage of true treatment effect
Width | 78.04 | average CI width
Runtime | 40.86s | per dataset

Coverage is 77.53% on the full 3,400-cohort canonical, up from 7% on a flawed V1 lock (an 11× lift from three ensemble bug fixes). Position versus the leaderboard: not at DiConfounder pace (~+8 bias), but within the publishable methodological-comparison tier, which is useful for the second-method scorecard in any RWE paper.
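The four accuracy metrics in the table can be computed from per-dataset point estimates, confidence intervals, and true effects. The sketch below uses the conventional definitions, which are assumed here and not taken from the challenge's scoring code; the toy numbers are hypothetical.

```python
from math import sqrt


def interval_metrics(estimates, intervals, truths):
    """Bias, RMSE, CI coverage, and mean CI width across datasets.

    estimates: point estimates of the treatment effect
    intervals: (lower, upper) confidence bounds, one pair per estimate
    truths:    true treatment effects
    """
    n = len(truths)
    errors = [est - truth for est, truth in zip(estimates, truths)]
    covered = sum(lo <= truth <= hi for (lo, hi), truth in zip(intervals, truths))
    return {
        "bias": sum(errors) / n,
        "rmse": sqrt(sum(err ** 2 for err in errors) / n),
        "coverage": covered / n,
        "width": sum(hi - lo for lo, hi in intervals) / n,
    }


# Hypothetical results for four simulated datasets.
estimates = [10, 22, 28, 45]
truths = [12, 20, 30, 40]
intervals = [(5, 15), (18, 26), (20, 29), (35, 55)]
metrics = interval_metrics(estimates, intervals, truths)
print(metrics)
```

Coverage and width should be read together: intervals can always reach nominal coverage by being wide, so a coverage gain is only informative alongside the average width.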

DMPNN MoleculeNet (drug-discovery lane)

Chemprop v2 parity on scaffold splits.

Native PyTorch D-MPNN (3 seeds, Bemis-Murcko scaffold splits, no ensembling)

Benchmark | Rosenbound D-MPNN AUROC | Chemprop v2 reference | Within σ overlap
BBBP (blood-brain barrier penetration) | 0.9144 ± 0.0113 | 0.897 ± 0.012 | Yes
BACE (β-secretase inhibition) | 0.8861 ± 0.001 | 0.859 ± 0.024 | Yes
HIV (replication inhibition) | 0.7937 ± 0.0149 | 0.776 ± 0.020 | Yes

In-house PyTorch rewrite (approximately 500 lines, no Chemprop runtime dependency). 5-test gradient-check suite green. Published-baseline parity under stricter conditions than the reference paper.
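A "mean ± σ" entry over three seeds can be formed as sketched below. Whether the table uses the sample or the population standard deviation is not stated, so the sample form is an assumption, and the per-seed AUROC values are hypothetical.

```python
from statistics import mean, stdev


def summarize_seeds(aurocs):
    """Return (mean, sample standard deviation) of per-seed test AUROCs."""
    return mean(aurocs), stdev(aurocs)


# Hypothetical test AUROCs from three seeds on one scaffold split.
seed_aurocs = [0.90, 0.92, 0.91]
avg, sigma = summarize_seeds(seed_aurocs)
print(f"{avg:.4f} ± {sigma:.4f}")
```

With only three seeds the standard deviation is itself a noisy estimate, so small differences between two "mean ± σ" entries should be read with caution.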

HNSI clinical-NER

MIMIC-IV-Note v2.2 — 1.64M clinical notes processed.

NER + sectionizer + negation/context extraction (en_core_sci_md + medspacy)

Stage | Volume | Throughput | Coverage
W1 radiology corpus | 570K notes / 572 chunks / 141 MB output | 9.7 notes/sec | ~30M entities, 13–18% negation rate
W2 discharge + radiology | 1.07M notes / 1,071 chunks / 975 MB output | 9.7 notes/sec | ~50M+ entities
Total processed | 1.64M clinical notes | en_core_sci_md NER + medspacy_sectionizer + ConText | 80M+ entities with negation/historical/family attributes

Stage 1 cohort builder: 4.3 min via DuckDB. NER extraction throughput stable at 9.7 notes/sec on commodity hardware (numpy < 2.0 + spaCy 3.7.4 + thinc 8.2.5 + en_core_sci_md).

Source logs — raw training and evaluation logs for every benchmark above (MIMIC-IV W1/W2/W3/W5/W11, FAERS Pipeline B, ACIC22 Track-2 V3 lock, W12 sensitivity pentagon, DMPNN MoleculeNet, and the HNSI clinical-NER extraction) are retained as immutable run artifacts with SHA-pinned commit references in the VBSM ledger. Full log inventory and per-run provenance available to design partners under NDA on request.

Watch the product walkthrough at rosenbound.ai — three moments that define the platform: the Cognitive Validation Report refusing incoherent data, the live Γ-bound sensitivity visualization, and the reproducibility certificate generated on every study. The full platform stays gated for Founding Partners.


pip install rosenbound  —  Official Python SDK for programmatic access: cohort upload, sensitivity-bounded study runs, and reproducibility certificate retrieval. Apache 2.0; Pydantic v2 typed; py.typed for IDE autocomplete + mypy. Platform access gated by Bearer token + RBAC + tenant scoping — the SDK is open, the audit substrate is not.


Want these numbers on your own cohort?

Founding Partner Program includes a benchmark co-authorship clause: Rosenbound runs the full pentagon on your in-house cohort (under your IP terms) and the resulting methodology paper carries your team as co-authors.
