Benchmarks — Production-Locked Numbers on Full Corpora

MIMIC-IV v3.1 clinical prediction

Eight W-lanes locked on the full 546K-admit cohort.

Five-fold cross-validation plus a temporal test split (train on pre-2020 data, test on 2020 onward) over the full mimic4_full ingestion. The calibrator is auto-picked per workload, not per dataset family: Beta+Bayes or isotonic, whichever gives the lower validation ECE.
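
The evaluation protocol above can be sketched in a few lines. This is an illustrative outline only, not the pipeline's code: the `year` field, the function names, and the toy records are hypothetical stand-ins for however the cohort builder assigns a calendar year to each admission.

```python
import random

def temporal_split(records, cutoff_year=2020):
    """Split records into (train, test): train strictly before cutoff_year."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # every k-th shuffled index
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Toy cohort (hypothetical): 40 admissions spread over 2015..2022.
records = [{"id": i, "year": 2015 + i % 8} for i in range(40)]
train, test = temporal_split(records)
folds = list(k_fold_indices(len(train), k=5))
```

Cross-validation runs only inside the pre-cutoff partition, so the post-cutoff test set stays untouched until the final evaluation.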

Clinical prediction lanes (full corpus, 5-fold CV + temporal test split)

Lane	Outcome	Cohort	Test AUROC / c-index	Calibration / secondary metric	Published reference
W1	In-hospital mortality	n=546K admits	0.9476	Beta+Bayes ECE 0.0024	Tomasev 2019: 0.92
W2	30-day all-cause readmission	n=534K excl. deaths	0.7034	Isotonic ECE 0.0040	Rajkomar 2018: 0.75–0.76 (different cohort)
W3	Sepsis-3 onset (first 48h ICU)	n=74,829 ICU stays	0.8908	AUPRC 0.8710	Saria 0.83–0.85, Komorowski 0.85, Hyland 0.82–0.86
W5	KDIGO AKI Stage 1+ (first 48h)	n=75K ICU stays	0.8371	AUPRC 0.6804	Tomasev 2019: 0.82, Koyner 0.82
W11 (Cox)	Mortality time-to-event	n=546K admits	c-index 0.8019	IPCW c 0.6468 · IBS 0.0841	Cheng 2019 LSTM: 0.81–0.85
W11 (AFT)	Mortality time-to-event	n=546K admits	c-index 0.7707	IPCW c 0.6375	dual-comparator survival

All five lanes are locked under the per-workload auto-pick calibrator. The choice between Beta+Bayes and isotonic is driven by validation-set ECE, not preset.
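
A minimal sketch of ECE-driven calibrator selection follows. It assumes equal-width binning with 10 bins, which the page does not specify. The two candidate maps are toy stand-ins, not implementations of Beta+Bayes or isotonic calibration, and all function names are hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |positive rate - mean prob|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(pos_rate - mean_p)
    return ece

def pick_calibrator(candidates, val_probs, val_labels):
    """Return (name, ece) for the candidate map with the lowest validation ECE."""
    scored = {
        name: expected_calibration_error([f(p) for p in val_probs], val_labels)
        for name, f in candidates.items()
    }
    best_name = min(scored, key=scored.get)
    return best_name, scored[best_name]

# Toy validation set: an over-confident model (0.9 / 0.1 scores, 60% / 40% hit rates).
val_probs = [0.9] * 10 + [0.1] * 10
val_labels = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 6
candidates = {
    "identity": lambda p: p,                      # no recalibration
    "shrink": lambda p: 0.5 + 0.25 * (p - 0.5),   # toy stand-in for a fitted map
}
best, best_ece = pick_calibrator(candidates, val_probs, val_labels)
```

In a real run the candidate maps would be fitted on a calibration split and scored on held-out validation data, so the selection does not reward a calibrator for memorising its own fitting set.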

W12 anticoagulation sensitivity pentagon (heparin vs LMWH, MIMIC-IV ICU cohort, n=153K admits)

Estimator	Y_bleed ATT [95% CI]	Y_vte ATT [95% CI]	Direction
AIPW v3 (base)	+0.0253 [+0.0037, +0.0468]	+0.0766 [+0.0498, +0.1056]	Both positive (opposite of RCT)
AIPW v4 (severity-augmented)	+0.0271	+0.0794	Severity isn't the confounder
AIPW v5 (trajectory)	+0.0301	+0.0802	Trajectory isn't either
IV-LATE (2SLS, F=67,250)	+0.0625 [+0.048, +0.079]	+0.0102 [-0.002, +0.021] ✓ CI ∋ 0	IV recovers RCT non-inferiority on Y_vte
Neural counterfactual	+0.0255 (AIPW-equivalent)	deferred (M5 Pro 64GB)	Confirms unobservable confounder
Rosenbaum Γ-bound	Γ_zero = 1.06	Γ_zero = 1.17	Very sensitive (both)

Five methods, same cohort. None recovers the RCT direction on Y_bleed (Greinacher/Geerts RCTs); IV-LATE partially recovers it on Y_vte. Γ_zero = 1.06 means that an unmeasured confounder shifting the treatment odds ratio by only 6% would be enough to overturn the bleed estimate. This quantifies the confounding-by-indication signal, which is the open-confounder benchmark contribution.
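
To give a feel for how small these Γ values are, the sketch below uses the standard matched-pair reading of Rosenbaum's sensitivity model: if two units with identical observed covariates can differ in treatment odds by at most a factor Γ, the probability that a given member of the pair is the treated one lies between 1/(1+Γ) and Γ/(1+Γ). The function name is hypothetical, and this is an interpretation aid rather than the Γ-bound computation used for the table.

```python
def rosenbaum_pair_bounds(gamma):
    """Bounds on P(unit is the treated one in a matched pair) under odds ratio gamma."""
    if gamma < 1.0:
        raise ValueError("gamma must be >= 1 (gamma = 1 means no hidden bias)")
    return 1.0 / (1.0 + gamma), gamma / (1.0 + gamma)

# Gamma_zero values taken from the table above.
for outcome, gamma in [("Y_bleed", 1.06), ("Y_vte", 1.17)]:
    lo, hi = rosenbaum_pair_bounds(gamma)
    print(f"{outcome}: gamma={gamma:.2f} -> P(treated) in [{lo:.4f}, {hi:.4f}]")
```

At Γ = 1.06 the window is roughly 0.485 to 0.515, so a hidden bias that tilts a coin flip by about 1.5 percentage points is already enough to overturn the bleed estimate.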

W13 anticoagulation positive control (DOAC vs warfarin DR-ATT, AFib+CKD subgroup)

Outcome	ATT [95% CI]	Direction	RCT consistency
Y_stroke	+0.0084 [+0.0012, +0.0168]	DOAC marginally worse	RE-LY / ROCKET-AF / ARISTOTLE / ENGAGE-AF AFib+CKD subgroups
Y_bleed	-0.0396 [-0.0521, -0.0277]	DOAC ~27% relative reduction	RCT-consistent

Cohort n=8,990 → 4,220 after Crump-2009 trimming (53% dropped). First observational W-lane to recover the RCT direction on both outcomes. Results are robust to the trim threshold: across α = 0.05/0.10/0.15, the ATT moves by less than 0.004.
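
The trimming and estimation steps can be illustrated with a small sketch. It assumes the propensity score `e` and the control-outcome prediction `mu0` have already been estimated for each unit. The estimator is one common doubly robust form for the ATT, which may differ in detail from the W13 implementation, and the toy rows and field names are hypothetical.

```python
def crump_trim(rows, alpha=0.10):
    """Keep units whose propensity score lies in [alpha, 1 - alpha]."""
    return [r for r in rows if alpha <= r["e"] <= 1.0 - alpha]

def dr_att(rows):
    """Doubly robust ATT: mean over treated of (y - mu0), minus an
    odds-weighted correction e/(1-e) * (y - mu0) summed over controls."""
    n_treated = sum(r["t"] for r in rows)
    if n_treated == 0:
        raise ValueError("no treated units left after trimming")
    total = 0.0
    for r in rows:
        resid = r["y"] - r["mu0"]
        if r["t"] == 1:
            total += resid
        else:
            total -= r["e"] / (1.0 - r["e"]) * resid
    return total / n_treated

# Toy data (hypothetical): t = treatment, y = outcome, e = propensity, mu0 = E[Y | X, T=0].
rows = [
    {"t": 1, "y": 1.0, "e": 0.50, "mu0": 0.6},
    {"t": 1, "y": 0.0, "e": 0.50, "mu0": 0.2},
    {"t": 0, "y": 1.0, "e": 0.50, "mu0": 0.5},
    {"t": 0, "y": 0.0, "e": 0.50, "mu0": 0.5},
    {"t": 1, "y": 1.0, "e": 0.98, "mu0": 0.9},  # trimmed: e > 1 - alpha
    {"t": 0, "y": 0.0, "e": 0.02, "mu0": 0.1},  # trimmed: e < alpha
]

# Trim-sensitivity loop over the three thresholds named above.
att_by_alpha = {a: dr_att(crump_trim(rows, a)) for a in (0.05, 0.10, 0.15)}
```

Comparing the entries of `att_by_alpha` is the same kind of check as the Δ ATT figure quoted above: if the estimate barely moves as α changes, the result is not an artefact of where the overlap boundary was drawn.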

FAERS pharmacovigilance

Pipeline B severity classifier — full 20M-row corpus.

Severity classification (temporal split: pre-2020 train, 2020–2025 test)

Split	Cohort	AUROC	AUPRC	ECE	Brier
Train	n=420,294 (pre-2020)	0.9562	0.8616	0.0946	0.0919
Validation	n=170,716	0.8858	0.7836	0.1751	0.1692
Test (2020–2025)	n=158,732	0.8872	0.7680	0.1744	0.1669

Test AUROC is roughly 11 pp above the demographics-only Bate 2019 baseline (0.8872 vs ~0.78). The raw ECE is reported here without post-processing for transparency; it is recoverable on the same test split via the capability-conditional abstain built into the platform.

ACIC22 causal-inference challenge

V3 lock on the full 3,400-cohort canonical.

Track-2 substantial-equivalence claim (full canonical, V3 lock)

Metric	Value	Notes
Bias	+19.26	per-cohort average
RMSE	28.80	full-cohort canonical
Coverage	77.53%	95% CI coverage of true treatment effect
Width	78.04	average CI width
Runtime	40.86s	per dataset

Coverage is 77.53% on the full 3,400-cohort canonical set, up from 7% on a flawed V1 lock (an 11× lift from three ensemble bug fixes). Position versus the leaderboard: not at DiConfounder pace (bias of about +8) but within the publishable methodological-comparison tier, which makes it useful as the second-method scorecard in an RWE paper.
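
For readers unfamiliar with the four accuracy metrics in the table, here is a sketch of how they are conventionally computed from per-dataset truths, point estimates, and interval bounds. These are generic textbook definitions, assumed for illustration; the challenge's official scoring may aggregate differently, and the toy numbers are hypothetical.

```python
import math

def interval_metrics(results):
    """results: list of (truth, estimate, ci_lo, ci_hi) tuples, one per dataset."""
    n = len(results)
    errors = [est - truth for truth, est, _, _ in results]
    return {
        "bias": sum(errors) / n,                          # mean signed error
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "coverage": sum(lo <= truth <= hi for truth, _, lo, hi in results) / n,
        "width": sum(hi - lo for _, _, lo, hi in results) / n,
    }

# Toy example: four datasets with a true effect of 10 each.
toy = [
    (10.0, 12.0, 8.0, 16.0),
    (10.0, 9.0, 5.0, 13.0),
    (10.0, 15.0, 11.0, 19.0),  # interval misses the truth
    (10.0, 10.0, 6.0, 14.0),
]
metrics = interval_metrics(toy)
```

Coverage and width are best read together: an interval procedure can reach any coverage level by widening, so a coverage gain only counts as an improvement if width has not grown to match.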

DMPNN MoleculeNet (drug-discovery lane)

Chemprop v2 parity on scaffold splits.

Native PyTorch D-MPNN (3 seeds, Bemis-Murcko scaffold splits, no ensembling)

Benchmark	Rosenbound D-MPNN AUROC	Chemprop v2 reference	Within σ overlap
BBBP (blood-brain barrier penetration)	0.9144 ± 0.0113	0.897 ± 0.012	Yes
BACE (β-secretase inhibition)	0.8861 ± 0.001	0.859 ± 0.024	No (0.002 above the reference 1σ band)
HIV (replication inhibition)	0.7937 ± 0.0149	0.776 ± 0.020	Yes

In-house PyTorch rewrite (approximately 500 lines, no Chemprop runtime dependency). The 5-test gradient-check suite is green. Published-baseline parity under stricter conditions than the reference paper.
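
A scaffold split keeps every molecule that shares a Bemis-Murcko scaffold in the same partition, so the test set contains chemotypes the model has not seen. The sketch below shows the common deterministic variant (largest scaffold groups assigned first). It assumes the scaffold strings were computed beforehand, for example with RDKit, and it is not necessarily the exact splitter used for the numbers above. The function name and the 80/10/10 fractions are assumptions.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_val=0.1):
    """scaffolds: one scaffold string per molecule. Returns (train, val, test) index lists."""
    groups = defaultdict(list)
    for i, scaffold in enumerate(scaffolds):
        groups[scaffold].append(i)
    # Largest groups first; ties broken by first index so the split is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(val) + len(group) <= frac_val * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test

# Toy input: placeholder scaffold labels for 10 molecules.
scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train_idx, val_idx, test_idx = scaffold_split(scaffolds)
```

Because whole groups move together, the realised split fractions only approximate the requested ones, and small or rare scaffolds tend to end up in validation and test. That is part of why scaffold-split scores usually sit below random-split scores.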

HNSI clinical-NER

MIMIC-IV-Note v2.2 — 1.64M clinical notes processed.

NER + sectionizer + negation/context extraction (en_core_sci_md + medspacy)

Stage	Volume	Throughput	Coverage
W1 radiology corpus	570K notes / 572 chunks / 141 MB output	9.7 notes/sec	~30M entities, 13–18% negation rate
W2 discharge + radiology	1.07M notes / 1,071 chunks / 975 MB output	9.7 notes/sec	~50M+ entities
Total processed	1.64M clinical notes	en_core_sci_md NER + medspacy_sectionizer + ConText	80M+ entities with negation/historical/family attributes

Stage 1 cohort builder: 4.3 min via DuckDB. NER extraction throughput stable at 9.7 notes/sec on commodity hardware (numpy < 2.0 + spaCy 3.7.4 + thinc 8.2.5 + en_core_sci_md).
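
The chunk counts in the table (572 chunks for 570K notes, 1,071 for 1.07M) suggest roughly 1,000 notes per chunk. The sketch below assumes that chunk size and a single sequential stream to show how chunking and the wall-clock estimate fit together. Both assumptions are inferred from the reported figures rather than stated by the source, and the helper names are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def wall_clock_hours(n_notes, notes_per_sec):
    """Hours needed to process n_notes at a steady throughput."""
    return n_notes / notes_per_sec / 3600.0

# Toy run: 2,500 note ids in chunks of 1,000 (assumed chunk size).
note_ids = list(range(2_500))
chunks = list(chunked(note_ids, 1_000))

# Estimate for the full 1.64M-note corpus at the reported 9.7 notes/sec.
hours = wall_clock_hours(1_640_000, 9.7)
print(f"{len(chunks)} chunks; full corpus is about {hours:.1f} h single-stream")
```

Writing one output file per chunk also makes a long extraction resumable: a failed run can restart at the first chunk without output instead of from the beginning.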

Source logs — raw training and evaluation logs for every benchmark above (MIMIC-IV W1/W2/W3/W5/W11, FAERS Pipeline B, ACIC22 Track-2 V3 lock, W12 sensitivity pentagon, DMPNN MoleculeNet, and the HNSI clinical-NER extraction) are retained as immutable run artifacts with SHA-pinned commit references in the VBSM ledger. Full log inventory and per-run provenance available to design partners under NDA on request.

Production-locked numbers. Full corpora. No cherry-picked seeds.