Every metric on this page is from a production training run on the full data corpus — not a smoke-run inflation, not a sub-sampled subset, not a best-of-N seed pick. Test AUROC, AUPRC, ECE, coverage, and bias values are quoted directly from timestamped log files. Raw logs available to design partners under NDA.
Evaluation uses five-fold cross-validation plus a temporal test split (training data pre-2020, test data 2020 onward) on the mimic4_full ingestion. Each workload's calibrator is auto-picked between Beta+Bayes and isotonic using validation ECE; the choice is made per workload, not per dataset family.
| Lane | Outcome | Cohort | Test AUROC (c-index for W11) | Calibration / secondary metric | Published reference |
|---|---|---|---|---|---|
| W1 | In-hospital mortality | n=546K admits | 0.9476 | Beta+Bayes ECE 0.0024 | Tomasev 2019: 0.92 |
| W2 | 30-day all-cause readmission | n=534K excl. deaths | 0.7034 | Isotonic ECE 0.0040 | Rajkomar 2018: 0.75–0.76 (different cohort) |
| W3 | Sepsis-3 onset (first 48h ICU) | n=74,829 ICU stays | 0.8908 | AUPRC 0.8710 | Saria 0.83–0.85, Komorowski 0.85, Hyland 0.82–0.86 |
| W5 | KDIGO AKI Stage 1+ (first 48h) | n=75K ICU stays | 0.8371 | AUPRC 0.6804 | Tomasev 2019: 0.82, Koyner 0.82 |
| W11 (Cox) | Mortality time-to-event | n=546K admits | c-index 0.8019 | IPCW c 0.6468 · IBS 0.0841 | Cheng 2019 LSTM: 0.81–0.85 |
| W11 (AFT) | Mortality time-to-event | n=546K admits | c-index 0.7707 | IPCW c 0.6375 | dual-comparator survival |
All five lanes are locked under the per-workload auto-picked calibrator. The calibration choice (Beta+Bayes vs isotonic) is driven by validation-set ECE, not preset.
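The auto-pick rule can be sketched as follows. This is a minimal illustration, not the platform's implementation: it assumes ECE is computed with 10 equal-width bins (the page does not state the binning), and the function names and the toy validation scores are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: count-weighted mean of |event rate - mean prob| per bin."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    # Per bin: [sum of predicted probabilities, sum of labels, count]
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    n = len(probs)
    return sum(abs(s_y / c - s_p / c) * c / n for s_p, s_y, c in bins if c)


def pick_calibrator(
    val_probs_by_name: Dict[str, Sequence[float]],
    val_labels: Sequence[int],
    n_bins: int = 10,
) -> Tuple[str, float]:
    """Return the candidate calibrator with the lowest validation ECE."""
    scores = {
        name: expected_calibration_error(p, val_labels, n_bins)
        for name, p in val_probs_by_name.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy validation set: calibrated probabilities from two hypothetical candidates.
val_labels = [0, 0, 1, 1]
candidates = {
    "beta_bayes": [0.1, 0.2, 0.8, 0.9],
    "isotonic": [0.4, 0.4, 0.6, 0.6],
}
name, ece = pick_calibrator(candidates, val_labels)
print(name, round(ece, 4))
```

Running the selection once per workload, on that workload's own validation split, is what makes the choice per workload rather than per dataset family.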
| Estimator | Y_bleed ATT [95% CI] | Y_vte ATT [95% CI] | Direction |
|---|---|---|---|
| AIPW v3 (base) | +0.0253 [+0.0037, +0.0468] | +0.0766 [+0.0498, +0.1056] | Both positive (opposite of RCT) |
| AIPW v4 (severity-augmented) | +0.0271 | +0.0794 | Severity isn't the confounder |
| AIPW v5 (trajectory) | +0.0301 | +0.0802 | Trajectory isn't either |
| IV-LATE (2SLS, F=67,250) | +0.0625 [+0.048, +0.079] | +0.0102 [-0.002, +0.021] ✓ CI includes 0 | IV recovers RCT non-inferiority on Y_vte |
| Neural counterfactual | +0.0255 (AIPW-equivalent) | deferred (M5 Pro 64GB) | Confirms unobservable confounder |
| Rosenbaum Γ-bound | Γ_zero = 1.06 | Γ_zero = 1.17 | Very sensitive (both) |
Five methods, same cohort. None recovers the RCT direction on Y_bleed (Greinacher/Geerts RCTs); IV-LATE partially recovers it on Y_vte. A Γ_zero of 1.06 means that a 6% odds-ratio shift from an unmeasured confounder is enough to flip the bleed estimate. This quantitatively confirms the confounding-by-indication signal, which is the open-confounder benchmark contribution.
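For readers unfamiliar with Γ-bounds, here is a minimal sketch of the textbook Rosenbaum bound for matched pairs with a binary outcome (a sign-test / McNemar-style version). It is a swapped-in illustration: the page does not say how Γ_zero is computed for the AIPW estimates, and the function names and the toy counts (60 of 100 discordant pairs) are hypothetical. Under hidden bias of at most Γ, the chance that the treated unit is the one with the event in a discordant pair is at most Γ/(1+Γ), which gives a worst-case p-value; Γ_zero is taken here as the smallest Γ at which that p-value exceeds α.

```python
import math
from typing import Optional


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def rosenbaum_upper_p(treated_events: int, discordant_pairs: int, gamma: float) -> float:
    """Worst-case one-sided p-value when hidden bias is bounded by gamma >= 1."""
    if gamma < 1.0:
        raise ValueError("gamma must be >= 1")
    p_plus = gamma / (1.0 + gamma)  # upper bound on P(treated unit has the event)
    return binom_upper_tail(treated_events, discordant_pairs, p_plus)


def gamma_zero(
    treated_events: int,
    discordant_pairs: int,
    alpha: float = 0.05,
    step: float = 0.01,
    gamma_max: float = 5.0,
) -> Optional[float]:
    """Smallest grid value of gamma at which the worst-case p-value exceeds alpha."""
    n_steps = int(round((gamma_max - 1.0) / step))
    for j in range(n_steps + 1):
        gamma = 1.0 + j * step
        if rosenbaum_upper_p(treated_events, discordant_pairs, gamma) > alpha:
            return gamma
    return None  # still significant at gamma_max


# Toy example: 100 discordant pairs, treated unit has the event in 60 of them.
p_no_bias = rosenbaum_upper_p(60, 100, 1.0)
g0 = gamma_zero(60, 100)
print(round(p_no_bias, 4), g0)
```

A Γ_zero close to 1, as in the table above, means the finding survives almost no hidden bias.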
| Outcome | ATT [95% CI] | Direction | RCT consistency |
|---|---|---|---|
| Y_stroke | +0.0084 [+0.0012, +0.0168] | DOAC marginally worse | RE-LY / ROCKET-AF / ARISTOTLE / ENGAGE-AF AFib+CKD subgroups |
| Y_bleed | -0.0396 [-0.0521, -0.0277] | DOAC ~27% relative reduction | RCT-consistent |
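The cohort shrinks from n=8,990 to 4,220 after Crump (2009) propensity trimming (53% dropped). This is the first observational W-lane to recover the RCT direction on both outcomes. The result is robust to the trim threshold: across α = 0.05, 0.10, and 0.15, the ATT changes by less than 0.004.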
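The trimming step itself is simple to sketch: units whose estimated propensity score falls outside [α, 1 − α] are dropped before the ATT is estimated. The snippet below is an illustration under that assumption only; the propensity values are toy data and the function name is hypothetical.

```python
from typing import List, Sequence


def crump_trim(propensities: Sequence[float], alpha: float = 0.10) -> List[int]:
    """Return indices of units whose propensity score lies in [alpha, 1 - alpha]."""
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must be in (0, 0.5)")
    return [i for i, e in enumerate(propensities) if alpha <= e <= 1.0 - alpha]


# Toy propensity scores; in practice these come from a fitted propensity model.
scores = [0.02, 0.07, 0.12, 0.50, 0.88, 0.93, 0.99]

# Re-run the trim at each threshold used in the sensitivity check.
kept_by_alpha = {a: crump_trim(scores, a) for a in (0.05, 0.10, 0.15)}
for a, kept in kept_by_alpha.items():
    dropped = 1.0 - len(kept) / len(scores)
    print(f"alpha={a:.2f}: kept {len(kept)} units, dropped {dropped:.0%}")
```

A trim-sensitivity check then re-estimates the ATT on each kept subset and compares the results.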
| Split | Cohort | AUROC | AUPRC | ECE | Brier |
|---|---|---|---|---|---|
| Train | n=420,294 (pre-2020) | 0.9562 | 0.8616 | 0.0946 | 0.0919 |
| Validation | n=170,716 | 0.8858 | 0.7836 | 0.1751 | 0.1692 |
| Test (2020–2025) | n=158,732 | 0.8872 | 0.7680 | 0.1744 | 0.1669 |
Test AUROC (0.8872) is approximately 11pp above the demographics-only Bate 2019 baseline (~0.78). ECE is recoverable via capability-conditional abstain (built into the platform) on the same test split; the raw ECE is reported here without that post-processing for transparency.
| Metric | Value | Notes |
|---|---|---|
| Bias | +19.26 | per-cohort average |
| RMSE | 28.80 | full-cohort canonical |
| Coverage | 77.53% | 95% CI coverage of true treatment effect |
| Width | 78.04 | average CI width |
| Runtime | 40.86s | per dataset |
Coverage is 77.53% on the full 3,400-cohort canonical (up from 7% on a flawed V1 lock, an 11× lift from three ensemble bug fixes). Relative to the leaderboard, this is not at DiConfounder pace (~+8 bias) but it sits within the publishable methodological-comparison tier, which makes it useful as the second-method scorecard in any RWE paper.
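The scorecard metrics above follow standard definitions, sketched here on toy numbers. The assumption is that each cohort contributes one point estimate, one 95% interval, and a known true effect; the function name and the data are hypothetical.

```python
import math
from typing import Dict, Sequence


def scorecard(
    estimates: Sequence[float],
    lowers: Sequence[float],
    uppers: Sequence[float],
    truths: Sequence[float],
) -> Dict[str, float]:
    """Bias, RMSE, interval coverage, and mean interval width across cohorts."""
    n = len(truths)
    if n == 0 or not (len(estimates) == len(lowers) == len(uppers) == n):
        raise ValueError("all inputs must be non-empty and the same length")
    errors = [est - t for est, t in zip(estimates, truths)]
    covered = sum(1 for lo, hi, t in zip(lowers, uppers, truths) if lo <= t <= hi)
    return {
        "bias": sum(errors) / n,                             # mean signed error
        "rmse": math.sqrt(sum(e * e for e in errors) / n),   # root mean squared error
        "coverage": covered / n,                             # share of CIs containing the truth
        "width": sum(hi - lo for lo, hi in zip(lowers, uppers)) / n,
    }


# Four toy cohorts; the last interval misses its true effect.
metrics = scorecard(
    estimates=[10.0, 20.0, 30.0, 40.0],
    lowers=[5.0, 15.0, 28.0, 32.0],
    uppers=[15.0, 25.0, 35.0, 45.0],
    truths=[8.0, 22.0, 30.0, 30.0],
)
print(metrics)
```

A well-calibrated 95% interval should cover the truth in roughly 95% of cohorts, so coverage below that level signals intervals that are too narrow, biased, or both.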
| Benchmark | Rosenbound D-MPNN AUROC | Chemprop v2 reference | Within σ overlap |
|---|---|---|---|
| BBBP (blood-brain barrier penetration) | 0.9144 ± 0.0113 | 0.897 ± 0.012 | Yes |
| BACE (β-secretase inhibition) | 0.8861 ± 0.001 | 0.859 ± 0.024 | Yes |
| HIV (replication inhibition) | 0.7937 ± 0.0149 | 0.776 ± 0.020 | Yes |
In-house PyTorch rewrite (approximately 500 lines, no Chemprop runtime dependency). 5-test gradient-check suite green. Published-baseline parity under stricter conditions than the reference paper.
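The page does not define the "within σ overlap" criterion. One plausible reading, sketched below as an assumption, is that the two mean ± k·σ intervals intersect. The function name is hypothetical; the inputs are the table's figures as quoted.

```python
def within_sigma_overlap(
    mean_a: float, sd_a: float, mean_b: float, sd_b: float, k: float = 1.0
) -> bool:
    """True if [mean_a - k*sd_a, mean_a + k*sd_a] intersects the same band around mean_b."""
    return abs(mean_a - mean_b) <= k * (sd_a + sd_b)


# (benchmark, D-MPNN mean, D-MPNN sd, reference mean, reference sd)
rows = [
    ("BBBP", 0.9144, 0.0113, 0.897, 0.012),
    ("BACE", 0.8861, 0.001, 0.859, 0.024),
    ("HIV", 0.7937, 0.0149, 0.776, 0.020),
]
overlap_1sigma = {name: within_sigma_overlap(a, sa, b, sb, k=1.0) for name, a, sa, b, sb in rows}
overlap_2sigma = {name: within_sigma_overlap(a, sa, b, sb, k=2.0) for name, a, sa, b, sb in rows}
print(overlap_1sigma)
print(overlap_2sigma)
```

Under this strict 1σ reading, BBBP and HIV overlap while the BACE intervals miss by about 0.002; all three overlap at 2σ. The table's criterion is therefore presumably somewhat looser than strict 1σ interval intersection.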
| Stage | Volume | Throughput | Coverage |
|---|---|---|---|
| W1 radiology corpus | 570K notes / 572 chunks / 141 MB output | 9.7 notes/sec | ~30M entities, 13–18% negation rate |
| W2 discharge + radiology | 1.07M notes / 1,071 chunks / 975 MB output | 9.7 notes/sec | ~50M+ entities |
| Total processed | 1.64M clinical notes | en_core_sci_md NER + medspacy_sectionizer + ConText | 80M+ entities with negation/historical/family attributes |
Stage 1 cohort builder: 4.3 min via DuckDB. NER extraction throughput stable at 9.7 notes/sec on commodity hardware (numpy < 2.0 + spaCy 3.7.4 + thinc 8.2.5 + en_core_sci_md).
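For capacity planning, the quoted throughput translates into wall-clock time as follows. This is back-of-envelope arithmetic that assumes a single sequential worker at a constant 9.7 notes/sec; the page does not state the degree of parallelism, and the function name is hypothetical.

```python
def hours_to_process(n_notes: int, notes_per_sec: float = 9.7) -> float:
    """Wall-clock hours for one sequential worker at a constant throughput."""
    return n_notes / notes_per_sec / 3600.0


radiology_hours = hours_to_process(570_000)      # W1 radiology corpus
discharge_hours = hours_to_process(1_070_000)    # W2 discharge + radiology
total_hours = hours_to_process(1_640_000)        # all processed notes

# The chunk counts in the table imply roughly 1,000 notes per chunk.
notes_per_chunk = 570_000 / 572

print(f"W1: {radiology_hours:.1f} h, W2: {discharge_hours:.1f} h, total: {total_hours:.1f} h")
print(f"about {notes_per_chunk:.0f} notes per chunk")
```

At that rate, the full 1.64M-note corpus is roughly a two-day job on one worker.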
Watch the product walkthrough at rosenbound.ai — three moments that define the platform: the Cognitive Validation Report refusing incoherent data, the live Γ-bound sensitivity visualization, and the reproducibility certificate generated on every study. The full platform stays gated for Founding Partners.
pip install rosenbound
Official Python SDK for programmatic access: cohort upload, sensitivity-bounded study runs, and reproducibility certificate retrieval. Apache 2.0; Pydantic v2 typed; py.typed for IDE autocomplete + mypy. Platform access is gated by Bearer token + RBAC + tenant scoping: the SDK is open, the audit substrate is not.
Founding Partner Program includes a benchmark co-authorship clause: Rosenbound runs the full pentagon on your in-house cohort (under your IP terms) and the resulting methodology paper carries your team as co-authors.