Scarcity — Benchmark Findings Report
Date: 2026-05-11 (§39 engine-routed calibration re-run; §38 full-mode calibration results; §37 weakness audit; §36 typed validation v3 fixes; §35 real-data typed discovery validation; §34 synthetic GT validation suite; prior: 2026-04-26 v11 KEN)
Environment: Python 3.11.9 | numpy 2.3.5 | scipy 1.15.3 | Windows 11
Dataset: World Bank annual indicators — Kenya (KEN), Tanzania (TZA), Uganda (UGA), 1990–2023
Indicators: 19 macroeconomic series
Scripts: scripts/benchmark_proper.py, scripts/benchmark_comprehensive.py,
scripts/benchmark_reviewer.py, scripts/benchmark_economic_simulation.py,
scripts/experiment_east_africa_federation.py, scripts/benchmark_scientific_questions.py,
scripts/benchmark_harness.py (comprehensive 26-stage harness)
Artefacts: artifacts/meta/, artifacts/harness/
Contribution
Scarcity is a federated causal discovery system for streaming, data-scarce environments where supervised methods fail and centralised learning is infeasible. It discovers structural patterns incrementally as observations arrive — without requiring a full dataset upfront and without centralising data — and uses those patterns to drive policy simulation.
The primary contribution is a binary capability unlock: local evidence accumulation reaches confidence 0.205 (below the 0.25 simulation gate); federated evidence-sharing lifts confidence to 0.298 (above the gate), enabling shock propagation that is 91% directionally coherent with documented economic relationships from IMF and World Bank publications.
This is not a marginal improvement over a weaker model. Without federation, the PolicySimulator returns empty trajectories for all shocks. With federation, it produces economically meaningful shock propagation validated against macroeconomic theory. No supervised baseline achieves this: AR(1) and its variants are predictors, not discoverers; they have no simulation capability at any level of confidence.
System Architecture
Scarcity is a four-layer platform. The benchmark exercises the bottom two layers directly; the upper two are the operational consumers of what the benchmark validates.
┌─────────────────────────────────────────────────────────────────────┐
│ PRESENTATION LAYER │
│ K-SHIELD · Institution Portal · SENTINEL dashboards (Streamlit) │
├─────────────────────────────────────────────────────────────────────┤
│ INTELLIGENCE LAYER │
│ KShieldHub · EconomicGovernor · PulseSensor (15 SIGINT signals) │
│ KenyaCalibration · ScenarioTemplates · ScarcityBridge │
├─────────────────────────────────────────────────────────────────────┤
│ FOUNDATION LAYER ◄── benchmark targets this layer │
│ scarcity.engine OnlineDiscoveryEngine (15 hypothesis types) │
│ scarcity.federation FederationNode / FederationHub / baskets │
│ scarcity.simulation MultiSectorSFCEngine + IO structure (KNBS) │
│ scarcity.meta Reptile / MAML meta-learner + GlobalMetaMemory│
│ scarcity.governor DynamicResourceGovernor (DRG) │
│ scarcity.causal DoWhy causal identification │
├─────────────────────────────────────────────────────────────────────┤
│ DATA LAYER │
│ World Bank REST API · FRED · FederatedDatabases · StreamIngester │
└─────────────────────────────────────────────────────────────────────┘
A. OnlineDiscoveryEngine — hypothesis survival paradigm
The engine treats relationship discovery as a survival-of-the-fittest competition among hypotheses. Each hypothesis is a probabilistic model of one relationship between two variables.
Streaming rows
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ HypothesisPool (up to 15 types per variable pair) │
│ │
│ CausalHypothesis — Granger F-test; forward + backward │
│ Bayesian accumulators (α_fwd/β_fwd, │
│ α_bwd/β_bwd); direction set via │
│ F-ratio asymmetry guard (F_fwd/F_bwd │
│ ≥ 1.3); confidence = conf_fwd │
│ │
│ TemporalHypothesis — AR(1) autoregressive persistence │
│ CorrelationalHypothesis— Online Pearson; bidirectional signal │
│ MediationHypothesis — Two-stage Sobel test (X → M → Y) │
│ FunctionalHypothesis — Polynomial regression │
│ + 10 additional types (Equilibrium, Structural, Compositional, │
│ Competitive, Synergistic, Moderating, Probabilistic, │
│ Graph, Similarity, Logical) │
└─────────────────────────────────────────────────────────────────┘
│ each row: fit_step → evaluate → update Bayesian accumulators
▼
┌─────────────────────────────────────────────────────────────────┐
│ MetaController — hypothesis lifecycle state machine │
│ │
│ TENTATIVE ──► ACTIVE ──► DECAYING ──► DEAD │
│ │
│ Promotions: conf ≥ 0.25 AND evidence ≥ min_ev │
│ Kill condition: conf < 0.10 AND evidence > 20 │
│ BH-FDR at q=0.05: soft penalty (×0.92) on low-evidence hyps │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ HypothesisArbiter — one winner per (source, target) pair │
│ Sorted by (confidence, get_strength) descending │
│ Causal > Temporal > Correlational (type priority) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ HypergraphStore — knowledge graph with temporal decay │
│ Edges: (source, target, effect_size, confidence, stability) │
│ Simulation gate: confidence ≥ 0.25 → emitted to PolicySimulator│
└─────────────────────────────────────────────────────────────────┘
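The lifecycle rules above can be sketched as a small transition function. This is illustrative only: the function and state names are ours, not the real MetaController API, and the rule sending ACTIVE hypotheses below the gate to DECAYING is our assumption; the promotion and kill thresholds come from the diagram. `min_ev` has no stated value in this report, so the default here is hypothetical.

```python
def lifecycle_step(state, conf, evidence, min_ev=10):
    """Illustrative MetaController transition rules (thresholds per the
    diagram; DECAYING rule and min_ev default are assumptions)."""
    if state == "TENTATIVE" and conf >= 0.25 and evidence >= min_ev:
        return "ACTIVE"          # promotion gate
    if conf < 0.10 and evidence > 20:
        return "DEAD"            # kill condition
    if state == "ACTIVE" and conf < 0.25:
        return "DECAYING"        # assumed: slipped below the simulation gate
    return state
```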
Key internal algorithms:
| Algorithm | Where | Role in benchmark |
|---|---|---|
| Incremental Granger F-test (RLS) | CausalHypothesis.update() | Sets direction (+1/−1/0); primary signal |
| Bayesian accumulator (α/β) | CausalHypothesis | confidence = α_fwd / (α_fwd + β_fwd) |
| F-ratio asymmetry guard | relationships.py | F_fwd/F_bwd ≥ 1.3 before direction commit |
| BH-FDR correction (q=0.05) | discovery.py | Penalises low-evidence hypotheses (fix #1) |
| Live-direction override | relationships.py | Mini-buffer ≥ 15 live rows overrides pretrain direction (fix #2) |
| Page-Hinkley drift detection | vectorized_core.py | Resets coefficients on structural breaks |
| Thompson sampling (BanditRouter) | bandit_router.py | Exploration-exploitation over hypothesis types |
| Sobel mediation test | relationships_extended.py | X → M → Y indirect effects |
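The accumulator behind confidence = α_fwd / (α_fwd + β_fwd) can be sketched as a Beta posterior over "hypothesis supported" outcomes. The class and field names here are illustrative, and the uniform Beta(1, 1) prior is our assumption; the report only specifies the posterior-mean formula.

```python
class BetaAccumulator:
    """Sketch of the forward Bayesian accumulator: each row's evaluation
    increments alpha (supported) or beta (contradicted); confidence is the
    Beta posterior mean. Names and the Beta(1,1) prior are assumptions."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta

    def update(self, supported: bool):
        if supported:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def confidence(self):
        return self.alpha / (self.alpha + self.beta)
```

With this prior, three supporting rows and one contradicting row yield confidence 4/6 ≈ 0.667, already above the 0.25 simulation gate.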
B. Federation layer — basket-routed evidence sharing
The benchmark runs federation through FederationHub → FederationNode → per-basket engines.
This is distinct from FedAvg: nodes share observation rows, not model parameters.
┌───────────────────────────────────────────────────────────────┐
│ FederationHub │
│ ├─ register(node) │
│ ├─ broadcast(row, source_node_id) │
│ └─ sync_directions() — majority-vote CausalHypothesis dirs │
└──────────────────────┬────────────────────────────────────────┘
│ peer rows (trust-weighted, renormalised)
┌─────────────┼─────────────────┐
▼ ▼ ▼
FederationNode FederationNode FederationNode
KEN TZA UGA
│
│ per-basket isolated engines
├── basket: macro → OnlineDiscoveryEngine
├── basket: financial → OnlineDiscoveryEngine
├── basket: infrastructure → OnlineDiscoveryEngine
└── basket: human_capital → OnlineDiscoveryEngine
Basket routing (BasketRegistry) ensures cross-basket contamination is impossible: each
engine sees only the variables in its own sector schema, enforced at both pretrain() and
receive_peer() boundaries.
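The isolation guarantee amounts to filtering every incoming row against the basket's sector schema before any engine sees it. A minimal sketch, with an illustrative dict-based schema (the real BasketRegistry enforces this at pretrain() and receive_peer()):

```python
def route_to_baskets(row, schemas):
    """Sketch of basket routing: each basket engine receives only the
    columns in its own sector schema, so cross-basket contamination
    cannot occur. The schema mapping here is illustrative."""
    return {
        basket: {k: v for k, v in row.items() if k in cols}
        for basket, cols in schemas.items()
    }
```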
Peer renormalisation (fix #3): before feeding a peer row to own-country engines, the row is z-scored to peer-country scale then re-expressed in own-country scale using rolling-window mean/std (last 15 own observations), removing cross-country level differences while preserving relative moves.
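The renormalisation step above can be sketched as a two-stage rescaling: z-score the peer value against the peer's recent window, then re-express it on the own-country scale. The function name and the small epsilon guard are ours; the 15-observation window follows the text.

```python
import numpy as np

def renormalise_peer(peer_value, peer_window, own_window, w=15):
    """Sketch of peer renormalisation (fix #3): map a peer observation onto
    the own-country scale via rolling-window z-scoring. Illustrative names."""
    p = np.asarray(peer_window, dtype=float)[-w:]
    o = np.asarray(own_window, dtype=float)[-w:]
    z = (peer_value - p.mean()) / (p.std() + 1e-9)   # peer-scale z-score
    return o.mean() + z * o.std()                     # own-scale value
```

Level differences cancel while relative moves survive: a peer value one peer-std above its mean becomes one own-std above the own mean.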
begin_live_stream(): after pretraining, all hypothesis confidences are discounted by 50% and evidence counts are capped at 10, so live observations can revise pretrained directions without the MetaController kill condition firing prematurely.
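The reset described above can be sketched as a one-pass discount over the pretrained pool. The dict-based hypothesis records are illustrative; the 50% factor and evidence cap of 10 come from the text, and capping evidence at 10 keeps pretrained hypotheses out of reach of the kill condition (conf < 0.10 AND evidence > 20) until live rows accumulate.

```python
def discount_for_live_stream(hypotheses, factor=0.5, evidence_cap=10):
    """Sketch of the begin_live_stream() reset: halve pretrained confidences
    and cap evidence counts so live observations can revise directions
    before the kill condition can fire. Record format is illustrative."""
    for h in hypotheses:
        h["confidence"] *= factor
        h["evidence"] = min(h["evidence"], evidence_cap)
    return hypotheses
```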
C. Simulation engine — Stock-Flow Consistent economy
The PolicySimulator (and underlying MultiSectorSFCEngine) consumes the knowledge graph
produced by the engine. It propagates shocks forward through discovered relationships.
Discovered relationships (conf ≥ 0.25)
│
▼
ScarcityBridge.create_learned_economy()
│
▼
MultiSectorSFCEngine (4 SFC sectors: AGR / MFG / SRV / INFORMAL)
├─ production.py CES output function
├─ labor_market.py Wages + unemployment (Okun's Law)
├─ price_system.py CPI + import prices (Phillips Curve)
├─ government.py Fiscal block (taxes, debt, expenditure)
├─ monetary.py Taylor Rule + interest pass-through
├─ foreign.py Current account + FX
└─ banking.py Credit, CAR, NPL
│
▼
Shock propagation (5 steps from base state)
→ directional response per variable validated vs IMF/WB theory
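The gate-and-propagate idea can be sketched as a linear frontier pass over graph edges. This is only the shape of the computation, not the MultiSectorSFCEngine, which is nonlinear and stock-flow consistent; edge tuples and variable names here are hypothetical.

```python
def propagate_shock(edges, shock_var, shock_size, steps=5, gate=0.25):
    """Linear sketch of shock propagation: each step, every edge above the
    confidence gate passes effect_size * upstream change downstream.
    edges: (source, target, effect_size, confidence) tuples (illustrative)."""
    total = {shock_var: shock_size}
    frontier = {shock_var: shock_size}
    for _ in range(steps):
        nxt = {}
        for src, tgt, effect, conf in edges:
            if conf >= gate and src in frontier:
                nxt[tgt] = nxt.get(tgt, 0.0) + effect * frontier[src]
        for k, v in nxt.items():
            total[k] = total.get(k, 0.0) + v
        frontier = nxt                      # only newly reached changes propagate
    return total
```

Only edges at or above the 0.25 gate carry signal, which is why the sub-gate local-only condition yields empty trajectories.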
The IO structure (io_structure.py) bridges the KNBS 9-sector table to the SFC 4-sector model using the standard
IO aggregation formula. All column sums are below one (AGR=0.42, MFG=0.46, SRV=0.49), a sufficient condition for
Hawkins-Simon, so the Leontief inverse exists and is economically meaningful.
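The aggregation step can be sketched with toy matrices. This is the textbook formula A₄ = S A₉ W, where S sums detailed sectors into aggregates and W distributes aggregates back by output weights; the matrices, weights, and mapping below are illustrative, not the KNBS data.

```python
import numpy as np

def aggregate_io(A9, weights, mapping):
    """Sketch of standard IO aggregation A4 = S @ A9 @ W.
    mapping: detailed sector index -> aggregate index (illustrative).
    weights: output weights used to build the distribution matrix W."""
    S = np.zeros((4, 9))
    for detail, agg in mapping.items():
        S[agg, detail] = 1.0
    W = (S * weights).T                        # 9x4: weight where mapped
    W = W / W.sum(axis=0, keepdims=True)       # columns sum to one
    return S @ A9 @ W                          # 4x4 aggregated coefficients
```

When the aggregated column sums are all below one, `np.eye(4) - A4` is invertible and the Leontief inverse is well behaved.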
D. How the benchmark exercises the architecture
The discovery benchmark (scripts/benchmark_discovery.py) runs four conditions (A–D) that
directly stress-test specific architectural paths:
| Condition | Engine init | Peer data | Architecture path exercised |
|---|---|---|---|
| A. Cold-start, no fed | Fresh | None | Engine alone; all signal from 44 KEN rows |
| B. Cold-start + fed | Fresh | TZA+UGA | Hub broadcast + basket routing + peer renorm |
| C. Pretrained, no fed | SSA prior | None | begin_live_stream + live-direction override |
| D. Pretrained + fed | SSA prior | TZA+UGA | All paths; direction sync from hub |
The 70% conf-weighted sign accuracy target in conditions A/B directly measures whether the Bayesian accumulator + F-ratio asymmetry guard + BH-FDR pipeline produces directionally reliable hypotheses from 44 annual observations. The pretrained conditions (C/D) additionally test whether begin_live_stream + live-direction override can correct SSA-corpus direction inversion with only 44 live override observations.
1. What This Benchmark Tests
Primary claims (paper stands or falls on these)
| Claim | Section |
|---|---|
| C1. The nodes have genuinely non-IID data — FL prerequisite satisfied | §6 |
| C2. Federation is harmful with FedAvg but beneficial with Scarcity's evidence-sharing | §4, §9 |
| C3. Scarcity accumulates useful relationship evidence where all supervised baselines fail | §4, §12 |
Supporting claims
| Claim | Section |
|---|---|
| S1. Meta-learning warm-start accelerates new node onboarding | §8 |
| S2. Scarcity generalises to an unseen domain (Ethiopia) | §10 |
| S3. The DRG provides a quantifiable compute/accuracy trade-off | §11 |
| S4. Discovered relationships produce economically coherent shock propagation | §13 |
Characterisation findings (honest accounting, not claims)
| Finding | Section |
|---|---|
| Online engine does not outperform batch AR1 on point prediction | §7 |
| FL is harmful below 13 years of local data (cold-start threshold) | §9 |
| Buffer size does not affect annual-frequency results | §15C |
| Confidence ≠ statistical significance; 41% false positive rate on null data | §22 |
| Temporal ordering not detected; confidence measures pattern consistency | §22 |
| Only TEMPORAL hypothesis type confirmed at annual frequency (N≤34) | §17 |
2. Evaluation Protocol
Prediction accuracy — rolling leave-one-year-out:
For each year T from (start + 5) to 2023:
train on all years < T
predict year T, compute normalised MAE and R²
Normalisation: z-score per indicator; MAE < 1.0 means the model beats the naive z-score (mean) predictor.
No fold leakage: year T never appears in the training set. Normalisation statistics for scoring are computed on the held-out actuals only after all folds complete, so they inform neither training nor prediction. Oracle-AR1 uses the same temporal boundary as Local-AR1 (it pools all countries but trains only on rows with year < T).
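The protocol above can be sketched for a single indicator. The helper name and the `predictor` callable are illustrative; the only load-bearing details are the strict `year < T` training slice and scoring against held-out actuals.

```python
import numpy as np

def rolling_loyo(years, values, predictor, warmup=5):
    """Sketch of rolling leave-one-year-out: for each year T after the
    warm-up, fit on strictly earlier years and predict T. `predictor`
    maps a training array to a one-step forecast (illustrative)."""
    errors, actuals = [], []
    for i, _T in enumerate(years):
        if i < warmup:
            continue
        train = values[:i]                       # only years < T: no leakage
        errors.append(abs(predictor(train) - values[i]))
        actuals.append(values[i])
    scale = np.std(actuals) + 1e-9               # scoring stats from held-out actuals
    return float(np.mean(errors) / scale)
```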
Discovery quality (Scarcity only):
- conf@end — mean confidence of active hypotheses at stream end
- steps→0.25 — first step at which mean confidence crosses the simulation gate
Statistical rigour: 20 random seeds, mean ± std, 95% CI, Welch t-test (two-tailed), Cohen's d.
What seeds affect:
- RandomBaseline: seeded directly; predictions vary across seeds
- Synthetic data (dry-run): numpy.random seeded; data varies per seed
- AR1, FedAvg, Oracle, Scarcity: deterministic given fixed data; seed-invariant on real WB data
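The significance machinery is standard; a minimal sketch of the Welch t-test plus Cohen's d (the helper name is ours, and the pooled-std variant of d is an assumption about the exact formula used):

```python
import numpy as np
from scipy import stats

def compare_runs(a, b):
    """Sketch of the statistical protocol: two-tailed Welch t-test
    (unequal variances) and Cohen's d with a pooled std."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    t, p = stats.ttest_ind(a, b, equal_var=False)    # Welch's t-test
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    d = (a.mean() - b.mean()) / pooled               # Cohen's d
    return p, d
```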
3. Baselines
| Level | Method | Description |
|---|---|---|
| Trivial | Random | Predict U[min, max] |
| Trivial | Mean | Predict training mean |
| Standard | Local-AR1 | AR(1) per indicator, local data only (Hamilton 1994) |
| Stronger-still-fails | Ridge-Lag | Ridge regression on all 18 cross-variable lag-1 features |
| FL standard | FedAvg-AR1 | AR(1) + federated parameter averaging (McMahan et al. 2017) |
| Upper bound | Oracle-AR1 | AR(1) on pooled all-node data — not deployable |
| Proposed | Scarcity | Scarcity engine, cross-node evidence sharing |
Why AR(1) is the right supervised baseline
VAR requires N > k·p = 19 rows minimum; LSTM requires ~100+ sequences; ARIMA and Prophet degenerate on annual data. At N=5–24, AR(1) is the strongest numerically stable supervised baseline (Hamilton 1994).
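For concreteness, the AR(1) baseline amounts to a two-parameter OLS fit per indicator. A minimal sketch (function name illustrative; the actual script may differ in details such as handling of missing years):

```python
import numpy as np

def ar1_fit_predict(series):
    """Minimal AR(1) sketch: OLS of y_t on (1, y_{t-1}), then a
    one-step-ahead forecast. Numerically stable down to a handful of rows,
    which is why it survives where VAR and LSTM cannot."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    c, phi = coef
    return c + phi * y[-1]               # forecast for the next year
```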
Ridge-Lag validation (§4b): To confirm AR(1) is not a weak baseline choice, we add Ridge regression over all 18 cross-variable lag-1 features (α=10 regularisation). Despite being strictly more expressive than AR(1), Ridge-Lag produces MAE=1.026 vs AR(1)=0.860 at a mean of N=19 training rows with 18 features per indicator, i.e. 19.3% worse. This confirms that the p/n ratio (18 lag-1 features plus an intercept, against 5–24 rows) is the genuinely binding constraint, not the choice of AR(1) as baseline. More complex models fail harder at this sample size.
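Since the environment lists only numpy/scipy, the Ridge-Lag fit can be done in closed form, w = (XᵀX + αI)⁻¹Xᵀy, with centring so the intercept is not penalised. A sketch under those assumptions (the function name and toy inputs are ours; α=10 matches the text):

```python
import numpy as np

def ridge_lag_predict(X_lag, y, x_next, alpha=10.0):
    """Sketch of the Ridge-Lag baseline: closed-form ridge on cross-variable
    lag-1 features. Centre features and target so only slopes are shrunk."""
    mu_x, mu_y = X_lag.mean(axis=0), y.mean()
    Xc, yc = X_lag - mu_x, y - mu_y
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
    return mu_y + (x_next - mu_x) @ w
```

As α grows, the forecast shrinks toward the training mean, which is exactly the failure mode at 5–24 rows: the data cannot justify moving far from it.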
Modern FL variants: FedProx (Li et al. 2020) and SCAFFOLD (Karimireddy et al. 2020) are stronger FL variants but still average model parameters — they share FedAvg's structural failure mode in heterogeneous settings. They require larger datasets for a fair comparison.