Sentinel: ESG Safeguards Screen for Carbon-Credit DD

Featured Project · May 2026

A working v0.3 prototype for the question the carbon-credit DD stack hasn't answered yet: not "did the project remove the carbon," but "who's getting hurt, who's getting paid, who's going to sue." Sentinel screens an uncurated carbon project across seven cited signals: Indigenous-territory overlap, adverse news (with topic-and-claim co-occurrence to kill false positives), active litigation from the Sabin Center, NGO complaints, country forest-cover trend, TI CPI 2024 bands, and a four-item procedural FPIC checklist mapped to UN-REDD operational guidelines. The composite is a transparent rule, not a model. An LLM then drafts the Safeguards section of an IC memo from the cited evidence, with a grounding sniff that flags any term in the output that doesn't trace back to the panels.

→ 7 cited signals · ICVCM CCP #5 as the framework anchor · 94 / 94 tests passing · 15 / 15 unseen projects (10 edge cases + 5 blind tests) with 0 crashes · cold-detected Northern Rangelands Trust (Kenya) as RED 11 with three real adverse articles surfaced from Mongabay, Survival International, and FIDH

ICVCM CCP #5 Anchored | 7 Cited Signals | 94 / 94 Tests · 0 Crashes | Cold-Catch: NRT Kenya RED 11 | Standards Page · Provenance

Python | Flask | Next.js | TypeScript | Tailwind | OpenRouter LLM | GDELT · Google News · Sabin Center | World Bank · TI CPI · Native Land | ICVCM CCP #5 | UNDRIP · ILO 169 · Cancun Safeguards | FPIC Procedural Checklist | Grounding Sniff Failsafe

Private repo · available on request

See it in action

axiom.app/dashboard

Click a sample project: four data calls fan out in parallel, panels populate, and the composite verdict lands.

Walkthrough

Three short clips. Watch them in order and you’ve seen the whole product.

01 · Run a screen

Four parallel data calls fan out: Native Land, GDELT + Google News, Sabin Center, and the NGO ledger. The country-level Environmental and Governance panels and the FPIC checklist render alongside. The verdict header is a composite computed by a transparent rule.

02 · The IC-memo synthesizes

An LLM drafts the Safeguards section of an IC memo from the cited evidence above. A grounding sniff flags any term in the output that didn't appear in the source JSON. The model can also refuse with "Evidence is insufficient for a defensible memo."

03 · Same engine, opposite verdicts

Click the quick-compare pill bar. Cordillera Azul → RED HIGH. Mikoko Pamoja → GREEN LOW. Same engine, same query, same UI: the verdict is driven entirely by which evidence the engine surfaces.

Pipeline Architecture

Project name + coordinates → four parallel calls (Native Land · GDELT/Google News · Sabin Center · NGO ledger) → country-level Environmental + Governance + FPIC procedural checks → transparent composite rule → cited verdict → LLM drafts the IC-memo Safeguards section under grounding rules.

What’s actually happening at each stage

Each stage is explained twice: first for the finance reader, then for the engineer.

1. Eliminate Five Pain Points: Pick the One That Fits

Finance lens

Five candidate pain points went through five elimination rounds: time saved, product fit with Qatalyst's existing surface, buildability with free data, defensibility against B2B SaaS commoditisation, and narrative fit. Methodology drift fired too rarely. PDD parsing was already on Qatalyst's roadmap. Rater reconciliation needed paywalled Sylvera/Calyx data. IC-memo drafting was a feature, not a product. The survivor was the safeguards screen: the one Caroline Guyot had explicitly named at the company's launch, and the one no incumbent owned.

Engineering lens

Every elimination was anchored to either a free public data source the engine could actually call, or a specific product surface Qatalyst had already shipped. The work was as much in eliminating the wrong builds as in choosing the one to build.

2. Seven Cited Signals: Each Anchored to a Public Standard

Finance lens

ICVCM Core Carbon Principle #5, Sustainable development benefits and safeguards, is the framework anchor. Every Sentinel signal maps to a CCP #5 sub-criterion. Indigenous overlap and FPIC procedural checks anchor to UNDRIP Articles 10, 19, 32 and ILO Convention 169. Adverse news + NGO complaints anchor to CCP #5's "identify, prevent, mitigate adverse impacts." Litigation anchors to the Sabin Center registry (cited in IPCC AR6 WG3 Ch.15). Forest trend uses the World Bank AG.LND.FRST.ZS indicator with FAO's 0.3 pp/yr threshold for the RED band. CPI bands are anchored to Transparency International's own published global average of 43, not invented cut-offs.

Engineering lens

The composite score is a transparent rule (`services/score.py`), not a model: ≥ 8 RED, 3–7 AMBER, < 3 GREEN. Every threshold is in one file an analyst can read and tune in a single edit. Rule-based scoring is litigation-defensible in a way a "0.73 risk score" from a learned model isn't.
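
A minimal sketch of what such a banding rule could look like. Only the three cut-offs come from the text above; the function names, the per-signal point values, and the assumption that the composite is a plain sum are illustrative and are not taken from `services/score.py`.

```python
# Cut-offs as stated above: >= 8 RED, 3-7 AMBER, < 3 GREEN.
RED_MIN = 8
AMBER_MIN = 3


def band(score: int) -> str:
    """Map a composite score to a verdict band."""
    if score >= RED_MIN:
        return "RED"
    if score >= AMBER_MIN:
        return "AMBER"
    return "GREEN"


def composite(signal_points: dict[str, int]) -> tuple[int, str]:
    """Sum per-signal points and band the total.

    ASSUMPTION: summation and the point values passed in are hypothetical;
    the source only states the band thresholds.
    """
    total = sum(signal_points.values())
    return total, band(total)
```

Under those assumptions, `composite({"news": 3, "litigation": 4, "fpic": 4})` returns `(11, "RED")`, and moving a cut-off is a one-constant edit.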

3. The FPIC Procedural Checklist: Not a Keyword

Finance lens

Free, Prior and Informed Consent is a procedural standard from UNDRIP/ILO 169, not a binary flag. Mistaking it for a keyword was the first thing the v0.2 scorer got wrong (an article titled "Project secures FPIC" scored the same as "Project violates FPIC"). v0.3 turned FPIC into a four-item procedural checklist matching the UN-REDD operational guidelines: Consultation documented · Consent obtained or refusal respected · Operational grievance mechanism · Right to withdraw consent / no forced displacement.

Engineering lens

Each check returns pass / fail / insufficient / not applicable with a rationale and the evidence titles used. Fails draw from explicit news phrases ("without consultation," "consent withdrawn," "forcibly relocated"). Grievance fails when active NGO complaints or litigation are present (grievances have escalated outside the project's internal mechanism). Defaulting to "insufficient" when public signals don't answer is the honest call: a screen reads news; a DD analyst reads project documents.
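
The result shape described above might be sketched as follows. The four statuses, the rationale, the evidence titles, and the "without consultation" phrase come from the text; the class names, function names, and rationale strings are hypothetical, and only two of the four checks are shown.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    INSUFFICIENT = "insufficient"
    NOT_APPLICABLE = "not applicable"


@dataclass
class CheckResult:
    status: Status
    rationale: str
    evidence_titles: list[str] = field(default_factory=list)


# One phrase quoted in the text; a real list would presumably be longer.
CONSULTATION_FAIL_PHRASES = ("without consultation",)


def check_consultation(titles: list[str]) -> CheckResult:
    """Fail only on an explicit adverse phrase; otherwise admit we don't know."""
    hits = [t for t in titles
            if any(p in t.lower() for p in CONSULTATION_FAIL_PHRASES)]
    if hits:
        return CheckResult(Status.FAIL, "News reports missing consultation.", hits)
    return CheckResult(Status.INSUFFICIENT, "Public signals do not answer this check.")


def check_grievance(active_complaints: int, active_cases: int) -> CheckResult:
    """Fail when grievances have escalated outside the internal mechanism."""
    if active_complaints or active_cases:
        return CheckResult(
            Status.FAIL,
            "Active NGO complaints or litigation indicate external escalation.",
        )
    return CheckResult(Status.INSUFFICIENT, "No public signal on the grievance mechanism.")
```

Note that neither check ever returns a pass from silence alone; absence of adverse news maps to "insufficient".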

4. Blind-Tested Against 15 Unseen Projects

Finance lens

The engine was run against 15 projects it had never seen: 10 edge cases (Mai Ndombe DRC, Katingan Mentaya Indonesia, Suruí Brazil, plus 7 more including fictional clean controls, Unicode names, and mid-ocean and polar coordinates) and 5 blind tests. The blind tests were run twice: once with v0.3 strict topic+claim scoring, then again after a cascading-query-variant fix. Zero crashes across all 15. The fix unlocked the headline cold-catch: Northern Rangelands Trust (Kenya), heavily covered by NYT and Survival International but invisible on the first run, returned RED 11 with three real adverse articles surfaced from Mongabay, Survival International, and FIDH.

Engineering lens

Test surface: 94 unit + integration tests · 10 live edge-case projects · 5 blind-test projects · 0 crashes observed. Latency runs 5–30 seconds depending on the LLM synthesis step (data fetch alone is ~3s). Every external call has a graceful fallback; every result has a stable shape; every input is validated at the system boundary.
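
As an illustration of boundary validation for the edge cases mentioned (Unicode names, polar and mid-ocean coordinates), a sketch could look like this. The function name, the error messages, and the exact rules are assumptions; the source only says that inputs are validated at the system boundary.

```python
import math


def validate_input(name: str, lat: float, lon: float) -> tuple[str, float, float]:
    """Reject malformed input before any external call is made.

    Unicode names pass through untouched; polar and mid-ocean coordinates
    are valid input (they simply yield little evidence downstream).
    """
    name = name.strip()
    if not name:
        raise ValueError("project name is required")
    if not (math.isfinite(lat) and math.isfinite(lon)):
        raise ValueError("coordinates must be finite numbers")
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be within [-90, 90]")
    if not -180 <= lon <= 180:
        raise ValueError("longitude must be within [-180, 180]")
    return name, float(lat), float(lon)
```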

5. Honest by Default: Including the Seams

Finance lens

The showcase has a dedicated Standards page mapping every signal to its public anchor, and a Brief that names the v0.4 seams in plain English: Indigenous-overlap is dark for uncurated projects until a paid Native Land key is wired; the news layer is English-only and needs an LLM-per-article classifier to fully replace the keyword scorer; environmental + governance are country-level proxies (project-polygon GFW GLAD is on the v0.4 punchlist); no recency decay on news yet; no result cache yet. Owning the gaps is part of the artifact.

Engineering lens

Two repos: Flask + Python engine (the working tool) and Next.js static showcase (the public surface, deployed to Vercel). Engine has the 94-test suite, the FPIC module, and a STANDARDS.md doc that prints every threshold's provenance. Showcase has the four routes (Overview · Standards · Walkthrough · Brief) and three silent ~60s clips of the engine running end-to-end.

Methodology notes

Engine: Flask + Python, four parallel data calls per screen via `ThreadPoolExecutor`. Multi-tier resolution at every call site: live API → cached → empty with a coverage flag. Every external integration has a graceful fallback; every result is shape-stable.
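
The fan-out and the three-tier fallback could be sketched as below. `ThreadPoolExecutor` and the live → cached → empty order come from the text; the function names, the result keys, and the plain-dict cache are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def resolve(name: str, live: Callable[[], list], cache: dict[str, list]) -> dict:
    """Live API -> cached -> empty, always returning the same shape."""
    try:
        items = live()
        cache[name] = items
        return {"source": name, "items": items, "coverage": "live"}
    except Exception:
        # Broad on purpose in this sketch: any failure degrades, never crashes.
        if name in cache:
            return {"source": name, "items": cache[name], "coverage": "cached"}
        return {"source": name, "items": [], "coverage": "unknown"}


def run_screen(fetchers: dict[str, Callable[[], list]],
               cache: dict[str, list]) -> dict[str, dict]:
    """Run every data call in parallel and collect shape-stable results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(resolve, name, fn, cache)
                   for name, fn in fetchers.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Because `resolve` catches failures itself, `fut.result()` never raises here, and downstream panels can always read `items` and `coverage` without checking which tier answered.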

Adverse-news scoring is topic + claim co-occurrence. An article is "adverse" only if its title contains both a safeguards-topic keyword (Indigenous, FPIC, consent, carbon, forest…) AND an adverse-claim term (lawsuit, evicted, junk, fraud, suspended…). NGO-domain articles get a precision boost. This kills the v0.2 false positive where any title containing the word "indigenous" was flagged.
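
A minimal sketch of the co-occurrence test, using only the example keywords listed above (the full lists are elided in the source and are not filled in here). Whole-word matching and the function names are assumptions, and the NGO-domain precision boost is left out because its weighting is not stated.

```python
import re

# Example terms from the text; the real lists are longer.
TOPIC_TERMS = ("indigenous", "fpic", "consent", "carbon", "forest")
CLAIM_TERMS = ("lawsuit", "evicted", "junk", "fraud", "suspended")


def _has_any(title: str, terms: tuple[str, ...]) -> bool:
    """True if any term appears as a whole word in the title (an assumption)."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return any(term in words for term in terms)


def is_adverse(title: str) -> bool:
    """Adverse only when a safeguards topic AND an adverse claim co-occur."""
    return _has_any(title, TOPIC_TERMS) and _has_any(title, CLAIM_TERMS)
```

With this rule, a title such as "Indigenous leaders welcome new school" is not flagged (topic without claim), while "Indigenous families evicted from carbon project land" is.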

Cascading query variants for news retrieval: exact-quoted name → stripped-suffix variant → distinctive-core variant. This unlocked the Northern Rangelands Trust cold-detection: the initial exact-quoted query returned zero hits despite known major-outlet coverage; the stripped-suffix variant surfaced three.
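
One way the cascade could be written is sketched here. The three-step order comes from the text; the suffix and generic-word lists, the definition of "distinctive core" as the words left after dropping both, and the `search` callable are all hypothetical.

```python
from typing import Callable, Optional

# ASSUMPTION: illustrative word lists, not the engine's actual ones.
SUFFIXES = {"trust", "project", "initiative", "foundation", "programme"}
GENERIC = {"northern", "southern", "eastern", "western", "national", "the", "of"}


def query_variants(name: str) -> list[str]:
    """Exact-quoted name, then stripped-suffix, then distinctive-core."""
    words = name.split()
    stripped = list(words)
    while len(stripped) > 1 and stripped[-1].lower() in SUFFIXES:
        stripped.pop()
    core = [w for w in words if w.lower() not in SUFFIXES | GENERIC]

    candidates = [f'"{name}"']
    if stripped != words:
        candidates.append(" ".join(stripped))
    candidates.append(" ".join(core))

    variants: list[str] = []
    for query in candidates:
        if query and query not in variants:  # drop empties and duplicates
            variants.append(query)
    return variants


def cascade_search(name: str,
                   search: Callable[[str], list]) -> tuple[Optional[str], list]:
    """Try each variant in order and stop at the first one with hits."""
    for query in query_variants(name):
        hits = search(query)
        if hits:
            return query, hits
    return None, []
```

Stopping at the first variant that returns hits keeps the most specific query whenever it works and only broadens when it does not.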

LLM synthesis runs with `temperature 0.1`, a strict SYSTEM prompt enforcing ten anti-hallucination rules, and a post-output grounding sniff that flags any capitalised proper-noun-like token in the output absent from the evidence JSON. Falls back to "Evidence is insufficient for a defensible memo" when the evidence is too thin to defend an opinion.
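
The post-output grounding sniff might look roughly like this. The idea of flagging capitalised tokens absent from the evidence JSON is from the text; the regex, the small allowlist of sentence-initial words, and the case-insensitive substring check are assumptions.

```python
import json
import re

# Capitalised, proper-noun-like tokens of two or more characters.
TOKEN = re.compile(r"\b[A-Z]\w+\b")

# ASSUMPTION: a tiny allowlist so ordinary sentence openers are not flagged.
COMMON = {"The", "A", "An", "This", "That", "It", "No", "Evidence"}


def grounding_sniff(memo: str, evidence: dict) -> list[str]:
    """Return capitalised tokens in the memo that never occur in the evidence."""
    haystack = json.dumps(evidence, ensure_ascii=False).lower()
    flagged: list[str] = []
    for token in TOKEN.findall(memo):
        if token in COMMON or token in flagged:
            continue
        if token.lower() not in haystack:
            flagged.append(token)
    return flagged
```

A check like this is deliberately crude: it surfaces candidates for an analyst to review rather than proving a hallucination.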

Showcase is a separate Next.js static site (no backend, no env keys, every route prerenders), deployed to Vercel free tier with sub-second cold starts. Engine and showcase are intentionally separate repos so the showcase can be public while the engine ships behind whatever auth a production deployment needs.

What this isn’t (yet)

The honest limits. A page called “honest” with no limitations would be a credibility own-goal.

Indigenous-overlap layer is dark for uncurated projects

Native Land Digital deprecated its free tier in late 2025. The engine ships a bundled overlay for the four sample projects; for any new project, the territory signal returns "coverage: unknown" until a paid Native Land API key is wired. The showcase admits this explicitly.

News layer is English-only

GDELT + Google News are queried in English. A multilingual layer plus an LLM-per-article claim classifier is the v0.4 plan. The topic+claim scorer is precision-first; an LLM classifier would close the recall gap on languages and on adverse claims phrased without trigger words.

Environmental + Governance are country-level

World Bank forest-cover and TI CPI are country-level proxies. A REDD+ project sitting in 0.001% of Brazil's landmass shouldn't be judged by Brazil's national forest trend. v0.4 wires project-polygon GFW GLAD alerts via the IUCN WDPA + GFW APIs.

Verdicts are not yet reproducible run-over-run

Google News RSS shifts results constantly. The same project queried twice can return different evidence sets, which can swing a borderline verdict between AMBER and GREEN. v0.4 ships an in-memory cache keyed on (project, day) so the same project returns the same evidence within a session.
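
A sketch of the planned cache, assuming a module-level dict and a hypothetical `fetch` callable; only the (project, day) key comes from the text. Normalising the project name before keying is an added assumption.

```python
from datetime import date
from typing import Callable, Optional

# In-memory only: contents vanish when the process restarts.
_cache: dict[tuple[str, date], list] = {}


def cached_evidence(project: str,
                    fetch: Callable[[str], list],
                    today: Optional[date] = None) -> list:
    """Return the same evidence for the same project on the same day."""
    key = (project.strip().lower(), today or date.today())
    if key not in _cache:
        _cache[key] = fetch(project)
    return _cache[key]
```

The `today` parameter exists so the behaviour can be exercised deterministically; a second call on the same day never reaches the news source.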

Not a verdict tool

Sentinel is a first-pass screen that directs the analyst's attention. It does not replace the project workpaper, the IC memo, or the analyst's judgment. The Brief page is explicit on this. The clean-controls test (fictional projects in Italy and Tasmania) verified the engine does not invent risk where none exists. That is the most important property, and it holds.