
# SUB-019: Cosmos<->Cosmo Revision Divergence report + SETT Eval Quality Pulse

> Two HTML artifacts Sergey sent, ingested 2026-06-07 from logged-in saves in ~/Downloads (both are auth-gated behind @sett.ai SSO, so they were fetched while authenticated).
>
> 1. Revision Count Divergence: https://sett-internal-decks.vercel.app/reports/revision-divergence/index.html (snapshot 2026-06-05, current week 2026-06-01, read-only prod).
> 2. SETT Eval / Quality Pulse: https://sett-eval-tiktok.vercel.app/pulse (live, updated daily).
>
> Local saves: ~/Downloads/Cosmos ↔ Cosmo — Revision Count Divergence.html (+ _files/), ~/Downloads/SETT Eval — Rate Playables.html (+ _files/).
>
> Significance: the divergence report is the SQL-level resolution of our pain cluster C2 "Numbers nobody trusts" (the 34/104/181/244 revision mismatch). The Pulse is a NET-NEW quality-measurement surface we had no record of.

---

## PART 1: Revision Count Divergence (resolves C2)

What it is: an item-by-item reconciliation of why the Cosmos board and the Cosmo delivery page report different revision counts. Every divergent orchestration listed + linked, with how it shifts each side's count, plus a per-problem SQL validation rule, plus a reusable agent prompt to re-run the check on a schedule.

### The core finding: they count DIFFERENT OBJECTS

- Cosmo delivery page counts ORCHESTRATIONS. A revision = revision_count > 0 in an active stage (apollo/build/ideation), minus dead/cancelled, minus build/completed. No quota filter, no week filter. (get_revisions_in_progress + get_revisions_pending, delivery_repository.py.)
- Cosmos board counts QUOTAS. A revision = a quota of type='revision' on the current week, not completed/unfillable. Quota status is rolled up from its Apollo task, and one quota can bundle multiple orchs. (quotas WHERE type='revision' AND assigned_week=current.)
- They will NEVER match on their own. To compare them, you must normalize to one grain (orchestrations with revision_count > 0 on the current week).
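
A minimal sketch of the two counting rules and the shared grain, over in-memory records. The field names (`stage`, `status`, `quota_id`, `assigned_week`) and status values are assumptions for illustration; the real logic lives in `get_revisions_in_progress` / `get_revisions_pending` and the quota query, which this does not reproduce.

```python
from dataclasses import dataclass
from typing import Optional

ACTIVE_STAGES = {"apollo", "build", "ideation"}
DEAD_STATUSES = {"dead", "cancelled"}  # assumed spellings


@dataclass
class Orch:
    id: int
    revision_count: int
    stage: str
    status: str
    quota_id: Optional[int] = None


@dataclass
class Quota:
    id: int
    type: str
    assigned_week: str  # ISO date of the week start
    status: str


def cosmo_counts(o: Orch) -> bool:
    """Cosmo rule: orchestration grain, no quota filter, no week filter."""
    if o.revision_count <= 0 or o.stage not in ACTIVE_STAGES:
        return False
    if o.status in DEAD_STATUSES:
        return False
    return not (o.stage == "build" and o.status == "completed")


def cosmos_counts(q: Quota, current_week: str) -> bool:
    """Cosmos rule: quota grain, current week, not completed/unfillable."""
    return (
        q.type == "revision"
        and q.assigned_week == current_week
        and q.status not in {"completed", "unfillable"}
    )


def normalized_ids(orchs, quotas, current_week):
    """Shared grain: orchs with revision_count > 0 on a current-week quota."""
    current = {q.id for q in quotas if q.assigned_week == current_week}
    return [o.id for o in orchs if o.revision_count > 0 and o.quota_id in current]
```

Counting `cosmo_counts` over orchestrations and `cosmos_counts` over quotas gives two numbers with different units; only `normalized_ids` puts both sides on the same grain.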

### The reconciliation (snapshot 2026-06-05)


    Cosmo delivery page:        112 revision orchestrations (all weeks)
    Cosmos board (orchs):        97 orchs in current-week revision quotas
    Cosmos board (quotas):       94 revision quotas (current week, not done)

    112 Cosmo  = 92 shared + 20 Cosmo-only
     97 Cosmos = 92 shared +  5 Cosmos-only
     97 orchs -> 94 quotas via bundling (-3)

92 revisions are counted identically by both (the baseline). The gap is fully explained by the 5 named problems below.
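
The identity can be checked mechanically. This sketch hard-codes the 2026-06-05 snapshot figures and the per-problem effects (P1-P5) from the problems table; the variable and function names are illustrative, not taken from the report.

```python
# Figures copied from the 2026-06-05 snapshot.
COSMO_TOTAL = 112    # revision orchestrations, all weeks
COSMOS_ORCHS = 97    # orchs in current-week revision quotas
COSMOS_QUOTAS = 94   # revision quotas, current week, not done
SHARED = 92          # counted identically by both sides

COSMO_ONLY = {"P1": 11, "P2": 6, "P3": 3}
COSMOS_ONLY = {"P4": 5}
BUNDLING_DELTA = -3  # P5: quotas holding more than one orch


def identity_closes() -> bool:
    """True when every count decomposes into shared + named problems."""
    return (
        COSMO_TOTAL == SHARED + sum(COSMO_ONLY.values())
        and COSMOS_ORCHS == SHARED + sum(COSMOS_ONLY.values())
        and COSMOS_QUOTAS == COSMOS_ORCHS + BUNDLING_DELTA
    )
```

If any count moves without a matching change in one of the problem buckets, `identity_closes()` returns `False`, which is the "identity doesn't close" condition.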


### The 5 problems that cause divergence

| # | Problem | Effect | Mechanism |
|---|---------|--------|-----------|
| P1 | Detached revisions: no quota | Cosmo +11 / Cosmos +0 | Real revision (revision_count>0) with quota_id IS NULL. Live in an active stage, so Cosmo counts it; sits on no quota, so Cosmos never shows it. Fell off via a rollup sibling-detach that never reattached, a deleted/archived quota, or pre-quota-system orphans. 5 are stale zombies untouched since before May. |
| P2 | Revision advanced to build, quota flipped to completed | Cosmo +6 / Cosmos +0 | Apollo task succeeded -> build kicked off -> rollup marked quota completed (succeeded->completed, stays through build). Cosmos drops it (not-done board); orch is build/queued or build/in_progress, so Cosmo still counts it. This is the count moving in real time (orch 4547 crossed mid-snapshot). |
| P3 | Revision on a prior-week quota | Cosmo +3 / Cosmos +0 | Quota completed + stuck on an earlier week (2026-05-11). Apollo finished weeks ago (apollo/completed) but never advanced to build/delivery. Cosmo is all-weeks, so it counts; the current-week Cosmos board doesn't. 2 of 3 also sit in version quotas. |
| P4 | Fake revision quotas: version tasks with no revision | Cosmo +0 / Cosmos +5 | A type='revision' quota holding a version Apollo task with revision_count=0: no revision behind it. Cosmos counts the quota as a revision; Cosmo (keys off revision_count>0) does not. 5 orchs across 2 quotas (1365, 890), all owned by the same PG. |
| P5 | Bundling: quota holds >1 orch | Cosmo +0 / Cosmos quotas -3 | Pure unit conversion: Cosmos counts the quota once, Cosmo counts each orch. The 97->94 gap. Right now the only bundled quotas are the fake ones from P4. |
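
As a rough illustration of how P1-P4 partition the divergent orchestrations, the sketch below classifies one orchestration given its quota (or `None`). It is not the report's per-problem SQL. The dict keys and status values are assumptions, and the caller is assumed to have already applied the relevant page filter. P5 is a unit conversion rather than a per-orch class, so it is not returned.

```python
from typing import Optional


def classify(orch: dict, quota: Optional[dict], current_week: str) -> str:
    """Return 'P1'..'P4', 'shared', or 'other' for one orchestration."""
    has_revision = orch["revision_count"] > 0
    if quota is None:
        # P1: a real revision that sits on no quota.
        return "P1" if has_revision else "other"
    on_current_week = quota["assigned_week"] == current_week
    if not has_revision:
        # P4: revision-typed quota with no revision behind it.
        return "P4" if quota["type"] == "revision" and on_current_week else "other"
    if quota["assigned_week"] < current_week:  # ISO dates sort lexicographically
        return "P3"
    if (
        quota["status"] == "completed"
        and orch["stage"] == "build"
        and orch["status"] in {"queued", "in_progress"}
    ):
        return "P2"
    if quota["type"] == "revision" and on_current_week:
        return "shared"
    return "other"
```

Anything that lands in `"other"` is outside the five named problems and would need a human look.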


### Data-integrity caveats (don't change the count, change interpretation)

Do NOT use orchestration.updated_at for "last activity". A bulk job on 2026-05-27 06:39 bumped 45 orchs' updated_at (121 that day) with no orchestration_states row (SQLAlchemy onupdate fires on any row write). Real signal = max(orchestration_states.created_at).
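
A sketch of the "real signal" query, using the standard-library `sqlite3` module so it runs anywhere. The table names follow the note, but the join column `orchestration_id` and the column types are assumptions about the schema.

```python
import sqlite3


def last_activity(conn: sqlite3.Connection) -> dict:
    """Map orchestration id -> latest orchestration_states.created_at.

    Orchestrations with no state rows map to None, rather than falling
    back to orchestration.updated_at (which bulk writes can bump).
    """
    rows = conn.execute(
        """
        SELECT o.id, MAX(s.created_at)
        FROM orchestration AS o
        LEFT JOIN orchestration_states AS s
               ON s.orchestration_id = o.id
        GROUP BY o.id
        """
    ).fetchall()
    return dict(rows)
```

The `LEFT JOIN` keeps orchestrations that have no state rows at all, so a bulk-touched orch shows up with `None` instead of a misleading recent timestamp.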

revision_count > 0 with no feedback row. Some orchs carry revision_count=1 with zero rejection/clarification feedback behind it (e.g. detached 1923, prior-week 1461): the count was incremented without a triggering feedback. Any check that trusts revision_count as ground truth must expect these.
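
One way to surface these, sketched over plain Python data. The input shapes (a dict of revision counts and a list of `(orch_id, kind)` feedback rows) are assumptions for illustration, not the real schema.

```python
TRIGGERING_KINDS = {"rejection", "clarification"}


def revisions_without_feedback(revision_counts: dict, feedback_rows: list) -> list:
    """Return sorted orch ids with revision_count > 0 and no triggering feedback."""
    with_feedback = {
        orch_id for orch_id, kind in feedback_rows if kind in TRIGGERING_KINDS
    }
    return sorted(
        orch_id
        for orch_id, count in revision_counts.items()
        if count > 0 and orch_id not in with_feedback
    )
```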

### Reusable validation: a verbatim agent prompt


The report ships a copy-paste prompt for any agent on the SETT data sources (it names Marty, Yoni's Agent, Logos as targets) to reproduce the whole reconciliation on a schedule and escalate to a human only when a real inconsistency appears: a fake revision quota exists, a detached zombie is >30d old, a Cosmos-only orch has `revision_count>0` (= new bug), the identity doesn't close, or a Cosmo-only orch falls into 'other'. Cadence: weekly right after Sunday rollover (cleanest), or daily. Full prompt preserved in `/tmp/revision-divergence.txt` and the saved HTML.

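
The escalation conditions can be read as a small rule set. The sketch below is not the report's prompt; the `snapshot` keys are hypothetical names for values a reconciliation run would produce.

```python
from datetime import date


def escalation_reasons(snapshot: dict, today: date, zombie_days: int = 30) -> list:
    """Return human-readable reasons to escalate; an empty list means stay quiet."""
    reasons = []
    if snapshot["fake_revision_quota_ids"]:
        reasons.append("fake revision quota exists")
    if any((today - d).days > zombie_days for d in snapshot["detached_last_activity"]):
        reasons.append(f"detached zombie older than {zombie_days}d")
    if snapshot["cosmos_only_with_revision"]:
        reasons.append("Cosmos-only orch with revision_count > 0 (new bug)")
    if not snapshot["identity_closes"]:
        reasons.append("identity does not close")
    if snapshot["cosmo_only_other"]:
        reasons.append("Cosmo-only orch falls into 'other'")
    return reasons
```

A scheduled run would page a human only when the returned list is non-empty.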
### What this contributes to US

- RESOLVES cluster C2 at the mechanism level. Our SUB-018 had C2 as "no single source of truth, counts diverge across pages." Sergey's team now gives the exact reason: the two pages count different objects (orchs vs quotas), and the entire gap decomposes into 5 SQL-checkable problems. The "trust crisis" is not unknowable; it is fully reconciled.
- The fix is NOT "make them match"; it is normalize-to-one-grain + a standing validation agent. This reframes our opportunity #2 (trust-first number contract): the UX move is to (a) label which object each count counts, (b) surface the reconciliation bridge, (c) flag the P1-P4 anomalies inline (detached zombie, fake revision quota, prior-week stuck).
- Confirms our member-pains were real: detached/orphan revisions (C1+C2), revision-advanced-to-build flipping status (C1 status-flip), prior-week stuck on finished quota (matches SUB-017 "revisions stuck on a finished quota / Dead Ends"), fake-revision-quota = a single PG mis-tagging version work as revision (matches C3 origin-conflation AND the Mark-as-Sent mis-tag risk).
- Net-new vocabulary: orchestration vs quota vs Apollo task as the three grains; revision_count; quota type (revision/version); rollup succeeded->completed; the orchestration_states table as the true activity log. Add to the entity glossary.
- Named PGs in prod data (for People-directory): Danielle Benjamin, Matan Nissimov, Dafna Goldman Klinger, Omer Barzel, Elad Werber, Tal Mor, Aviv Shayov. (These are the revision owners on the current-week board.)

---

## PART 2: SETT Eval / Quality Pulse (net-new surface)

What it is: a live creative-quality dashboard at `sett-eval-tiktok.vercel.app/pulse` ("SETT creative quality, updated daily."). Expert raters score delivered playables BAD/OK/GOOD; the Pulse aggregates.

### Numbers (snapshot 2026-06-07)

- % of playables rated BAD, last 14 days: 23.5%, down 16.5 pts over the tracked window. (115 expert ratings, ±7.7 pts 95% CI.)
- Split (all ratings): BAD 23.5% / OK 42.6% / GOOD 33.9% (115 expert ratings).
- 14-day rolling trend chart (BAD/OK/GOOD), window 04-28 -> 06-06.
- Exploration coverage: 430 of 1,489 playables rated (28.9%).
- Toggle windows: 14d / 30d / all-time.
- "Hot takes" feed (qualitative): "Nothing happens or reacts to my clicks?" (Clash Royale); best-rated Solitaire Grand Harvest 83% GOOD (6 ratings); a 668-char writeup on Fish Of Fortune ("Music too loud... UI shows up with low opacity..."); "Not clear enough what to do" (Domino Dreams); "Buggy, the items dont move!" (Merge Dragons); snap verdict Simons Cat Match rated GOOD in 2.3s.

### What this contributes to US

- An actual creative-quality measurement loop exists, separate from the revision rate. This is a SECOND quality signal we had no record of. Revision rate (28% per Sergey's ops doc) measures rework volume; the Pulse BAD% (23.5%, falling) measures expert-judged quality of delivered work. Two different lenses on "is the work good."
- The Pulse's "430 of 1,489" coverage figure matches the "Delivered 430" in our Scorecard (SUB-017) exactly. Likely the same delivered-playable population being rated. Worth confirming the link.
- Qualitative hot-takes are a feedback goldmine: the free-text rater comments ("nothing reacts to my clicks", "not clear what to do", "items don't move") are exactly the revision-cause categories the never-built 6-type taxonomy was meant to capture (C3). The Eval tool is generating the very signal Cosmo can't.
- TikTok in the URL (`sett-eval-tiktok`): unexplained. Possibly the rating UI is TikTok-style (swipe/snap verdict, "rated GOOD in 2.3s"). Flag to ask Sergey.
- This is likely the source behind Sergey's revision-rate-vs-quality framing. Connects to our still-open C2 question (is 28% a real quality climb?): the Pulse says BAD% is DOWN 16.5 pts, so expert-judged quality is improving even if revision volume is up.
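
The Pulse's margin and coverage figures can be re-derived as a sanity check. This sketch assumes a normal-approximation (Wald) interval for a proportion; whether the Pulse computes its CI this way is an assumption, though the result is consistent with the reported ±7.7 pts.

```python
import math


def share_with_margin(p: float, n: int, z: float = 1.96) -> tuple:
    """Return (share %, margin in pts) using the Wald interval for a proportion."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return round(p * 100, 1), round(margin * 100, 1)


def coverage_pct(rated: int, total: int) -> float:
    """Percentage of playables that have at least one rating."""
    return round(100 * rated / total, 1)
```

With the snapshot inputs, `share_with_margin(0.235, 115)` gives `(23.5, 7.7)` and `coverage_pct(430, 1489)` gives `28.9`, matching the dashboard. With only 115 ratings the interval is wide, so small week-to-week moves in BAD% should be read cautiously.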

---

## Open questions to take back to Sergey / eng

- Is the Pulse's BAD% the metric leadership should watch alongside revision rate? Are they correlated?
- Does the Eval rater free-text get back into Cosmo/ACK, or does it die in the Eval tool? (If it dies, that's the C3 taxonomy gap AND the C5 knowledge-write gap, both unsolved.)
- The divergence report's validation agent: is it scheduled yet? Who owns running it (Marty / Yoni's Agent / Logos)?
- Confirm the 430 coverage number is the same delivered population as the Scorecard.
- What is "TikTok" in the eval-tiktok URL?
