Glossary
Terms, defined plainly.
The concepts that come up across the writing, each in a line or two, with a link to an authoritative source if you want to go deeper.
- McNemar's test
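A statistical test for paired nominal data. Used here to check whether a before/after change in pass/fail outcomes on the same test cases is real rather than noise. Only the cases whose outcome flipped enter the test; cases that passed both times or failed both times carry no information about the change.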
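A minimal Python sketch of the exact (binomial) form of McNemar's test. It assumes you have pass/fail results for the same cases before and after a change. The function names are illustrative and do not come from any particular library. Under the null hypothesis, each flipped case is equally likely to have flipped either way, so the discordant counts follow a Binomial(b + c, 0.5) distribution.

```python
from math import comb

def paired_counts(before: list[bool], after: list[bool]) -> tuple[int, int]:
    """Count the discordant pairs for the same test cases run before and after a change."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same cases")
    b = sum(1 for x, y in zip(before, after) if x and not y)  # pass -> fail
    c = sum(1 for x, y in zip(before, after) if not x and y)  # fail -> pass
    return b, c

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # nothing flipped, so there is no evidence of a change
    # Under H0 each flipped case is a fair coin flip: b ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(mcnemar_exact(1, 9))  # 0.021484375
```

Suppose 1 case went from pass to fail and 9 went from fail to pass. The p-value is then about 0.021. With many discordant pairs, the chi-square approximation is the more common form.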
Wikipedia
- Wilson score interval
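A confidence interval for a proportion that stays sensible with small samples and rates near 0 or 1, where the naive (Wald) interval breaks down. For example, the naive interval collapses to zero width when every trial passes or every trial fails.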
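A sketch of the Wilson score interval in Python; `z = 1.96` gives roughly 95% coverage. The function name and signature are assumptions for illustration.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion; z = 1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    # The center is pulled from p toward 1/2, which keeps the interval inside [0, 1].
    center = (p + z * z / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    # Clamp tiny floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)

low, high = wilson_interval(0, 10)  # 0 passes in 10 runs
print(f"{low:.3f} {high:.3f}")
```

Take 0 passes in 10 runs. The naive interval is [0, 0], which claims certainty the data cannot support. The Wilson interval runs from 0 to roughly 0.28.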
Wikipedia
- Cohen's kappa
A measure of agreement between two raters that corrects for the agreement you would expect by chance. Used to check how well an AI judge tracks a human grader.
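A minimal sketch of Cohen's kappa for two raters labeling the same items, here an AI judge and a human grader. It works in three steps:

- compute the observed agreement p_o;
- compute the chance agreement p_e from each rater's own label frequencies;
- return (p_o - p_e) / (1 - p_e).

The function name and sample labels are illustrative.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # p_o: fraction of items where the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement if each rater labeled independently at their own base rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical label throughout: kappa is undefined
    return (observed - expected) / (1 - expected)

judge = ["pass", "pass", "fail", "fail", "pass", "fail"]
human = ["pass", "fail", "fail", "fail", "pass", "pass"]
print(round(cohens_kappa(judge, human), 2))  # 0.33
```

Here the raters agree on 4 of 6 items, so p_o is about 0.67. But both pass half the items, so chance alone gives p_e = 0.5, and kappa is only about 0.33.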
Wikipedia
- pass^k / best-of-N
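pass^k is the probability that a stochastic agent passes the same case on all k independent runs. This is a stricter bar than pass@k, which counts the case as solved if at least one of k attempts passes. Best-of-N samples N attempts and keeps the one a scorer (for example, a verifier or reward model) rates highest.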
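A sketch of how these metrics can be estimated from n recorded runs of one case, c of which passed:

- `pass_at_k` estimates the chance that at least one of k runs passes. It uses the unbiased estimator from the Codex paper, 1 - C(n-c, k)/C(n, k).
- `pass_hat_k` uses the analogous estimate C(c, k)/C(n, k), the chance that k runs drawn from the n all passed.
- `best_of_n` assumes a hypothetical `score` function (a verifier or reward model) supplied by the caller.

```python
from math import comb
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k runs passes (c passes observed in n runs)."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k runs pass: a stricter bar that penalizes flaky behavior."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return comb(c, k) / comb(n, k)  # comb returns 0 when k > c

def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Keep the candidate the (hypothetical) scorer rates highest."""
    return max(candidates, key=score)

# An agent that passed 8 of 10 runs looks perfect under pass@3 but not under pass^3.
print(pass_at_k(10, 8, 3))             # 1.0
print(round(pass_hat_k(10, 8, 3), 3))  # 0.467
```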
Codex paper (arXiv)
- Idempotency
A property where running an operation many times has the same effect as running it once. Essential for safely retrying work after a partial failure.
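A toy sketch of one common way to get idempotency: an idempotency key. The client attaches a key to each request, and the server records the result per key, so a retry returns the stored result instead of repeating the side effect. The class and method names are hypothetical. A real service would also need the check and the write to be atomic, for example via a unique constraint on the key.

```python
from dataclasses import dataclass, field

@dataclass
class PaymentService:
    """Toy service whose charge() is idempotent per client-supplied key (hypothetical names)."""
    balance: int = 0
    results: dict[str, int] = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount: int) -> int:
        # Seen this key before: return the recorded outcome, do not charge again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.balance += amount                        # the side effect
        self.results[idempotency_key] = self.balance  # record the outcome under the key
        return self.balance

svc = PaymentService()
svc.charge("order-42", 100)
svc.charge("order-42", 100)  # client retry after a timeout: no second charge
print(svc.balance)           # 100
```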
Wikipedia
- Transactional outbox
A pattern that writes an event to an outbox table in the same database transaction as the state change. A separate relay process then reads the outbox and publishes the event to the message broker, so the state change and the event cannot diverge.
microservices.io
- Watermark
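In stream processing, a marker of event-time progress. A watermark at time t asserts that no more events with earlier timestamps are expected, so windows ending at or before t can be safely finalized. Events that arrive after that point are late data and are dropped or handled separately.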
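A toy Python sketch of event-time windowing with a bounded-out-of-orderness watermark: the largest timestamp seen so far minus a fixed delay. This is similar in spirit to Flink's `WatermarkStrategy.forBoundedOutOfOrderness`, but the class below is hypothetical and not Flink's API.

Windows are fixed-size and non-overlapping. A window is emitted once the watermark reaches its end. An event whose window has already been closed is dropped as late.

```python
from collections import defaultdict
from typing import Optional

class TumblingWindowCounter:
    """Counts events per fixed-size event-time window, emitting each window once the watermark passes it.

    Watermark = largest timestamp seen minus max_delay (a bounded-out-of-orderness heuristic).
    """

    def __init__(self, window_size: int, max_delay: int):
        self.window_size = window_size
        self.max_delay = max_delay
        self.max_seen: Optional[int] = None
        self.open_windows: dict[int, int] = defaultdict(int)

    @property
    def watermark(self) -> Optional[int]:
        return None if self.max_seen is None else self.max_seen - self.max_delay

    def on_event(self, timestamp: int) -> list[tuple[int, int]]:
        """Ingest one event; return (window_start, count) for every window the watermark closed."""
        start = timestamp - timestamp % self.window_size
        wm = self.watermark
        if wm is not None and start + self.window_size <= wm:
            return []  # late: this window was already finalized, so the event is dropped
        self.open_windows[start] += 1
        self.max_seen = timestamp if self.max_seen is None else max(self.max_seen, timestamp)
        wm = self.max_seen - self.max_delay
        closed = sorted(s for s in self.open_windows if s + self.window_size <= wm)
        return [(s, self.open_windows.pop(s)) for s in closed]

counter = TumblingWindowCounter(window_size=10, max_delay=5)
for ts in [1, 12, 8, 16, 3, 30]:
    print(ts, counter.on_event(ts))
```

- Event 8 arrives out of order but still counts, because window [0, 10) is still open.
- Event 16 moves the watermark to 11, which emits (0, 2).
- Event 3 arrives after that window closed, so it is dropped.
- Event 30 moves the watermark to 25, which emits (10, 2).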
Apache Flink docs
- Equivalence-class hashing
Canonicalizing a value to a representative form before hashing, so all members of an equivalence class (every valid way of writing the same answer) hash to the same key and receive the same reward.
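A sketch of equivalence-class hashing for answer checking, assuming two illustrative canonicalization rules:

- text is trimmed, lowercased, and has its whitespace collapsed;
- anything that parses as a number becomes an exact fraction.

The rules and function names are hypothetical. The right canonical form depends on what the task treats as equivalent.

```python
import hashlib
import re
from fractions import Fraction

def canonicalize(answer: str) -> str:
    """Map equivalent answers to one representative string (illustrative rules only)."""
    text = re.sub(r"\s+", " ", answer.strip().lower())
    try:
        # Numeric answers such as "0.5", "1/2" and ".50" all become the exact fraction "1/2".
        return str(Fraction(text.replace(" ", "")))
    except (ValueError, ZeroDivisionError):
        return text  # not a number: keep the normalized text

def answer_key(answer: str) -> str:
    """Hash of the canonical form, so every member of an equivalence class shares one key."""
    return hashlib.sha256(canonicalize(answer).encode("utf-8")).hexdigest()

def reward(answer: str, reference: str) -> float:
    return 1.0 if answer_key(answer) == answer_key(reference) else 0.0

print(reward("0.5", "1/2"), reward(" Paris ", "paris"), reward("0.25", "1/2"))  # 1.0 1.0 0.0
```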
Wikipedia
- VRPTW
The vehicle routing problem with time windows: route a fleet of vehicles so that each stop is served within its allowed time window, at minimum total cost. A canonical NP-hard combinatorial optimization problem.
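Solving VRPTW is the hard part; checking whether one candidate route respects its time windows is simple. Heuristic solvers typically run this kind of check many times. A Python sketch of the check, assuming:

- each stop has a ready time, a due time (the latest allowed start of service), and a service duration;
- a caller-supplied `travel` function gives the travel time between two stops.

Capacities, multiple vehicles, and the cost objective are left out. All names and numbers are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Stop:
    name: str
    ready: float    # earliest time service may start
    due: float      # latest time service may start
    service: float  # time spent at the stop

def schedule(route: list[Stop], travel: Callable[[Stop, Stop], float]) -> Optional[list[float]]:
    """Service start times along one vehicle's route, or None if any time window is missed."""
    t, prev, starts = 0.0, None, []
    for stop in route:
        if prev is not None:
            t += travel(prev, stop)
        t = max(t, stop.ready)  # arriving early means waiting for the window to open
        if t > stop.due:
            return None         # arriving after the window closes makes the route infeasible
        starts.append(t)
        t += stop.service
        prev = stop
    return starts

TRAVEL_TIMES = {("depot", "A"): 4.0, ("depot", "B"): 7.0, ("A", "B"): 6.0, ("B", "A"): 6.0}

def travel(src: Stop, dst: Stop) -> float:
    return TRAVEL_TIMES[(src.name, dst.name)]

depot = Stop("depot", 0.0, 0.0, 0.0)
a, b = Stop("A", 10.0, 15.0, 5.0), Stop("B", 0.0, 30.0, 5.0)
print(schedule([depot, a, b], travel))  # [0.0, 10.0, 21.0]
print(schedule([depot, b, a], travel))  # None: reaching A at 18.0 misses its window
```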
Wikipedia
- SSE
Server-sent events: a one-way stream from server to browser over a single long-lived HTTP connection. Simpler than WebSocket when you only push updates downstream.
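A sketch of the `text/event-stream` wire format that an SSE endpoint writes; the browser consumes it with the `EventSource` API. Each event is a few `field: value` lines ended by a blank line, and a multi-line payload uses one `data:` line per line. The helper name is illustrative.

```python
from typing import Optional

def sse_event(data: str, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Encode one server-sent event for a response with Content-Type: text/event-stream."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")  # the browser resends the last id as Last-Event-ID on reconnect
    if event is not None:
        lines.append(f"event: {event}")  # dispatched to listeners registered for this event name
    lines.extend(f"data: {line}" for line in data.split("\n"))
    return "\n".join(lines) + "\n\n"     # a blank line terminates the event

print(repr(sse_event("a\nb", event="progress", event_id="7")))
# 'id: 7\nevent: progress\ndata: a\ndata: b\n\n'
```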
MDN
- WebSocket
A protocol for full-duplex, real-time messaging between browser and server over a single persistent connection, established by upgrading an initial HTTP request.
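A WebSocket connection starts as an HTTP/1.1 request with `Upgrade: websocket`. The server agrees by replying `101 Switching Protocols` with a `Sec-WebSocket-Accept` header derived from the client's `Sec-WebSocket-Key`. A minimal Python sketch of that derivation as specified in RFC 6455 (the function name is illustrative):

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Value for the server's Sec-WebSocket-Accept header: base64(SHA-1(key + GUID))."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

The input and output above are the sample values from RFC 6455. After this exchange, both sides switch from HTTP to WebSocket frames on the same connection.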
MDN
- QUIC / HTTP/3
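QUIC is a UDP-based transport with built-in encryption (TLS 1.3) and multiplexed streams. A lost packet stalls only the streams whose data it carried, not the whole connection, which avoids TCP's head-of-line blocking. HTTP/3 is HTTP running over QUIC.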
MDN
- OCR / CER
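OCR (optical character recognition) extracts text from images. CER (character error rate) measures how wrong the extraction is: the number of character substitutions, deletions, and insertions needed to turn the output into the reference text, divided by the number of characters in the reference.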
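A minimal sketch of CER in Python: the Levenshtein distance between the reference transcript and the OCR output, divided by the reference length. The function names are illustrative. An evaluation may also fix a normalization policy (case, whitespace, Unicode form) before comparing.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distance from "" to each prefix of hyp
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute (free when the characters match)
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)

print(round(cer("invoice 2024", "lnvoice 2O24"), 3))  # 0.167: 2 substitutions / 12 characters
```

Insertions count as errors, so CER can exceed 1 when the output is much longer than the reference.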
Wikipedia
- ACCOUNT_USAGE
Snowflake's schema of views exposing account-level metadata: query history, storage, and credit consumption. The basis for most cost-optimization analysis.
Snowflake docs