The reason RLVR works is the reward, not the model
The most reliable post-training results of the last two years came from a narrow setup: reinforcement learning with verifiable rewards (RLVR). Instead of a human rating outputs or a second model guessing at quality, a rule-based checker scores each generation against objective, task-grounded criteria. The model generates, the checker returns a verdict, and the policy updates against that verdict.
It works best in two places: mathematical reasoning, where an answer is right or wrong, and code generation, where a compiler and a test suite decide. The common thread is not the domain. It is that a cheap, deterministic verifier exists. Where you can compute correctness without a human and without a learned reward model, RL has a clean signal to optimise against. Where you cannot, you are back to expensive labels and reward models that drift.
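As a minimal sketch of what a cheap, deterministic verifier looks like in the math case, the checker below compares a model's final answer with a reference after converting both to exact fractions. The function name and the assumed answer format (plain numbers such as "0.5" or "1/2") are illustrative choices, not a description of any particular training stack.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the two answers are numerically equal, else 0.0.

    Hypothetical helper: assumes answers are plain numbers such as
    "0.5", "1/2", or "42". Anything unparseable scores 0.0.
    """
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0
    return 1.0 if same else 0.0
```

No human and no learned judge is involved: the same pair of strings always yields the same score, which is the property the rest of this piece looks for in advertising.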
That framing is worth bringing to advertising, because it changes which problems are tractable. Most of advertising is not verifiable. A few corners of it are. VAST is one of them.
Most adtech signals make terrible rewards
Think about what you would actually want a media or creative model to optimise. Creative quality is subjective and needs a human or a learned judge. Fill rate, click-through, completion, and revenue are real signals, but they are noisy, they arrive hours or days late, and they are confounded by auction dynamics, audience, and seasonality. Worse, they are gameable: a model can learn to produce tags that score well on a proxy metric without being correct, which is exactly the failure mode verifiable rewards were meant to avoid.
None of those signals give you what a verifier gives you: an immediate, deterministic, ungameable verdict on a single output, with no human in the loop and no second model to train and maintain. For most of advertising, that verdict does not exist.
For VAST, it does. A VAST tag either conforms to the IAB specification at its declared version or it does not. The check is structural, it is grounded in a published standard, and it returns the same answer every time. That is the precise property that made code and math good RLVR domains, and it is rare enough in advertising to be worth building on.
Why VAST is verifiable when the rest of the stack is not
A VAST tag is a contract, not an opinion. The spec says an InLine ad must carry at least one Impression, that a MediaFile declares its delivery method, MIME type, and dimensions, that wrapper chains terminate, and that a Duration is present and well-formed. These are not matters of taste. They are checkable against the IAB VAST 2.0 through 4.3 specifications, and a violation is a violation regardless of who is looking.
The catch that makes this useful rather than trivial: a broken VAST tag can still be well-formed XML. It parses. A model generating tags will happily produce output that loads cleanly and is still wrong: missing a required Impression, pointing at an insecure MediaFile, or burying the creative four wrappers deep. None of that throws. So 'it parsed' is not a reward. 'It conforms to the spec' is, and computing that is exactly what a linter does.
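The gap between "it parsed" and "it conforms" can be shown with the standard library alone. The sketch below is a toy: the sample tag is made up, and `inline_ads_missing_impression` is a hand-rolled stand-in for a single rule, not how vastlint implements its checks.

```python
import xml.etree.ElementTree as ET

# A VAST-shaped document that is well-formed XML but has no <Impression>.
BROKEN_TAG = """<VAST version="4.2">
  <Ad id="1">
    <InLine>
      <AdSystem>example</AdSystem>
      <AdTitle>demo</AdTitle>
      <Creatives/>
    </InLine>
  </Ad>
</VAST>"""


def parses(xml: str) -> bool:
    """True if the string is well-formed XML; says nothing about VAST."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


def inline_ads_missing_impression(xml: str) -> int:
    """Toy structural check: count InLine ads with no Impression child."""
    root = ET.fromstring(xml)
    return sum(
        1 for inline in root.iter("InLine") if inline.find("Impression") is None
    )
```

Here `parses(BROKEN_TAG)` is true while the structural check reports one offending ad, so a reward based on parsing alone would score this tag as a success.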
vastlint is that linter, built as one Rust core with a Python binding that runs in-process. For a training loop, the binding is the relevant part: validation is a function call, not a subprocess and not a network hop, and it returns a structured result you can turn into a reward in a couple of lines.
What the verifier has to give a training loop
- Determinism: the same tag scores the same every step, so the reward is stable and the policy is not chasing a moving target.
- Speed in-process: training and rejection sampling touch millions of generations, so a subprocess or network call per sample would dominate the loop. A function call against the Rust core clears a production-size tag in well under a millisecond.
- Structured output, not a verdict string: per-rule severities and counts, so you can shape a dense reward instead of a single pass or fail bit.
- Spec grounding: every rule traces to a section of the IAB VAST spec, so the thing you optimise toward is standards conformance, not a vendor heuristic the model can overfit.
- Version awareness: the tag declares a version, and the checker validates against that version's rules, so the model is rewarded for being correct at the spec it claims.
    import vastlint

    # Verifiable reward: deterministic, sub-millisecond, no human, no learned judge.
    def vast_reward(xml: str) -> float:
        result = vastlint.validate(xml)
        if result.valid:
            return 1.0
        s = result.summary
        return -1.0 * s.errors - 0.25 * s.warnings  # shaped, not just pass/fail

    # Rejection sampling: turn a base model into a clean SFT set without labels.
    def build_sft_examples(prompt: str, samples: list[str]) -> list[dict]:
        return [
            {"prompt": prompt, "completion": xml}
            for xml in samples
            if vastlint.validate(xml).valid
        ]
Binary is potent. Dense is better.
The simplest version of this is a binary reward: valid is one, invalid is zero. That is enough to move a model, and it is how most verifiable-reward setups start. But a binary signal is sparse. Early in training the model is wrong almost every time, so almost every sample returns zero and there is little gradient to learn from.
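To make the sparsity problem concrete, assume a policy-gradient setup that subtracts the group mean reward as a baseline. That is one common choice, picked here purely for illustration, and the error counts below are invented. When every sample in a group scores zero, every advantage is zero and the update carries no information about which sample was closer to correct.

```python
def centered_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean from each reward (a simple baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Early in training: every sample is invalid, so the binary reward is all zeros.
early_binary = [0.0, 0.0, 0.0, 0.0]

# The same four samples scored by negated error count
# (hypothetical counts: 10, 7, 3, and 1 errors).
early_dense = [-10.0, -7.0, -3.0, -1.0]

binary_adv = centered_advantages(early_binary)  # all zeros: nothing to rank
dense_adv = centered_advantages(early_dense)    # the one-error sample ranks highest
```

With the dense scores, the sample with one error gets the largest positive advantage even though none of the four tags is valid yet.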
Because vastlint returns per-rule counts rather than a single bit, you do not have to stay binary. A tag with one error is closer to correct than a tag with ten, and you can say so: weight by severity, penalise an insecure MediaFile more heavily than a missing recommended mezzanine, reward getting from ten errors to one even when the tag is not yet clean. Recent RL work keeps finding the same thing: when a verifiable reward is sparse, it pays to make it dense. The structured result is what lets you do that here without inventing a separate scorer.
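One way to express that weighting is sketched below. The `Issue` record, the rule IDs, and the penalty values are all hypothetical stand-ins chosen for the example. They are not vastlint's result type or rule catalog, and in this sketch "clean" means no findings at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical per-rule finding; not vastlint's actual result type."""
    rule_id: str
    severity: str  # "error" or "warning" in this sketch


# Assumed weights: a base penalty per severity, plus per-rule overrides.
SEVERITY_PENALTY = {"error": 1.0, "warning": 0.25}
RULE_PENALTY = {"insecure-mediafile": 2.0}  # made-up rule ID, weighted up


def shaped_reward(issues: list[Issue]) -> float:
    """1.0 for a clean tag, otherwise the negated sum of weighted penalties."""
    if not issues:
        return 1.0
    return -sum(
        RULE_PENALTY.get(i.rule_id, SEVERITY_PENALTY[i.severity]) for i in issues
    )
```

The per-rule override is where domain judgment enters: which violations matter most is a choice for whoever owns the objective, and the weights should be treated as tunable.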
Per-rule output is also what makes the model debuggable during training. Track the valid rate across checkpoints and you get one number. Track which rule IDs fail most and you get a map of what the model has not learned yet, so a regression shows up as a specific spec violation instead of a number that quietly slid.
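A sketch of that bookkeeping, assuming each generated tag has already been reduced to the list of rule IDs it violated (the IDs used in the usage note are invented for illustration):

```python
from collections import Counter


def valid_rate(batch: list[list[str]]) -> float:
    """Fraction of samples with no violations."""
    return sum(1 for rules in batch if not rules) / len(batch)


def top_failing_rules(batch: list[list[str]], n: int = 3) -> list[tuple[str, int]]:
    """Count how many samples trip each rule ID and return the n most common.

    `batch` holds, per generated tag, the rule IDs it violated; an empty
    list means the tag was clean. Each rule is counted once per sample.
    """
    counts = Counter(rule for rules in batch for rule in set(rules))
    return counts.most_common(n)
```

Logged per checkpoint, `valid_rate` gives the headline number, while `top_failing_rules` shows whether a dip comes from, say, a hypothetical `missing-impression` rule reappearing.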
Where a VAST reward actually fits
The obvious target is any model that writes or repairs ad tags: a creative agent that emits a VAST document, a fine-tuned model that takes a broken tag and returns a fixed one, an SSAI or templating system that assembles tags from parts. In agentic pipelines this lines up with the Ad Context Protocol, where a build_creative step can produce a vast asset and a sync_creatives step hands it to a seller. Training that generator against a verifier means fewer rejections downstream, because the tag was scored against the same standard the seller will check it with.
One honest boundary. Spec validity is necessary, not sufficient. A tag that passes vastlint can still fail to play in a specific player, exchange, or CTV environment, because those impose constraints beyond the standard. So validity is a reward component, the verifiable floor, not the whole objective. That is a feature, not a limitation: it is precisely the part of the problem that can be scored without a human, which is what makes it the right place to apply RL with verifiable rewards. Use it for the floor, and keep human or metric-based signals for the parts that genuinely are not verifiable.
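One possible way to combine the two kinds of signal is to gate the non-verifiable score behind the verifiable floor. This is a sketch under assumptions: the function name, the penalty value, and the premise that the downstream score is scaled to the range 0 to 1 are all illustrative.

```python
def combined_reward(
    is_valid: bool,
    downstream_score: float,
    floor_penalty: float = -1.0,
) -> float:
    """Gate a non-verifiable score behind the verifiable floor.

    is_valid: verdict from the spec checker.
    downstream_score: any human or metric-based signal, assumed in [0, 1].
    An invalid tag gets the fixed penalty regardless of its downstream
    score, so the policy cannot trade conformance for a better proxy.
    """
    if not is_valid:
        return floor_penalty
    return downstream_score
```

Gating is only one design. An additive mix of the two terms would also work, but it lets a high proxy score partially offset a spec violation.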
The point is narrow and, hopefully, useful. Advertising has very few places where you can score a model output deterministically and cheaply. Ad-tag conformance is one of the clearest, the verifier already exists, and it is a pip install away.
Related docs on vastlint
- The Python package: install, the structured result shape, and the training, agentic, and AdCP use cases.
- Where validation sits in MCP, AAMP, and AdCP agent workflows.
- The same Rust core exposed as an MCP tool for agents that call validation over the protocol.
- The rule catalog and severities the verifier scores against.
Authoritative references
- Overview of the post-training paradigm: rule-based verifiers as direct supervision, with success concentrated in math and code.
- On extending verifiable-reward RL beyond the math and code domains where it first worked.
- Evidence that dense, shaped reward beats sparse binary reward when verifiable signals are sparse.
- The published IAB standard that defines what conformance, and therefore the reward, means.
- The MCP-based protocol whose vast creative asset a trained generator would produce and a seller would validate.
Put a verifier in your training loop
vastlint is a deterministic, sub-millisecond VAST verifier with a zero-dependency Python binding. Install it and turn spec conformance into a reward signal.