Adam Sioud

Mailman

What

So this is my big project, which will likely span the whole first half of 2026. I've been immersing myself in RL for LLM agents, and there are three specific environments I want to build. I'm targeting one problem area, inspired by the Zapier Mock MCP Environment Challenge: how to mock databases, scale them for training, and build environments and tasks for Zapier-style workflows. A central piece is synthetic databases: neutral, production-like data that is mocked or synthetically generated rather than taken from your real DB, but that behaves like the real thing. I've looked at a lot of research and other good material, and will note and cite it here as time goes on. The project uses, and is inspired by, a lot of Prime Intellect's work. The first environment, email-to-cc-bcc, is fully completed; you can read about it below. Beyond that, I describe and explore the stack, some findings, and a little of the DB work. This article is very hacky at the moment; to be continued.

The Stack

The environment is a verifiers environment, which does a lot for us: state management, the multi-turn conversation loop, trajectory tracking, and rubric scoring. To work with it we need a trainer, and that can be anything that speaks the protocol verifiers accepts: SkyRL, Prime RL, whatever. I use t2f-trainer, a fork of verifiers-rl extracted into its own package. The connection point between the model, the environment, and the trainer is the env.generate() function in verifiers. And that's it.

| Piece | What |
| --- | --- |
| model | HuggingFace model ID. Loaded into vLLM for inference and into transformers/deepspeed for training. |
| environment | A verifiers environment: dataset + reward functions + conversation loop. Versioned Python package, importable, reproducible. |
| trainer | Orchestrates rollouts, computes advantages, runs the training loop, syncs weights back to vLLM via NCCL. |

Two GPUs on one node: one for inference (vLLM), one for training. After each training step, the trainer pushes updated weights to the inference server over NCCL so the next batch of rollouts uses the latest policy.

Verifiers Environment

What is an Environment

Every environment in verifiers is a multi-turn environment. That's the core class. A single-turn environment is just a multi-turn environment with turns set to 1, for example a simple question and answer environment.

An environment packages three things: a dataset (task inputs), a harness (tools, sandboxes, context), and a rubric (reward functions that score the model). The environment is a loop: the model responds to something, the environment reacts, repeat until done.

The dataset gives you the starting state: a prompt and an info field with whatever the environment needs (follow-ups, ground truths, seeds). What turns a simple Q&A dataset into an agentic environment is env_response, the function that reacts to each model turn and feeds back the next message.

The Rollout Loop

Each rollout gets a state dict that lives for its full duration. It carries the dataset fields (state["info"], state["prompt"]), runtime data that verifiers populates as the rollout progresses (trajectory, completion, reward), and anything you store there yourself. state is passed to every function in the rollout: setup_state, env_response, stop conditions, cleanup, and reward functions all read from and write to the same object.

What happens
1. setup_state(state) prepares per-rollout resources
2. Model sees the conversation so far, generates a response
3. Stop conditions checked. If done, go to 6
4. env_response(messages, state) reacts: returns new messages to append
5. Back to 2
6. Final completion rendered, rubric scores the rollout, cleanup runs
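
The six steps above can be sketched as a plain asyncio loop. This is a minimal stand-in, not the verifiers implementation: fake_model, is_done, MAX_TURNS, and the turn counter in state are all hypothetical names introduced for the illustration, and scoring and cleanup are left out.

```python
import asyncio

MAX_TURNS = 3  # assumed turn limit for this sketch


async def fake_model(messages):
    """Stand-in for the policy: acknowledges the last message it saw."""
    return {"role": "assistant", "content": "ack: " + messages[-1]["content"]}


async def setup_state(state):
    state["turn"] = 0  # per-rollout bookkeeping lives in state
    return state


async def env_response(messages, state):
    # "Pre-planned" environment: the next user message comes from the dataset.
    follow_up = state["info"]["follow_ups"][state["turn"] - 1]
    return [{"role": "user", "content": follow_up}]


def is_done(state):
    # Stop at the turn limit, or once every follow-up has been answered.
    limit = min(MAX_TURNS, len(state["info"]["follow_ups"]) + 1)
    return state["turn"] >= limit


async def rollout(prompt, info):
    state = await setup_state({"prompt": prompt, "info": info})  # step 1
    messages = list(prompt)
    while True:
        messages.append(await fake_model(messages))              # step 2
        state["turn"] += 1
        if is_done(state):                                       # step 3
            break
        messages.extend(await env_response(messages, state))     # steps 4 and 5
    state["completion"] = messages[len(prompt):]                 # step 6 (no scoring here)
    return state
```

Running asyncio.run(rollout([{"role": "user", "content": "hi"}], {"follow_ups": ["a", "b"]})) produces three assistant turns with the two follow-ups interleaved between them.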

env_response is the heart of it. For a "pre-planned" environment, it pulls the next follow-up from state["info"]. For a tool environment, it executes the model's tool calls and returns results. Verifiers doesn't know or care what happens inside, it just expects messages back.

python
# simple: pull follow-up from dataset
async def env_response(self, messages, state, **kwargs):
    turn = len([m for m in messages if m["role"] == "assistant"]) - 1
    return [{"role": "user", "content": state["info"]["follow_ups"][turn]}]

# tool: execute and return results
async def env_response(self, messages, state, **kwargs):
    tool_calls = parse_tool_calls(messages[-1])
    results = []
    for call in tool_calls:
        result = await state["server"].call_tool(call.name, call.args)
        results.append({"role": "tool", "content": result})
    return results

Lifecycle

Verifiers gives you hooks at each stage of a rollout. You set things up, react to each turn, decide when to stop, and clean up.

| Hook | When | What you do |
| --- | --- | --- |
| setup_state | start of rollout | Spin up databases, servers, per-rollout resources. Store handles in state. |
| env_response | after each model turn | React to the model's output. Return the next messages. |
| @vf.stop | after each turn | Custom stop conditions. Return True to end. Built-in max_turns_reached is always there. |
| @vf.cleanup | end of rollout | Tear down per-rollout resources. Must be idempotent. |
| @vf.teardown | environment shutdown | Clean up anything shared across all rollouts. |

State

state is a mutable dict that lives for the duration of a rollout. It's accessible everywhere: in env_response, stop conditions, cleanup, and reward functions. Verifiers populates some fields automatically. You add whatever your environment needs.

| Field | Contents |
| --- | --- |
| state["prompt"] | initial prompt messages |
| state["info"] | per-example data from the dataset (ground truths, seeds, follow-ups) |
| state["trajectory"] | list of steps, each with prompt, completion, reward per turn |
| state["completion"] | final rendered completion (after rollout ends) |
| state["reward"] | final reward (after scoring) |

You store your own resources in state too: database connections, server handles, flags. Reward functions can read anything in state, and earlier reward functions can store computed values for later ones to consume.

Scoring

Final rubric scoring happens after the rollout completes. The rubric is a list of reward functions with weights. Each function receives what it asks for by name: completion, state, info, answer. Returns a float, typically 0.0 to 1.0. Intermediate per-turn rewards can also be attached during the rollout via trajectory steps.

python
import verifiers as vf

# parse_answer is an environment-specific helper that extracts the model's answer
async def correctness(completion, state, **kwargs):
    expected = state["info"]["ground_truths"]
    actual = parse_answer(completion)
    return 1.0 if actual == expected else 0.0

rubric = vf.Rubric(funcs=[correctness], weights=[1.0])

Multiple reward functions combine via weighted sum. Use add_metric() with weight=0 to track things without affecting the final reward. Which is neat.
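
To make the weighted sum concrete, here is a small sketch in plain Python. It does not use the verifiers API: combine and the response_length metric are hypothetical names, and the only point is that a weight of 0 keeps a value visible without moving the reward.

```python
def combine(scores, weights):
    """Weighted sum of per-function scores; both dicts are keyed by function name."""
    return sum(weights[name] * score for name, score in scores.items())


# response_length is a hypothetical tracked metric: weight 0, so it is logged only.
weights = {"correctness": 1.0, "response_length": 0.0}
scores = {"correctness": 1.0, "response_length": 0.42}

reward = combine(scores, weights)
```

Here reward depends only on correctness, while response_length still shows up in scores for logging.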

Topology of a Training Step

One full cycle. The orchestrator prepares what the trainer needs: rollouts, advantages, microbatches. The trainer re-evaluates the same tokens under the current policy, computes loss, and updates weights. Then syncs to vLLM so the next batch generates with the new policy.

[Diagram: topology of a training step. Inside the trainer process, the Orchestrator (steps 1-3) calls env.generate(), computes advantages (reward - group_mean), and packages microbatches (input_ids, loss_mask, logprobs, advantages); the Trainer (steps 4-6) runs the forward pass, loss + backprop, the weight update, and the weight sync. Verifiers (env) runs the multi-turn conversation loop, calls env_response() between turns, scores the rollout with reward functions, and returns the scored trajectory. vLLM hosts the model weights behind an OpenAI-compatible API, returns completions + token IDs + logprobs, and accepts weight sync. Channel A carries completions, tokens + logprobs; channel B is the NCCL weight sync.]

Channel A: Verifiers calls vLLM via OpenAI HTTP API for completions. Channel B: Trainer pushes updated weights to vLLM via NCCL.

The orchestrator lives inside the trainer process but runs in a background thread. It calls env.generate() with a vLLM client. Verifiers drives the multi-turn loop, calls vLLM for completions (channel A), scores the rollout with reward functions, and returns trajectories with token IDs, log-probs, and rewards. The orchestrator then computes group-relative advantages (reward minus group mean), packages everything into microbatches, and hands them to the training loop.

The trainer takes each microbatch, runs a forward pass on the same tokens to get fresh log-probs under the current policy, computes the importance ratio (the exponential of trainer log-probs minus inference log-probs), multiplies by the advantage, and backpropagates. After processing all microbatches, one optimizer step updates the weights. Then NCCL broadcasts the new weights to vLLM (channel B) so the next batch of rollouts uses the updated policy. That is one step. Repeat.
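
The per-token arithmetic can be sketched without any training framework. This is an illustration under assumptions, not the trainer's code: policy_loss is a hypothetical name, the inputs are plain Python lists, and the clipping that GRPO-style objectives usually add is omitted because it is not described here.

```python
import math


def policy_loss(trainer_logprobs, inference_logprobs, loss_mask, advantage):
    """Mean of -(importance ratio * advantage) over the unmasked tokens of one rollout."""
    total, count = 0.0, 0
    for new_lp, old_lp, keep in zip(trainer_logprobs, inference_logprobs, loss_mask):
        if not keep:
            continue  # masked tokens (e.g. prompt or environment messages) do not train
        ratio = math.exp(new_lp - old_lp)  # log-prob difference, exponentiated
        total += -ratio * advantage
        count += 1
    return total / max(count, 1)
```

When the trainer and inference log-probs agree, the ratio is 1 and the loss reduces to minus the advantage, so minimizing it pushes up the log-probs of rollouts with a positive advantage.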

GRPO

Group Relative Policy Optimization. The model generates multiple rollouts for the same prompt. Each rollout gets a reward from the environment. The advantage is just the reward minus the group mean: rollouts that scored better than average get a positive advantage, rollouts that scored worse get a negative one.

A concrete example. The model tries the same prompt three times:

| Rollout | Answer | Reward | Advantage |
| --- | --- | --- | --- |
| Rollout 1 | a, b, c, d | 1.0 | +0.33 |
| Rollout 2 | a, b, d, c | 0.0 | -0.67 |
| Rollout 3 | a, b, c, d | 1.0 | +0.33 |

Average reward: 0.67. Rollout 2 scored below average, so its advantage is negative: the gradient pushes its log-probs down. Rollouts 1 and 3 scored above average, so their log-probs get pushed up. In effect, the update reinforces whatever rollouts 1 and 3 did differently from rollout 2.
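
The same numbers fall out of a few lines of Python. group_advantages is a hypothetical helper name. Note that many GRPO implementations also divide by the group's standard deviation; this sketch follows the mean-only version described here.

```python
def group_advantages(rewards):
    """Advantage of each rollout = its reward minus the mean reward of its group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


advantages = group_advantages([1.0, 0.0, 1.0])
print([round(a, 2) for a in advantages])  # [0.33, -0.67, 0.33]
```

The advantages in a group always sum to zero, so if every rollout gets the same reward there is no learning signal for that prompt.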

email-to-cc-bcc

Prime Intellect environment hub  ·  Synthetic data  ·  Dataset  ·  Trained model

Email-to-CC-BCC is an environment built to evaluate and train models on recipient placement via RLVR (GRPO), using the verifiers framework and t2f-trainer. The thinking is that the task is a good fit for RL because it has a clear verifiable reward signal (set match against ground truth), structured output (JSON), and enough ambiguity in the CC/BCC split that a model has to reason about roles, sensitivity, and hierarchy rather than pattern match. A synthetic dataset is generated using NVIDIA NeMo DataDesigner, then used to train Qwen models and compare against OpenAI baselines. The environment supports 1, 2, and 3 turn rollouts as a form of curriculum.

The task

Given 7 people and an email thread, assign each person to To, CC, or BCC. The thread evolves across turns and recipients shift.

Each row has 7 people with name, email address, and role. The model sees the full roster and an email, then outputs a JSON object placing each relevant person into the correct field. On follow-up turns, a new email arrives in the thread and the model re-evaluates.

turn 1
You are given an email thread. For each email, assign
the correct To, CC, and BCC recipients. Recipients may
change between emails as the situation evolves.

Roster:
- Sarah Chen <sarah.chen@acme.com> - Project Lead
- Mike Torres <mike.t@clientcorp.com> - Client PM
- Lisa Park <lisa.park@acme.com> - VP Engineering
  ...

Subject: Q3 deadline extension request

Hi team, I wanted to flag that we're going to need
a 2-week extension on the Phoenix deliverables...

Reply with JSON only: {"to": [...], "cc": [...], "bcc": [...]}
Use email addresses, not names.
json
{
  "to": ["sarah.chen@acme.com"],
  "cc": ["mike.t@clientcorp.com"],
  "bcc": ["lisa.park@acme.com"]
}

max_turns controls how many turns the environment runs (1, 2, or 3). Scoring uses Jaccard index per field per turn, averaged across all turns in the rollout.
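
A minimal sketch of that scoring, assuming predictions and ground truths are already parsed into dicts of email-address lists. The function names are hypothetical, and treating two empty fields as a perfect match is my assumption for the sketch, not a confirmed detail of the environment. The weights are the reweighted rubric values quoted later for v2 (To 0.45, CC 0.45, BCC 0.10).

```python
def jaccard(predicted, expected):
    """Jaccard index of two recipient lists: |intersection| / |union|."""
    p, e = set(predicted), set(expected)
    if not p and not e:
        return 1.0  # assumption: both empty counts as a match
    return len(p & e) / len(p | e)


def score_turn(pred, truth, weights):
    """Weighted Jaccard over the to/cc/bcc fields of one turn."""
    return sum(w * jaccard(pred.get(f, []), truth.get(f, [])) for f, w in weights.items())


def score_rollout(preds, truths, weights):
    """Average of the per-turn scores across the rollout."""
    per_turn = [score_turn(p, t, weights) for p, t in zip(preds, truths)]
    return sum(per_turn) / len(per_turn)


WEIGHTS = {"to": 0.45, "cc": 0.45, "bcc": 0.10}
truth = {"to": ["a@x.com"], "cc": ["b@x.com", "c@x.com"], "bcc": []}
pred = {"to": ["a@x.com"], "cc": ["b@x.com"], "bcc": []}
turn_score = score_turn(pred, truth, WEIGHTS)
```

In this example To is exact, CC gets half credit for naming one of two people, and BCC matches as empty, so the turn scores 0.45 + 0.225 + 0.10.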

The dataset

Verifiers uses HuggingFace datasets natively, so hosting on HuggingFace and loading via load_dataset is the natural thing to do.

The dataset is a flat parquet with 7 columns. One row is one complete 3-turn scenario:

| Column | Description |
| --- | --- |
| email_list | The 7 people, always present. Each entry is name <email> - role. |
| question_1 | Initial email (subject + body). The environment prepends email_list and appends the instruction at runtime. |
| question_2, question_3 | Follow-up replies as the situation evolves (people added, removed, escalation, etc.). |
| answer_1, answer_2, answer_3 | Ground truth JSON: {"to": [...], "cc": [...], "bcc": [...]} |

load_environment reshapes the flat columns into the verifiers format:

python
"prompt": [{"role": "user", "content": email_list + question_1 + instruction}],
"info": {
    "follow_ups": [question_2, question_3],
    "ground_truths": [answer_1, answer_2, answer_3],
    "num_turns": max_turns,
}
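
As a self-contained sketch of that reshaping: the function name to_verifiers_row, the assumption that email_list is a list of strings, and the exact prompt layout are mine; only the column names and the prompt/info structure come from the environment.

```python
INSTRUCTION = (
    'Reply with JSON only: {"to": [...], "cc": [...], "bcc": [...]}\n'
    "Use email addresses, not names."
)


def to_verifiers_row(row, max_turns=3):
    """Turn one flat dataset row into the prompt/info shape shown above."""
    # Assumption: email_list is a list of "name <email> - role" strings.
    roster = "Roster:\n" + "\n".join(f"- {person}" for person in row["email_list"])
    content = f"{roster}\n\n{row['question_1']}\n\n{INSTRUCTION}"
    return {
        "prompt": [{"role": "user", "content": content}],
        "info": {
            "follow_ups": [row["question_2"], row["question_3"]],
            "ground_truths": [row["answer_1"], row["answer_2"], row["answer_3"]],
            "num_turns": max_turns,
        },
    }
```

With a HuggingFace dataset, a function like this could be applied with Dataset.map, dropping the original flat columns afterwards.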

Generation uses DataDesigner. Each row is built in three steps:

  1. Sample people - 7 per row with name, email, role. 2-6 start active, rest are reserve. One company domain per row, one external domain. Roles sampled without replacement.
  2. Assign recipients per turn - deterministic. Sensitivity drives BCC, hierarchy drives To vs CC. Changes between turns retry if they'd be a no-op.
  3. LLM writes email content - given the roster and visible recipients (To/CC only, no BCC), generate a realistic email thread. v1 used google/gemini-2.5-flash via OpenRouter.

The first version of the dataset (v1) used a random shuffle to assign recipients to To, CC, and BCC. The labels were not derivable from the email content. Running frontier models (GPT-5, Claude Sonnet 4.6, GPT-4.1-mini) against it confirmed the problem: they all hit the same ~0.44 ceiling. Looking at individual rollouts, the models were often more correct than the ground truth. Training on those labels would have pushed models away from good judgment and toward matching random assignments.

The v2 dataset rewrote the data generation pipeline. Recipient routing is now deterministic: the person being asked to act goes to To (based on role priority), managers and stakeholders go to CC (based on hierarchy), compliance observers go to BCC (only from the reserve pool, never from active participants). The LLM generating the email content never sees BCC names, so they cannot leak into the text. A post-generation validator rejects rows where any invariant is broken. ~5000 rows passed validation out of 7500 generated. Scoring uses Jaccard index per field per turn, averaged across the rollout.
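
A post-generation check for the two BCC invariants could look roughly like this. It is a hypothetical sketch, not the actual validator: validate_turn, its arguments, and the simple substring test for name leakage are all assumptions.

```python
def validate_turn(recipients, active, names, email_text):
    """Return the invariant violations for one turn (an empty list means it passes)."""
    problems = []
    text = email_text.lower()
    for addr in recipients["bcc"]:
        if addr in active:
            problems.append(f"{addr}: BCC must come from the reserve pool")
        if names[addr].lower() in text:
            problems.append(f"{addr}: BCC name leaked into the email text")
    return problems


names = {"sarah.chen@acme.com": "Sarah Chen", "lisa.park@acme.com": "Lisa Park"}
recipients = {"to": ["sarah.chen@acme.com"], "cc": [], "bcc": ["lisa.park@acme.com"]}
active = {"sarah.chen@acme.com"}

clean = validate_turn(recipients, active, names, "Hi Sarah, we need a 2-week extension.")
leaky = validate_turn(recipients, active, names, "Hi Sarah, looping in Lisa Park here.")
```

A row would be rejected as soon as any of its turns returns a non-empty list.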

The ceiling

On the v2 dataset with reweighted rubric (To 0.45, CC 0.45, BCC 0.10), frontier models reach ~0.50 reward. The ceiling moved up from v1 because the labels are now learnable, but models still disagree with the ground truth on edge cases. I picked Qwen3-0.6B as the training target, which scored 0.253 on the baseline.

Before training, targeted evaluations revealed the failure modes. On three-turn rollouts, the 0.6B model collapsed: 40% of turn-3 responses had no JSON at all (stuck in its thinking block). On single-turn, format was clean (100% valid JSON), but only 92% used email addresses as instructed.

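| Model | Turns | Reward | To | CC |
| --- | --- | --- | --- | --- |
| GPT-4.1-mini | 3 | 0.488 | 0.683 | 0.373 |
| GPT-4.1-mini | 1 | 0.470 | 0.593 | 0.425 |
| Qwen3-8B | 3 | 0.482 | 0.632 | 0.409 |
| Qwen3-8B | 1 | 0.429 | 0.539 | 0.384 |
| Qwen3-0.6B | 3 | 0.115 | 0.050 | 0.194 |
| Qwen3-0.6B | 1 | 0.191 | 0.124 | 0.282 |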

For GPT-4.1-mini and Qwen3-8B, three-turn rollouts score higher than single-turn because later emails make the action owner more explicit. The 0.6B model shows the opposite: going single-turn raises reward from 0.115 to 0.191 (about 1.7x), because the model cannot survive the multi-turn format burden.

The approach

Start simple. Single-turn only to remove multi-turn format collapse. Add two small format rewards to the rubric: format_correct (binary, valid JSON with exactly the right keys) and email_format (fractional, proportion of recipients that are email addresses), each at 0.05 weight. The format reward is scaffolding: it occupies the first phase of training, then becomes invisible once solved.

| Function | Weight |
| --- | --- |
| to_correct | 0.40 |
| cc_correct | 0.40 |
| bcc_correct | 0.10 |
| format_correct | 0.05 |
| email_format | 0.05 |
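
The two format rewards can be sketched as follows. This is an approximation under assumptions: it takes the answer text with any thinking block already stripped, uses a deliberately loose email regex, and scores an output with no recipients at all as 1.0 for email_format. The helper parse_recipients is a hypothetical name.

```python
import json
import re

EXPECTED_KEYS = {"to", "cc", "bcc"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # loose check, not full address validation


def parse_recipients(text):
    """Return the parsed dict if text is JSON with exactly the to/cc/bcc list fields, else None."""
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        return None
    if not all(isinstance(v, list) for v in data.values()):
        return None
    return data


def format_correct(text):
    """Binary: 1.0 for valid JSON with exactly the right keys, else 0.0."""
    return 1.0 if parse_recipients(text) is not None else 0.0


def email_format(text):
    """Fractional: share of listed recipients that look like email addresses."""
    data = parse_recipients(text)
    if data is None:
        return 0.0
    recipients = [r for field in data.values() for r in field]
    if not recipients:
        return 1.0  # assumption: nothing listed means nothing malformed
    good = sum(1 for r in recipients if isinstance(r, str) and EMAIL_RE.match(r))
    return good / len(recipients)
```

For example, an answer that lists one address and one bare name gets full format_correct credit but only 0.5 for email_format.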

Training: GRPO with LoRA on Qwen3-0.6B. Two A6000 48GB GPUs (one inference via t2f-vllm, one trainer). 100 steps, 512 rollouts per step, lr=1e-5. Took 1h52m.

Results

Evaluated on a 50-example sample from a 91-example held-out test split generated independently from the training data.

Single-turn (held-out test set).

| Model | Reward | To | CC | Format |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B (base) | 0.234 | 0.081 | 0.245 | 1.000 |
| Qwen3-0.6B (trained) | 0.710 | 0.822 | 0.695 | 1.000 |
| GPT-4.1-mini | 0.486 | 0.540 | 0.390 | 1.000 |

Three-turn (held-out test set).

| Model | Reward | To | CC | Format |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B (base) | 0.188 | 0.051 | 0.204 | 0.807 |
| Qwen3-0.6B (trained) | 0.449 | 0.529 | 0.406 | 0.709 |
| GPT-4.1-mini | 0.537 | 0.624 | 0.434 | 1.000 |

[Figure: Training curves over 100 steps. x-axis: training step (0 to 100); y-axis: score (0 to 1.0). Series: reward, to, cc, format. Reference lines: GPT-4.1-mini and the 0.6B baseline.]

The training had three phases. In the first ~20 steps, the model learned formatting: format_correct went from 0.18 to 1.0. Schema validity acts as a gate on the rest of the reward: To/CC/BCC score zero unless the output parses as valid JSON. So the model had to solve format first to unlock any semantic credit. In steps 20-50, with format solved, reward shifted to recipient placement. In steps 50-100, To and CC continued climbing steadily.
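
The gating effect can be shown with the rubric weights from the table above. This is a sketch of the arithmetic only: total_reward is a hypothetical name, and in the real environment the gate presumably falls out of parsing (no valid JSON, nothing to score) rather than an explicit branch.

```python
WEIGHTS = {
    "to_correct": 0.40,
    "cc_correct": 0.40,
    "bcc_correct": 0.10,
    "format_correct": 0.05,
    "email_format": 0.05,
}


def total_reward(scores):
    """Weighted sum where placement scores count only if the output parsed."""
    gated = dict(scores)
    if scores["format_correct"] == 0.0:
        for name in ("to_correct", "cc_correct", "bcc_correct"):
            gated[name] = 0.0  # no valid JSON, no semantic credit
    return sum(WEIGHTS[name] * gated[name] for name in WEIGHTS)


unparsed = total_reward({"to_correct": 1.0, "cc_correct": 1.0, "bcc_correct": 1.0,
                         "format_correct": 0.0, "email_format": 0.0})
format_only = total_reward({"to_correct": 0.0, "cc_correct": 0.0, "bcc_correct": 0.0,
                            "format_correct": 1.0, "email_format": 1.0})
```

Solving format alone is worth only 0.10, but it is what makes the remaining 0.90 of the reward reachable, which matches the order in which the curves move.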

On single-turn, the trained model decisively outperforms GPT-4.1-mini on unseen test examples. To: 0.822 vs 0.540. CC: 0.695 vs 0.390. Overall reward: 0.710 vs 0.486. A much smaller model, trained for under two hours, exceeded a frontier API model on the narrow task it was trained for. This is a strong example of the RLVR thesis: targeted training with a clear reward signal can produce specialist models that beat stronger generalist models.

On three-turn, GPT-4.1-mini wins (0.537 vs 0.449). The gap is driven primarily by format collapse: the trained model drops to 71% format compliance by turn 3, while GPT maintains perfect format throughout. On valid turns, the trained model's placement quality carries over from single-turn. The three-turn result is not a failure of the approach; it isolates the remaining problem. Recipient placement improved, but the model cannot reliably preserve the output contract across turns it was never trained on.

GPT-4.1-mini's main failure mode is the "all-in-To" pattern: it dumps everyone into To and leaves CC empty. The trained model learned the distinction between "must act" (To) and "kept in the loop" (CC). In examples where the trained model scores above 0.9 while GPT scores below 0.5, the pattern is always the same: the trained model correctly separates one action owner into To and puts the rest in CC, while GPT puts everyone in To.

After training, the model also became more concise: thinking blocks dropped from ~3800 characters to ~760 characters (5x shorter). This was not explicitly rewarded. It is consistent with an indirect pressure: reward depends only on the final JSON, and shorter completions leave less room for drift.

What's next

What remains now is multi-turn format persistence. Phase B could be SFT on the dataset's ground-truth answers (full 3-turn conversations with clean JSON) to teach format persistence across turns. Phase C could then be GRPO with max_turns=2 and 3 to optimize placement on later turns. The single-turn routing "skill" should transfer.

Doltgres

I looked at a bunch of Postgres-based options for sandboxing and lightweight DB environments: containers, in-process builds, and things like PGlite. PGlite is Postgres compiled to WebAssembly: single process, quick, spins up instantly. Docker is slow and Postgres is big, so the question is what you can use that behaves like Postgres but is fast to spin up, fork, and throw away. Doltgres is native Go, which is nice. It is the one I found most interesting for agents and the one I am working on now. It fits nicely into the synthetic database problem.

Doltgres is a Postgres-wire server built on the Dolt engine. It lets Postgres clients connect normally while exposing Dolt's version-control model through SQL. The core idea is Git for database state: you can branch, diff, merge, and commit database changes, with row-level history underneath. Every transaction can be turned into a versioned commit. An agent needs a safe copy of the world it is about to change; that does not have to be a sandbox or a container, a branch may be enough. Doltgres is currently in pre-alpha, and I've heard talk of a proper release in late 2026, nice! It uses the same storage engine as Dolt.

Why it matters for agents

There are two interesting applications of Doltgres I see:

  • RL: forkable state, fast reset, isolated concurrent worlds, diffable outcomes, replayability. The database as a training environment.
  • Product layer: human-readable versioned history, debugging, benchmarking, artifacts. The database as an auditable record.

Doltgres handles both because branching is cheap (no VMs, no containers, just a logical fork inside the same process) and the history is rich (row-level diffs, commit messages, per-row timelines).  More to come.
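
To give the "branch per rollout" idea a shape, here is a sketch that only builds the SQL strings an environment might send over a normal Postgres connection. Everything in it is an assumption to check against the Doltgres docs: rollout_branch_sql is a hypothetical helper, and the statements mirror Dolt's dolt_checkout, dolt_commit, and dolt_branch procedures written as SELECT calls, which may differ in the version you run.

```python
def rollout_branch_sql(rollout_id, base="main"):
    """SQL for forking, committing, and discarding one rollout's private branch."""
    rid = int(rollout_id)  # int() keeps arbitrary text out of the SQL strings
    branch = f"rollout_{rid}"
    return {
        # setup_state: fork the world from the base branch
        "fork": [f"SELECT dolt_checkout('-b', '{branch}', '{base}');"],
        # end of rollout: snapshot whatever the agent changed
        "commit": [f"SELECT dolt_commit('-Am', 'rollout {rid}: final state');"],
        # cleanup: switch back and drop the branch
        "discard": [
            f"SELECT dolt_checkout('{base}');",
            f"SELECT dolt_branch('-D', '{branch}');",
        ],
    }


sql = rollout_branch_sql(7)
```

In a verifiers environment the fork would naturally live in setup_state and the discard in the cleanup hook, with the commit giving the reward function something to diff against.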
