What
Mailman is a deep dive into RL post-training on LLMs. Three environments, synthetic data generation, a custom RL trainer, and the infrastructure to run it all. Each environment is designed to test a different aspect of reinforcement learning on language: multi-turn reasoning, minimal verification, and structured output against real schemas. The environments plug into verifiers for scoring and t2f-trainer for GRPO training with async vLLM generation and NCCL weight sync.
The Stack
Three pieces: a model, an environment, and a trainer. The environment lives in verifiers, which builds all the machinery around it: state management, the multi-turn conversation loop, trajectory tracking, rubric scoring. The trainer can be anything that speaks the same protocol. I use t2f-trainer, a fork of verifiers-rl extracted into its own package. The connection point is env.generate(): the trainer calls it, verifiers drives the rollouts, and scored trajectories come back.
| Piece | What |
|---|---|
| model | HuggingFace model ID. Loaded into vLLM for inference and into transformers/deepspeed for training. Same weights, two copies. |
| environment | A verifiers environment: dataset + reward functions + conversation loop. Versioned Python package, importable, reproducible. |
| trainer | Orchestrates rollouts, computes advantages, runs the training loop, syncs weights back to vLLM via NCCL. |
Two GPUs on one node: one for inference (vLLM), one for training. After each training step, the trainer pushes updated weights to the inference server over NCCL so the next batch of rollouts uses the latest policy. Environments are versioned Python packages that can be installed, pulled for local editing, evaluated against any OpenAI-compatible endpoint, and pushed back with a version bump.
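The protocol at that connection point is small. A sketch of the trainer side, with argument names simplified and `dataset_batch` a stand-in for whatever the trainer feeds in; the exact verifiers signatures may differ:

```python
from openai import AsyncOpenAI

import verifiers as vf

# Load the versioned environment package by name.
env = vf.load_environment("email-to-cc-bcc")

# Channel A: verifiers reaches the vLLM server over its OpenAI-compatible API.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The single connection point: hand verifiers a client, get back scored
# trajectories (token IDs, log-probs, rewards) for the batch.
results = env.generate(
    dataset_batch,                       # prompts plus ground-truth answers
    client=client,
    model="Qwen/Qwen3-0.6B",
    sampling_args={"temperature": 1.0},
)
```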
Topology of a Training Step
One full cycle. The orchestrator prepares what the trainer needs: rollouts, advantages, microbatches. The trainer re-evaluates the same tokens under the current policy, computes the loss, and updates the weights. Then it syncs them to vLLM so the next batch generates with the new policy.
Channel A: Verifiers calls vLLM via OpenAI HTTP API for completions. Channel B: Trainer pushes updated weights to vLLM via NCCL.
The orchestrator lives inside the trainer process but runs in a background thread. It calls env.generate() with a vLLM client. Verifiers drives the multi-turn loop, calls vLLM for completions (channel A), scores the rollout with reward functions, and returns trajectories with token IDs, log-probs, and rewards. The orchestrator then computes group-relative advantages (reward minus group mean), packages everything into microbatches, and hands them to the training loop.
The trainer takes each microbatch, runs a forward pass on the same tokens to get fresh log-probs under the current policy, computes the importance ratio (trainer log-probs minus inference log-probs), multiplies by the advantage, and backpropagates. After processing all microbatches, one optimizer step updates the weights. Then NCCL broadcasts the new weights to vLLM (channel B) so the next batch of rollouts uses the updated policy. That is one step. Repeat.
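In code, one microbatch looks roughly like this. A minimal sketch assuming the orchestrator packaged `input_ids`, the inference-time log-probs, per-rollout advantages, and a completion mask; PPO-style clipping and other details are omitted:

```python
import torch

def grpo_microbatch_loss(model, batch):
    # Forward pass on the same tokens to get fresh log-probs under the
    # current policy (the rollout tokens came from the vLLM copy).
    logits = model(batch["input_ids"]).logits[:, :-1]
    logp = torch.log_softmax(logits, dim=-1)
    targets = batch["input_ids"][:, 1:].unsqueeze(-1)
    trainer_logp = logp.gather(-1, targets).squeeze(-1)

    # Importance ratio: how much the policy moved since generation.
    # batch["inference_logp"] is aligned with the shifted targets.
    ratio = torch.exp(trainer_logp - batch["inference_logp"])

    # Policy-gradient term: ratio times group-relative advantage, counted
    # only on completion tokens (prompt tokens are masked out).
    mask = batch["completion_mask"][:, 1:].float()
    per_token = -ratio * batch["advantages"].unsqueeze(-1) * mask
    return per_token.sum() / mask.sum()
```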
GRPO
Group Relative Policy Optimization. The model generates multiple rollouts for the same prompt. Each rollout gets a reward from the environment. The advantage is just the reward minus the group mean: rollouts that scored better than average get a positive advantage, rollouts that scored worse get a negative one. No separate reward model, no critic network. The model learns from its own successes and failures on the same problem.
A concrete example. The model tries the same prompt three times:
| Rollout | Answer | Reward | Advantage |
|---|---|---|---|
| Rollout 1 | a, b, c, d | 1.0 | +0.33 |
| Rollout 2 | a, b, d, c | 0.0 | -0.67 |
| Rollout 3 | a, b, c, d | 1.0 | +0.33 |
Average reward: 0.67. Rollout 2 scored below average, so its advantage is negative: the gradient pushes its log-probs down. Rollouts 1 and 3 scored above average, so their log-probs get pushed up. The model can look at what it did differently between rollout 1 and rollout 2 and adjust accordingly. That is the entire mechanism. The update signal is the importance ratio (how much the policy changed) times the advantage (whether the action was good or bad), masked to only the completion tokens.
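The same arithmetic, spelled out:

```python
rewards = [1.0, 0.0, 1.0]                 # three rollouts, same prompt
mean = sum(rewards) / len(rewards)        # group mean: 0.67
advantages = [r - mean for r in rewards]  # [+0.33, -0.67, +0.33]
```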
email-to-cc-bcc
GitHub · Dataset · Trained model · Environment
email-to-cc-bcc is an environment built to evaluate and train models on recipient placement via RLVR (GRPO), using the verifiers framework and t2f-trainer. Given seven people and an email thread, assign each person to To, CC, or BCC. The thread evolves across turns and recipients shift.
The thinking behind this environment is that email recipient placement is a good testbed for RL on language. It has a clear verifiable reward signal (set match against ground truth), structured output (JSON), and enough ambiguity in the CC/BCC split that a model has to reason about roles, sensitivity, and hierarchy rather than pattern match. There is no single correct answer that can be memorized. The model must read the email, understand who is being asked to act, who is being kept in the loop, and who should be hidden, then produce a JSON assignment. That reasoning is the thing being trained.
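Concretely, a turn has to end in something like this (the names and exact schema here are illustrative, not the environment's real fixtures):

```python
# A syntactically valid assignment for one turn. The reward compares each
# field as a set against the ground-truth labels.
assignment = {
    "to": ["dana.reyes@example.com"],                          # must act
    "cc": ["mark.liu@example.com", "priya.shah@example.com"],  # kept in the loop
    "bcc": ["audit.observer@example.com"],                     # hidden observer
}
```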
The dataset
The first version of the dataset (v1) used a random shuffle to assign recipients to To, CC, and BCC. The labels were not derivable from the email content. Running frontier models (GPT-5, Claude Sonnet 4.6, GPT-4.1-mini) against it confirmed the problem: they all hit the same ~0.44 ceiling. Looking at individual rollouts, the models were often more correct than the ground truth. Training on those labels would have pushed models away from good judgment and toward matching random assignments.
For v2 I rewrote the data generation pipeline. Recipient routing is now deterministic: the person being asked to act goes to To (based on role priority), managers and stakeholders go to CC (based on hierarchy), and compliance observers go to BCC (drawn only from a reserve pool, never from active participants). The LLM generating the email content never sees BCC names, so they cannot leak into the text. A post-generation validator rejects rows where any invariant is broken; ~5000 of 7500 generated rows passed validation. The environment supports 1-, 2-, and 3-turn rollouts as a form of curriculum: not increasing difficulty, but expanding the number of recipient reassignment decisions the model must handle across an evolving thread. Scoring uses the Jaccard index per field per turn, averaged across the rollout.
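A sketch of that scoring rule, assuming the per-field sets are already parsed out of each turn's JSON; the rubric then weights the per-field scores rather than averaging them flat:

```python
def jaccard(pred: set[str], truth: set[str]) -> float:
    # |intersection| / |union|; two empty sets count as a perfect match.
    if not pred and not truth:
        return 1.0
    return len(pred & truth) / len(pred | truth)

def field_score(turns: list[tuple[dict, dict]], field: str) -> float:
    # Per-field Jaccard on each turn, averaged across the rollout's turns.
    per_turn = [
        jaccard(set(pred[field]), set(truth[field])) for pred, truth in turns
    ]
    return sum(per_turn) / len(per_turn)
```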
The ceiling
On the v2 dataset with a reweighted rubric (To 0.45, CC 0.45, BCC 0.10), frontier models reach ~0.50 reward. The ceiling moved up from v1 because the labels are now learnable, but models still disagree with the ground truth on edge cases. I picked Qwen3-0.6B as the training target; it scored 0.253 at baseline.
Before training, targeted evaluations revealed the failure modes. On three-turn rollouts, the 0.6B model collapsed: 40% of turn-3 responses had no JSON at all (stuck in its thinking block). On single-turn, format was clean (100% valid JSON), but only 92% used email addresses as instructed.
| Model | Turns | Reward | To | CC |
|---|---|---|---|---|
| GPT-4.1-mini | 3 | 0.488 | 0.683 | 0.373 |
| GPT-4.1-mini | 1 | 0.470 | 0.593 | 0.425 |
| Qwen3-8B | 3 | 0.482 | 0.632 | 0.409 |
| Qwen3-8B | 1 | 0.429 | 0.539 | 0.384 |
| Qwen3-0.6B | 3 | 0.115 | 0.050 | 0.194 |
| Qwen3-0.6B | 1 | 0.191 | 0.124 | 0.282 |
For GPT-4.1-mini and Qwen3-8B, three-turn rollouts score higher than single-turn because later emails make the action owner more explicit. The 0.6B model shows the opposite: single-turn nearly doubles its reward because the model cannot survive the multi-turn format burden.
The approach
Start simple. Single-turn only to remove multi-turn format collapse. Add two small format rewards to the rubric: format_correct (binary, valid JSON with exactly the right keys) and email_format (fractional, proportion of recipients that are email addresses), each at 0.05 weight. The format reward is scaffolding: it occupies the first phase of training, then becomes invisible once solved.
| Function | Weight |
|---|---|
| to_correct | 0.40 |
| cc_correct | 0.40 |
| bcc_correct | 0.10 |
| format_correct | 0.05 |
| email_format | 0.05 |
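The two format rewards are simple enough to sketch. This assumes the final JSON string has already been extracted from the completion and that the key names mirror the task; the real functions live in the environment package:

```python
import json
import re

KEYS = {"to", "cc", "bcc"}  # assumed schema keys, per the task description

def format_correct(completion: str) -> float:
    # Binary: parses as JSON and has exactly the expected keys.
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(parsed, dict) and set(parsed) == KEYS else 0.0

def email_format(completion: str) -> float:
    # Fractional: proportion of listed recipients that look like email
    # addresses. Assumes list values, as the schema requires.
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    recipients = [r for key in KEYS for r in parsed.get(key, [])]
    if not recipients:
        return 0.0
    pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    return sum(bool(pattern.match(r)) for r in recipients) / len(recipients)
```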
Training: GRPO with LoRA on Qwen3-0.6B. Two A6000 48GB GPUs (one inference via t2f-vllm, one trainer). 100 steps, 512 rollouts per step, lr=1e-5. Took 1h52m.
Results
Evaluated on a 50-example sample from a 91-example held-out test split generated independently from the training data.
Single-turn (held-out test set).
| Model | Reward | To | CC | Format |
|---|---|---|---|---|
| Qwen3-0.6B (base) | 0.234 | 0.081 | 0.245 | 1.000 |
| Qwen3-0.6B (trained) | 0.710 | 0.822 | 0.695 | 1.000 |
| GPT-4.1-mini | 0.486 | 0.540 | 0.390 | 1.000 |
Three-turn (held-out test set).
| Model | Reward | To | CC | Format |
|---|---|---|---|---|
| Qwen3-0.6B (base) | 0.188 | 0.051 | 0.204 | 0.807 |
| Qwen3-0.6B (trained) | 0.449 | 0.529 | 0.406 | 0.709 |
| GPT-4.1-mini | 0.537 | 0.624 | 0.434 | 1.000 |
The training had three phases. In the first ~20 steps, the model learned formatting: format_correct went from 0.18 to 1.0. Schema validity acts as a gate on the rest of the reward: To/CC/BCC score zero unless the output parses as valid JSON. So the model had to solve format first to unlock any semantic credit. In steps 20-50, with format solved, reward shifted to recipient placement. In steps 50-100, To and CC continued climbing steadily.
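That gate is visible in the shape of a placement reward. A sketch reusing the `jaccard` helper from the dataset section; `parse_final_json` is a hypothetical stand-in for the environment's parser:

```python
def to_correct(completion: str, answer: dict) -> float:
    parsed = parse_final_json(completion)  # hypothetical: None if JSON invalid
    if parsed is None:
        return 0.0  # the gate: no semantic credit until the output parses
    return jaccard(set(parsed.get("to", [])), set(answer["to"]))
```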
On single-turn, the trained model decisively outperforms GPT-4.1-mini on unseen test examples. To: 0.822 vs 0.540. CC: 0.695 vs 0.390. Overall reward: 0.710 vs 0.486. A much smaller model, trained for under two hours, exceeded a frontier API model on the narrow task it was trained for. This is a strong example of the RLVR thesis: targeted training with a clear reward signal can produce specialist models that beat stronger generalist models.
On three-turn, GPT-4.1-mini wins (0.537 vs 0.449). The gap is driven primarily by format collapse: the trained model drops to 71% format compliance by turn 3, while GPT maintains perfect format throughout. On valid turns, the trained model's placement quality carries over from single-turn. The three-turn result is not a failure of the approach; it isolates the remaining problem. Recipient placement improved, but the model cannot reliably preserve the output contract across turns it was never trained on.
GPT-4.1-mini's main failure mode is the "all-in-To" pattern: it dumps everyone into To and leaves CC empty. The trained model learned the distinction between "must act" (To) and "kept in the loop" (CC). In examples where the trained model scores above 0.9 while GPT scores below 0.5, the pattern is always the same: the trained model correctly separates one action owner into To and puts the rest in CC, while GPT puts everyone in To.
After training, the model also became more concise: thinking blocks dropped from ~3800 characters to ~760 characters (5x shorter). This was not explicitly rewarded. It is consistent with an indirect pressure: reward depends only on the final JSON, and shorter completions leave less room for drift.
What's next
What remains is not recipient placement but multi-turn format persistence. Phase B: SFT on the dataset's ground-truth answers (full 3-turn conversations with clean JSON) to teach format persistence across turns. Phase C: GRPO on max_turns=2 and 3 to optimize placement on later turns. The single-turn routing skill should transfer; the model just needs to learn to maintain it.