Adam Sioud

Mailman
RL Post-Training on LLMs
Last edited 14 March 2026

Environment I

Email-to-CC-BCC: multi-turn recipient assignment from inbox context

Environment II

Alphabet Sort: sequence ordering as a minimal verification testbed

Environment III

SQL from context: query generation against synthetic databases

Submarined

Data generation, synthetic databases, RL trainers, and the infrastructure underneath


Mailman is the working name for a set of reinforcement-learning environments, training infrastructure, and data-generation pipelines for post-training large language models. The project spans three concrete environments, each designed to exercise a different aspect of RL on language models: multi-turn reasoning, minimal verification, and structured output against database schemas.

Environment I

Email-to-CC-BCC

The model reads an email thread and assigns recipients to To, CC, or BCC. Multi-turn, ground-truth verifiable, built on synthetic data from DataDesigner.
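To make "ground-truth verifiable" concrete, a minimal reward function for this task could compare the model's assignment against the known recipient lists. The sketch below is illustrative only: the JSON output format, the field names, and the partial-credit scheme are assumptions, not the environment's actual contract.

```python
import json

def recipient_reward(completion: str, truth: dict[str, set[str]]) -> float:
    """Score a completion by per-field match against the ground-truth assignment.

    `truth` maps each field ("to", "cc", "bcc") to the correct set of addresses.
    """
    try:
        predicted = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no reward
    fields = ("to", "cc", "bcc")
    correct = sum(set(predicted.get(f, [])) == truth[f] for f in fields)
    return correct / len(fields)  # partial credit per correctly assigned field
```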

Environment II

Alphabet Sort

Sort a shuffled list of words alphabetically. A minimal environment where correctness is trivially checkable and the reward signal is clean.
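Because the target is just the sorted list, the verifier can be a few lines. The sketch below assumes whitespace-separated words and an exact-match, case-insensitive ordering check; that is one reasonable scoring rule, not necessarily the project's.

```python
def alphabet_sort_reward(completion: str, shuffled_words: list[str]) -> float:
    """Return 1.0 if the completion lists the words in alphabetical order, else 0.0."""
    predicted = completion.strip().split()
    expected = sorted(shuffled_words, key=str.lower)
    return 1.0 if predicted == expected else 0.0

# Example: a correct completion earns the full reward.
assert alphabet_sort_reward("apple kiwi zebra", ["zebra", "apple", "kiwi"]) == 1.0
```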

Environment III

SQL from Context

Generate SQL queries against synthetic database schemas. Tests structured output, schema comprehension, and execution-verified correctness.
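Execution-verified correctness typically means running both the generated query and a reference query against the same database and comparing result sets. A rough sketch against SQLite follows; the order-insensitive comparison policy and error handling are assumptions for illustration.

```python
import sqlite3

def sql_execution_reward(db_path: str, generated_sql: str, reference_sql: str) -> float:
    """Return 1.0 if both queries produce identical result sets, else 0.0."""
    with sqlite3.connect(db_path) as conn:
        try:
            got = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return 0.0  # invalid SQL fails verification
        expected = conn.execute(reference_sql).fetchall()
    # Compare as multisets so row order does not affect the score.
    return 1.0 if sorted(map(repr, got)) == sorted(map(repr, expected)) else 0.0
```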

Each environment plugs into the same training stack: Verifiers for the environment abstraction and reward scoring, and a custom RL trainer (t2f-trainer) implementing GRPO with async vLLM generation and NCCL weight synchronization.
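The group-relative part of GRPO is the advantage estimate: several completions are sampled per prompt, scored by the environment, and each reward is normalized against the mean and standard deviation of its own group. The following is a generic sketch of that computation, not t2f-trainer's actual code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -> per-completion advantages, same shape."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.5, 0.5, 1.0, 0.0]])
advantages = grpo_advantages(rewards)  # above-average completions get positive advantage
```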

The project also covers synthetic data generation, building the databases and datasets the environments train against, and the operational infrastructure needed to run RL training at scale: tmux orchestration, GPU allocation, weight broadcasting, and observability through Weights & Biases.
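Weight broadcasting is the piece NCCL handles: after each optimizer step, the trainer's updated parameters have to reach the generation workers. As an illustration only, one common pattern is for every process to join a shared process group and broadcast parameters from the trainer rank; how vLLM ingests the new weights on the receiving side is omitted here.

```python
import torch
import torch.distributed as dist

def broadcast_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Broadcast every parameter tensor from the trainer rank to all other ranks."""
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank)

# Each process (trainer and generation workers) would first join the same group:
#   dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
# and the trainer would call broadcast_weights(policy_model) after each update.
```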
