Environment I. Email-to-CC-BCC: multi-turn recipient assignment from inbox context
Environment II. Alphabet Sort: sequence ordering as a minimal verification testbed
Environment III. SQL from Context: query generation against synthetic databases
Submarined. Data generation, synthetic databases, RL trainers, and the infrastructure underneath
Mailman is the working name for a set of reinforcement-learning environments, training infrastructure, and data-generation pipelines for post-training large language models. The project spans three concrete environments, each designed to test a different aspect of RL on language: multi-turn reasoning, minimal verification, and structured output against real schemas.
Email-to-CC-BCC
The model reads an email thread and assigns recipients to To, CC, or BCC. Multi-turn, ground-truth verifiable, built on synthetic data from DataDesigner.
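Ground-truth verifiability here presumably means comparing the model's To/CC/BCC assignment against a known-correct one. A minimal sketch of such a check, assuming a hypothetical `recipient_reward` helper and a dict-of-lists format for assignments (neither is confirmed by the project):

```python
def recipient_reward(predicted: dict, truth: dict) -> float:
    """Fraction of ground-truth recipients placed in the correct field.

    Both arguments are assumed (hypothetically) to look like
    {"to": [...], "cc": [...], "bcc": [...]}.
    """
    fields = ("to", "cc", "bcc")
    # Flatten each assignment into {recipient: field} for comparison.
    truth_map = {r: f for f in fields for r in truth.get(f, [])}
    pred_map = {r: f for f in fields for r in predicted.get(f, [])}
    if not truth_map:
        return 0.0
    correct = sum(1 for r, f in truth_map.items() if pred_map.get(r) == f)
    return correct / len(truth_map)
```

A per-recipient fraction (rather than all-or-nothing) gives partial credit across turns, which tends to produce a denser reward signal in multi-turn settings.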
Alphabet Sort
Sort a shuffled list of words alphabetically. A minimal environment where correctness is trivially checkable and the reward signal is clean.
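The "trivially checkable" property can be made concrete with a one-line verifier: compare the completion's word order against the sorted input. A sketch, assuming a whitespace-separated output format (an assumption, not the project's documented contract):

```python
def sort_reward(completion: str, words: list[str]) -> float:
    """Binary reward: 1.0 iff the completion lists the given words
    in alphabetical order, 0.0 otherwise."""
    predicted = completion.split()
    return 1.0 if predicted == sorted(words) else 0.0
```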
SQL from Context
Generate SQL queries against synthetic database schemas. Tests structured output, schema comprehension, and execution-verified correctness.
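Execution-verified correctness typically means running the generated query and comparing result sets rather than comparing SQL text. A minimal sketch using an in-memory SQLite database; the `sql_reward` name and its argument layout are illustrative, not the project's API:

```python
import sqlite3


def sql_reward(predicted_sql: str, gold_sql: str,
               schema_sql: str, rows_sql: str) -> float:
    """Execute both queries against a fresh in-memory database and
    compare result sets, order-insensitively."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)  # build the synthetic schema
    conn.executescript(rows_sql)    # populate it
    try:
        pred = sorted(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return 0.0  # invalid SQL earns no reward
    gold = sorted(conn.execute(gold_sql).fetchall())
    return 1.0 if pred == gold else 0.0
```

Execution-based scoring accepts any query that is semantically equivalent to the reference, e.g. with reordered columns in a WHERE clause, which string matching would wrongly penalize.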
Each environment plugs into the same training stack: Verifiers for the environment abstraction and reward scoring, and a custom RL trainer (t2f-trainer) implementing GRPO with async vLLM generation and NCCL weight synchronization.
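The core of GRPO is computing advantages relative to a group of completions sampled for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step (not t2f-trainer's actual code):

```python
import statistics


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: center each completion's reward on the
    group mean and scale by the group's standard deviation."""
    mean = statistics.fmean(rewards)
    # Uniform groups (zero std) carry no learning signal; avoid div-by-zero.
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```

In an async setup like the one described, completions come back from vLLM workers, are scored by the environment's verifier, normalized per group as above, and the resulting gradients update weights that are then re-broadcast to the generation workers over NCCL.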
The project also covers synthetic data generation, building the databases and datasets the environments train against, and the operational infrastructure for running RL training at scale: tmux orchestration, GPU allocation, weight broadcasting, and observability through Weights & Biases.