terminal2F
A lab-like project; the lab, you could say, where research on agents, systems, ML, and more will happen. A small research harness for running agents that can grow into a real research platform.
I do not want it to be too opinionated, so I provide primitives and build around a few methodologies I reckon are worth looking at: automata for modelling agents, and Bayesian methods for orchestration. As I say, research is the product; the model is the byproduct. terminal2F is made for doing research, and perhaps at some point for training LLMs too. It is also suitable for industry use cases.
Viewer
A big part of terminal2F is the viewer, for watching your experiments live or replaying them after. Rerun states: "Not a general visualization tool: We're specialized for physical, multimodal, time-series data." Well, I am using it for all kinds of setups, and I do think using it for agent observability is fruitful. All experiments stream to Rerun, where you get a timeline you can scrub through, zoom in on, and actually see what happened when things break. Watch it live, replay it after, share it. The data platform also supports remote storage, so you could have a shared catalog where multiple researchers publish their experiments and others can browse, replay, and compare them. Like a shared lab notebook, but for agent experiments. And because Rerun supports live streaming, terminal2F is not just for research: the same observability stack that helps you debug experiments can monitor deployed agent systems in real time.
Data Model
The data platform underneath is built on Rerun's catalog, which uses DataFusion. Experiments, runs, episodes, and metrics are all stored and indexed locally. Because DataFusion sits under the hood, you can query experiments with plain SQL, which I think is powerful:
-- compare runs by policy
SELECT run_id, policy, AVG(total_return)
FROM episodes
GROUP BY run_id, policy
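DataFusion speaks standard SQL, so the query above can be illustrated end to end with nothing but the standard library. Here sqlite3 stands in for the catalog purely for illustration (the real store is Rerun's catalog on DataFusion, not SQLite), and the `episodes` columns follow the query above:

```python
import sqlite3

# In-memory stand-in for the episodes table from the query above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE episodes (run_id TEXT, policy TEXT, total_return REAL)")
con.executemany(
    "INSERT INTO episodes VALUES (?, ?, ?)",
    [
        ("run-1", "FSM", 1.0),
        ("run-1", "FSM", 3.0),
        ("run-2", "PDA", 2.0),
    ],
)

# Same shape as the catalog query: average return per (run, policy).
rows = con.execute(
    "SELECT run_id, policy, AVG(total_return) "
    "FROM episodes GROUP BY run_id, policy ORDER BY run_id, policy"
).fetchall()
print(rows)  # [('run-1', 'FSM', 2.0), ('run-2', 'PDA', 2.0)]
```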
Benchmarking
With benchmarking, the scaffold around the model has a big impact. Some good sources to ground benchmarking work: Why Benchmarking is Hard (failure modes and reproducibility), Evals FAQ (practical tips), ATIF Trajectory Format (trajectory tracking).
Further reads: Will Ayd & Matt Topol - Practical Applications of Apache Arrow (PyData Virginia 2025), Machine learning data catalog: A 2026 in-depth look, Logging sucks.
Automata
My own work on P2Engine explored multi-agent systems from a systems level, looking at what makes them autonomous and reliable: orchestration, observability, and the memory architecture underneath. terminal2F takes these ideas forward, and continues the use of automata. The idea is that these systems could be made more reliable by being reasoned about, debugged, and rolled back, all enabled by modelling them as automata.
It was inspired by Erik's State Machines for Agents which gave me the engineering pattern: model agents as state machines on a shared clock. Four interaction types (LLM, Tool, Agent, User), each as paired request/response states. An interaction stack tracks state history. The stack top is the current state. Each tick, every agent executes its transition and pushes the result. Deterministic, debuggable, rollback-friendly.
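The pattern above can be sketched in a few lines. This is a minimal illustration of the idea, not Erik's code or terminal2F's actual implementation, and all names here are mine:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# Illustrative sketch of the state-machine pattern; names are hypothetical.

class Kind(Enum):
    LLM = auto()
    TOOL = auto()
    AGENT = auto()
    USER = auto()

@dataclass(frozen=True)
class State:
    kind: Kind
    is_request: bool  # each interaction type is a paired request/response

@dataclass
class Agent:
    # Interaction stack: full state history, top = current state.
    stack: list = field(default_factory=list)

    def tick(self) -> State:
        top = self.stack[-1] if self.stack else None
        # Toy transition: answer an open request, otherwise open a new LLM request.
        nxt = State(top.kind, False) if top and top.is_request else State(Kind.LLM, True)
        self.stack.append(nxt)
        return nxt

    def rollback(self) -> None:
        # Deterministic history makes rollback a stack pop.
        self.stack.pop()

# Shared clock: every agent takes exactly one transition per tick.
agents = [Agent(), Agent()]
for _ in range(3):
    for a in agents:
        a.tick()

print([(s.kind.name, s.is_request) for s in agents[0].stack])
# [('LLM', True), ('LLM', False), ('LLM', True)]
agents[1].rollback()
```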
After my master's thesis, I tried to read more about this pattern and stumbled upon Are Agents Probabilistic Automata? (Koohestani et al.), which formalizes it further: the memory architecture of an agent determines its computational class. Bounded context gives you a finite automaton (regular languages). A call stack from tool use gives you a pushdown automaton (context-free). Unbounded read/write memory gives you a Turing machine. These are formal expressiveness classes with real implications for what your agent can and cannot do, and for whether you can verify its behavior. As others have noted, both in automated research and in coding generally, in an era where generation is cheap, verification becomes the foundation of trust. Autoformalization is accelerating, and I think the Koohestani methodology lays good groundwork for that.
In terminal2F, the agent is just a chat completion call. The automaton is where the architecture lives. terminal2F makes a clean split: the Agent calls the model, the Runner owns memory and control policy. agent.act() is the atomic unit, one API call, one response. The runner decides what to do with it: execute tools, append to history, loop again, or stop. Tools are capability on the agent, permission on the runner.
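The split can be sketched as follows. This is an illustrative reading of the design, not the real terminal2F code, and every name here is hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch of the Agent/Runner split; not the actual terminal2F code.

@dataclass
class Agent:
    model: Callable[[list[dict]], str]  # stand-in for a chat completion call

    def act(self, messages: list[dict]) -> str:
        # The atomic unit: one API call, one response. No memory, no loop.
        return self.model(messages)

class LoopRunner:
    """Baseline LOOP-style runner: full chat history as memory."""

    def __init__(self, agent: Agent, max_turns: int = 4):
        self.agent = agent
        self.history: list[dict] = []  # the runner, not the agent, owns memory
        self.max_turns = max_turns

    def run(self, task: str) -> str:
        self.history.append({"role": "user", "content": task})
        for _ in range(self.max_turns):
            reply = self.agent.act(self.history)
            self.history.append({"role": "assistant", "content": reply})
            if "DONE" in reply:  # control policy lives in the runner
                break
        return self.history[-1]["content"]

# Fake model so the sketch runs without an API key.
fake = Agent(model=lambda msgs: f"step {len(msgs)} DONE")
print(LoopRunner(fake).run("compare runners"))  # step 1 DONE
```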
The runners implemented map directly to the hierarchy from the paper:
- LOOP: your typical agent loop. Full chat history as memory. The baseline.
- FSM: explicit state machine with bounded context (k=3). Transitions via state enum.
- PDA: stack-top driven. No explicit state variable, the interaction stack is the pushdown store. Full history rendered for context.
- LBA: PDA plus a bounded read/write scratchpad (16 slots). Linear-bounded.
- TM: PDA plus unbounded read/write memory. Full Turing machine.
Same agent, different runner. The goal is to run the same task through different automata and compare how they behave.
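The memory architecture behind each runner class can be sketched in a few lines, following the hierarchy above rather than terminal2F's actual data structures (all names here are illustrative):

```python
from collections import deque

# Illustrative memory structures for each runner class (not the real code).

loop_memory = []              # LOOP: full, append-only chat history
fsm_memory = deque(maxlen=3)  # FSM: bounded context, k=3 -> finite automaton
pda_memory = []               # PDA: interaction stack as the pushdown store
lba_memory = [None] * 16      # LBA: bounded read/write scratchpad, 16 slots
tm_memory = {}                # TM: unbounded read/write store -> Turing machine

# Bounded context forgets: only the last k=3 items survive.
for i in range(5):
    fsm_memory.append(i)
print(list(fsm_memory))  # [2, 3, 4]

# The PDA is driven by the stack top.
pda_memory.append("tool_request")
top = pda_memory[-1]

# The TM can address any cell by key.
tm_memory["plan"] = "compare runners"
```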
Bayesian
Formulating agents and multi-agent systems as automata, as in the previous section, gave us a way to reason about them, debug them, roll them back when mistakes are made, and verify their behavior. Bayesian methods add a decision layer on top: a controller maintains a belief state over task-relevant latent variables and selects actions by maximizing posterior expected utility, or by comparing the value of information against its cost.
This framing comes from the paper Position: Agentic AI Systems should be making Bayes-Consistent Decisions (Papamarkou et al., 2026), which motivates the Bayesian approach explored in terminal2F. The position they present: LLMs do not need to be explicitly Bayesian, but the AI systems that orchestrate them should make decisions in a way that is at least consistent with Bayesian reasoning. Token-level uncertainty is not decision-relevant uncertainty. A model can be confident at the next-token level while the system is still uncertain about whether the code will pass or which hypothesis is correct. That is why this layer belongs at the orchestration level, and it is exactly the kind of thing terminal2F is built to explore.
Bayesian Template for Agentic Orchestration
Maintain a belief state over low-dimensional, decision-relevant latent variables, update it from tool and agent outputs, and select actions by posterior expected utility or value-of-information criteria.
1. Task-level belief factoring
2. Messages as observations
3. Reliability-weighted updates
4. Dependence-aware evidence pooling
5. Utility-based and information-based control
6. Cross-task posteriors for routing
7. Belief-state distillation
By following the Bayesian template for agentic orchestration, terminal2F implements a lightweight Bayesian controller that sits between the automata runners and the environment. The controller maintains a posterior over task outcomes or agent reliability, updates from rollout results, and decides what to run next.
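A Beta-Bernoulli belief over per-runner success probability is about the smallest controller that fits this description. This is a sketch under the assumption that rollouts return pass/fail, with Thompson sampling as the control rule; the names are mine, not terminal2F's:

```python
import random

# Minimal Bayesian controller sketch: Beta-Bernoulli belief per runner,
# updated from pass/fail rollouts, next runner chosen by Thompson sampling.

class BetaBernoulliController:
    def __init__(self, arms):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per arm.
        self.belief = {arm: [1, 1] for arm in arms}

    def update(self, arm, passed):
        # Conjugate update: a success bumps alpha, a failure bumps beta.
        self.belief[arm][0 if passed else 1] += 1

    def choose(self, rng):
        # Thompson sampling: draw from each posterior, run the best draw.
        return max(self.belief, key=lambda a: rng.betavariate(*self.belief[a]))

    def mean(self, arm):
        a, b = self.belief[arm]
        return a / (a + b)

rng = random.Random(0)
ctl = BetaBernoulliController(["FSM", "PDA", "TM"])
ctl.update("PDA", passed=True)
ctl.update("PDA", passed=True)
ctl.update("FSM", passed=False)
print(ctl.mean("PDA"))  # 0.75  (3 successes out of 4 pseudo-trials)
print(ctl.choose(rng))
```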
VLM
Another natural use case for terminal2F is VLM/VLA, especially given how tightly it is integrated with Rerun.
Ab Extra
Information Theory and Alignment
I'm very eager to test my ideas around information theory and alignment, which has been a major interest of mine for over a year. terminal2F feels like exactly the kind of lab and testbed that could make this research possible, fun, and actually yield something.
Right now, a lot of alignment work is focused on the LLM itself, when most of what is actually deployed is more akin to a whole system. The model is one part, but so are the tools, memory, routing, shared artifacts, and the other agents around it. In a multi-agent setting, alignment may depend less on telling agents what rules to follow, and more on how information is structured, exposed, and allowed to propagate through the system. Who knows what, when they know it, what stays hidden, and what gets written into shared space all shape what kinds of cooperation, manipulation, or coordination can emerge.
That is why information design feels so interesting to me. Instead of trying to force human ideas of teamwork onto agents, I am more interested in studying the conditions that make useful collective behavior emerge in the first place. A shared artifact store, limited visibility between agents, asymmetric memory access, or blackboard-style coordination are part of the policy layer underneath the policy layer. They define the affordances of the system.
So the deeper question, to me, is not only how to align agents, but how to build information structures where better alignment is more likely to arise. Things like OpenClaw are already here and being deployed at scale, but the research on how information structures shape alignment in these systems has barely started. If we can study those affordances at the research level now, we can learn what kinds of information structures yield better behavior once deployed in larger systems.
One small project example I would like to explore: two or more agents are given a task, and they can see each other working on it through a shared space, a blackboard. One agent can see what the other is writing, offer hints, build on it, or even erase and rewrite. The other agent can do the same. No direct messaging, just the shared artifact. The question then is what emerges from that structure: cooperation, competition, or something else entirely. What gets preserved, what gets distorted, and how does the information design itself shape the outcome. This is close to what is called stigmergy, indirect coordination through modifying a shared environment, and connects to mechanism design and information design in game theory where the designer/researcher (me) controls the information structure, not the agents' actions, and studies what equilibria emerge. Very interesting.
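The blackboard setup can be prototyped with almost no machinery. In this sketch the agents are placeholder functions rather than models, and every name is hypothetical:

```python
# Tiny stigmergy prototype: two agents coordinate only via a shared blackboard.
# Agent behaviors are placeholder functions standing in for real agents.

def writer(board):
    # Proposes a draft if the board is empty, otherwise leaves it alone.
    if "draft" not in board:
        board["draft"] = "sort the list with quicksort"

def editor(board):
    # Can read, build on, or erase/rewrite what the other agent wrote.
    if "draft" in board and "hint" not in board:
        board["hint"] = "consider mergesort for stability"

blackboard = {}     # the only channel; no direct messaging
for _ in range(3):  # shared clock: each tick, every agent reads then writes
    for agent in (writer, editor):
        agent(blackboard)

print(blackboard)
```

Swapping the placeholder functions for model-backed agents keeps the structure intact: the experiment then varies only the information design (what each agent can see and overwrite) and observes what emerges.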
So the interest is in whether the underlying issue is really information design itself: the network, the information flow, and the structure of the system, rather than alignment policies added on top. Models change constantly, so a top-down alignment approach tuned to one model may not hold up when the model is swapped out or the system is deployed into a different context. The information structure around the agents is more durable than any single model inside it. I find this very interesting, and it feels relevant as another example of a research experiment that can actually be done with terminal2F, one that showcases what it can do.
Some people and work I have taken notice of: Lewis Hammond: Multiagent Risks from Advanced AI (HAAISS 2025), Lewis Hammond, I Figured Out How to Engineer Emergence, ACS, Divya Siddarth, Eval awareness.