Language, Automata and Recognizers
The word automaton comes from the Greek automatos, "self-moving."
Foundations. This first entry to [recognizers] lays down the fundamentals the rest of the project will build on: definitions, basic intuitions, and notation, moving through the topics from languages to machines.
I. Language
An alphabet is a finite set of atomic symbols. Atomic means indivisible, the smallest unit to work with.
A = {a, b, c, ..., x, y, z}
B = {0, 1}
C = {a, b}
D = {dog, and, cats}
A string is a sequence of symbols from an alphabet. We write w for a string and ε for the empty string (some texts use λ instead). A string is also often called a word or a sentence.
elephants is a string over A
101010 is a string over B
abba is a string over C
dog and cats is a string over D
A language is a set of legal strings over some alphabet. A formal language is one where there is an unambiguous way of determining whether a string is or is not a member.
L = {0, 1, 10, 11, 100, 101} a language over B
Now what defines legal? That's the grammar. A grammar defines a language: the rules that specify how symbols of an alphabet can be joined to form valid strings. It helps the language designer clarify and communicate the structure.
Syntax is the structure of a language. Get the syntax wrong and you get no meaning, or the wrong meaning; meaning is the territory of semantics.
We encode grammars in different ways. Perhaps the most common one is Backus-Naur Form (BNF). Every rule has the structure:
Name ::= expansion
The symbol ::= means "may expand into" or "may be replaced with." A name is also called a non-terminal. In some texts every name is surrounded by angle brackets <>. An expansion is an expression containing terminal and non-terminal symbols, joined together by sequencing and choice. A terminal is a literal like "+" or "function", one of the base tokens of the language. Non-terminals represent pieces of the structure. Simply juxtaposing expressions indicates sequencing. A vertical bar | indicates choice.
Productions, or production rules, are the rules that make up the grammar. They translate a non-terminal to a sequence of one or more non-terminals or terminals. Here is a grammar for balanced parentheses:
balanced ::= "()"
| "()" balanced
| "(" balanced ")"
| "(" balanced ")" balanced
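To see how a grammar defines a language, we can mechanically expand its productions. A small sketch of that idea (the names PRODUCTIONS and generate are mine, not any standard API): starting from the start symbol, replace the leftmost non-terminal with each production until only terminals remain.

```python
from collections import deque

# The balanced-parentheses productions above, with "B" as the
# non-terminal balanced.
PRODUCTIONS = [
    ("(", ")"),
    ("(", ")", "B"),
    ("(", "B", ")"),
    ("(", "B", ")", "B"),
]

def generate(max_len=6):
    """Breadth-first leftmost expansion of the start symbol: every
    fully expanded form is a string of the language."""
    results = set()
    queue = deque([("B",)])
    while queue:
        form = queue.popleft()
        if len(form) > max_len:
            continue  # bound the search so it terminates
        if "B" not in form:
            results.add("".join(form))
            continue
        i = form.index("B")  # leftmost non-terminal
        for prod in PRODUCTIONS:
            queue.append(form[:i] + prod + form[i + 1:])
    return sorted(results)
```

generate() yields every balanced string of length at most six, for example "()", "(())", and "()()".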
By convention, terminals are lower case, non-terminals are upper case, and S is the start symbol:
T = {a, b} terminals
N = {S, A, B} non-terminals
S = start symbol
The language of languages Matt Might
Languages and Formal Grammars Prof Ross
II. Automata
Now that we have a basic view of language, grammar, and what some of that entails in computer science, we can go into automata, which are directly connected to languages. Chomsky is the one who formalized this connection, classifying languages by the type of machine needed to recognize them.
I like to say: a formal language is a language we can distinguish from others by defining a machine that either accepts its strings or rejects them. The grammar is the set of rules that makes the machine accept a string of the language, or reject it. Grammars are implemented as rules in the machine, and these machines are also known as automata.
I'm going to be a little sparse here and just write a bit about different parts of automata theory. But there are plenty of sources on this, even Chomsky himself talking about it. There is also a video connecting automata, grammars, and languages: a tight one, but you get the full view there.
An automaton is usually drawn as a directed graph. The circles are states (nodes). The arrows are transitions (directed edges). Directed means each arrow has a direction from a source state to a destination state, so you cannot assume you can travel both ways.
Labels belong on transitions. A transition label is the input symbol that must be read to follow that edge. Finite state means there are finitely many states. It does not mean the number of possible input strings is finite. The automaton can still process strings of any length, because you can traverse edges repeatedly in cycles.
One more essential piece to make it a working machine is the start state and the accepting (final) states. The start state is where you begin before reading input. Accepting states are marked specially (often with a double circle). A string is accepted if, after consuming the entire string, the machine can end in an accepting state. For a DFA there is exactly one path determined by the input. For an NFA there may be multiple possible paths, and the string is accepted if at least one path ends in an accepting state.
The automaton above is a DFA, a Deterministic Finite Automaton. Deterministic: for a given state and input symbol, there is exactly one transition; as you can see stepping through the diagram, each symbol leads to exactly one next state. An NFA, a Non-deterministic Finite Automaton, is the opposite: for a given state and input symbol, there can be multiple possible transitions. The machine conceptually explores all paths at once. You might walk through the machine on one path and not end up accepting, but as long as there is some route that accepts the string, we say the string is accepted.
Here is a base DFA machine implemented in Python, and a recognizer built on it. The DFA class handles the loop: consume a token, dispatch to the current state method, check if we halted or recognized. A recognizer subclasses it and defines the states.
END = object()  # sentinel appended after the last input token

class DFA:
    def __init__(self, internal='start'):
        self.internal = internal
        self.halted = False
        self.recognized = False

    def transitionTo(self, state):
        self.internal = state
        return self

    def recognize(self):
        self.recognized = True
        return self

    def consume(self, token):
        # dispatch to the method named after the current state
        return getattr(self, self.internal)(token)

    @classmethod
    def evaluate(cls, string):
        cur = cls()
        for ch in string:
            cur = cur.consume(ch)
            if cur is None or cur.halted:
                return False
        cur = cur.consume(END)
        return bool(cur and cur.recognized)

class Binary(DFA):
    def start(self, token):
        if token == '0':
            return self.transitionTo('zero')
        if token == '1':
            return self.transitionTo('oneOrMore')

    def zero(self, token):
        if token == END:
            return self.recognize()

    def oneOrMore(self, token):
        if token in ('0', '1'):
            return self.transitionTo('oneOrMore')
        if token == END:
            return self.recognize()
Binary accepts valid binary numbers. start branches on the first digit, zero only accepts a lone "0" (no leading zeros), and oneOrMore loops on any digit until END. You can find the full code and other examples at recognizers/foundations.py.
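The same machine can also be encoded as data instead of methods. A hedged sketch: the state-method dispatch above becomes a dict keyed on (state, symbol), where a missing entry means halt and reject (the names TABLE, ACCEPTING and evaluate here are illustrative, not part of the DFA class):

```python
# The binary-number DFA as a transition table:
# (state, symbol) -> next state. Missing entries mean reject.
TABLE = {
    ('start', '0'): 'zero',
    ('start', '1'): 'oneOrMore',
    ('oneOrMore', '0'): 'oneOrMore',
    ('oneOrMore', '1'): 'oneOrMore',
}
ACCEPTING = {'zero', 'oneOrMore'}  # states that accept at end of input

def evaluate(string):
    state = 'start'
    for ch in string:
        state = TABLE.get((state, ch))
        if state is None:
            return False  # no transition defined: reject
    return state in ACCEPTING
```

evaluate("101") is True while evaluate("01") is False: the zero state has no outgoing transitions, so a leading zero dead-ends.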
A DFA accepts a regular language, for example the set of valid binary numbers as we saw above. Regular languages have something cool called the regular operations: union, concatenation, and star. Applying these operations to regular languages still produces a regular language: regular languages are closed under them.
Union ∪
A = {mango, orange}
B = {green, orange}
A ∪ B = {mango, green, orange}
Concatenation ·
A = {mango, orange}
B = {green, orange}
A · B = {mangogreen, mangoorange, orangegreen, orangeorange}
Star * (Kleene star)
Concatenate zero or more strings from the language.
L = {a}
L* = {ε, a, aa, aaa, ...}
A = {mango, orange}
A* = {ε, mango, orange, mangomango, mangoorange, orangemango, orangeorange, ...}
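These operations are easy to play with on finite languages as Python sets. A quick sketch; the star function only approximates L*, which is infinite, by bounding the number of repetitions:

```python
A = {"mango", "orange"}
B = {"green", "orange"}

union = A | B                           # {'mango', 'green', 'orange'}
concat = {x + y for x in A for y in B}  # every pairing, 4 strings

def star(lang, reps=2):
    """Bounded approximation of Kleene star: all concatenations of at
    most `reps` strings from lang (the real L* is infinite)."""
    out = {""}       # ε: the zero-strings case
    level = {""}
    for _ in range(reps):
        level = {w + x for w in level for x in lang}
        out |= level
    return out
```

star(A) contains ε, the single strings, and every two-string concatenation like "mangoorange".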
A side tangent, and something I found interesting and think will be very useful for my use of automata in coding projects and in [recognizers]: the idea of a weighted automaton. A weighted automaton (or weighted acceptor) assigns a value to a string based on the transitions you take while reading it. Each transition has a weight, and as you traverse a path from the start state to some accepting state, you combine the transition weights to get a path score.
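A minimal sketch of the idea, with made-up states, symbols, and weights; real weighted automata fix a semiring for combining weights, but here we simply multiply floats along the path:

```python
# Illustrative weighted acceptor: each transition carries a weight,
# and a run multiplies the weights along its path.
TRANSITIONS = {
    # (state, symbol): (next state, weight)
    ('start', 'a'): ('s1', 0.5),
    ('s1', 'b'): ('s1', 0.9),
    ('s1', 'c'): ('end', 0.2),
}
ACCEPTING = {'end'}

def score(string):
    """Path weight if the string is accepted, else None."""
    state, weight = 'start', 1.0
    for ch in string:
        step = TRANSITIONS.get((state, ch))
        if step is None:
            return None           # no transition: reject
        state, w = step
        weight *= w
    return weight if state in ACCEPTING else None

score("abc")  # 0.5 * 0.9 * 0.2 ≈ 0.09
```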
And on transitions: a transition defines how a computational model moves from one state to another based on reading specific input symbols, acting as an edge in a transition diagram or a row in a transition table. It is the fundamental mechanism driving the machine's behaviour, allowing it to process input strings.
In addition to regular languages, there are other types. A context-free language is defined by a context-free grammar, where every rule has a single non-terminal on the left side: the non-terminal can be expanded no matter where it appears. Next up is the context-sensitive grammar, where a rule depends on what surrounds the non-terminal. A rule might say "only expand A to XYZ when it appears between B and C." The context constrains when the rule applies.
And to prove that a language is not regular, we can use the pumping lemma. This is what I am going to use in the next entry, where I am building a JSON PDA parser.
PDA, what does that stand for? Pushdown automaton. A pushdown automaton is like a finite state machine plus a stack, and the stack acts as memory. The machine can read the next input symbol and, at the same time, look at the top of the stack and decide what to do next. Its actions can include pushing a symbol onto the stack, popping the top symbol, or leaving the stack unchanged while it moves to a new state.
The stack lets a PDA handle nesting and matching, like balanced parentheses or strings of the form 0ⁿ1ⁿ, which a plain finite automaton cannot do because it has no unbounded memory. So a PDA is an FSM plus a stack. Each step depends on the next input symbol (or ε) and the symbol on top of the stack. You push when you need to remember something, and later you pop to match it. For 0ⁿ1ⁿ: push one marker per 0, then pop one marker per 1, and accept only if you end with the stack back to its start marker.
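That recipe can be sketched with an explicit stack (the function name is mine, and a plain counter would work too; the stack makes the PDA mechanics visible):

```python
def zeros_then_ones(s):
    """Recognize 0^n 1^n (n >= 1) the way a PDA would:
    push a marker for every 0, pop one for every 1."""
    stack = []
    seen_one = False
    for ch in s:
        if ch == '0':
            if seen_one:
                return False       # a 0 after a 1: wrong order
            stack.append('X')      # remember this 0
        elif ch == '1':
            seen_one = True
            if not stack:
                return False       # more 1s than 0s
            stack.pop()            # match one remembered 0
        else:
            return False           # symbol outside the alphabet
    return seen_one and not stack  # every 0 matched, at least one pair
```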
class BalancedParentheses(PDA):
    def start(self, token):
        # membership test on a tuple, not a string: END is not a str,
        # and `token in '([{'` would raise on it
        if token in ('(', '[', '{'): return self.push(token)
        if token == ')' and self.top() == '(': return self.pop()
        if token == ']' and self.top() == '[': return self.pop()
        if token == '}' and self.top() == '{': return self.pop()
        if token == END and self.hasEmptyStack(): return self.recognize()
Single state, all logic in the stack. See an opener? Push. See a closer? Check the top matches, pop. Reach END with an empty stack? Accept.
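The PDA base class itself is elided here; the full version lives in recognizers/foundations.py. A minimal sketch in the same style as the DFA class above, with the balanced-parentheses recognizer repeated so the example runs standalone:

```python
END = object()  # end-of-input sentinel, as in the DFA above

class PDA:
    def __init__(self):
        self.internal = 'start'
        self.stack = []
        self.recognized = False

    def push(self, symbol):
        self.stack.append(symbol)
        return self

    def pop(self):
        self.stack.pop()
        return self

    def top(self):
        return self.stack[-1] if self.stack else None

    def hasEmptyStack(self):
        return not self.stack

    def recognize(self):
        self.recognized = True
        return self

    def consume(self, token):
        # dispatch to the method named after the current state
        return getattr(self, self.internal)(token)

    @classmethod
    def evaluate(cls, string):
        cur = cls()
        for ch in string:
            cur = cur.consume(ch)
            if cur is None:
                return False  # no rule matched: reject
        cur = cur.consume(END)
        return bool(cur and cur.recognized)

# The recognizer from above, repeated so this sketch runs standalone.
class BalancedParentheses(PDA):
    def start(self, token):
        if token in ('(', '[', '{'): return self.push(token)
        if token == ')' and self.top() == '(': return self.pop()
        if token == ']' and self.top() == '[': return self.pop()
        if token == '}' and self.top() == '{': return self.pop()
        if token == END and self.hasEmptyStack(): return self.recognize()
```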
The pushdown automaton is something that will come up very frequently in [recognizers], and was also a big part of my master's thesis.
A PDA is a machine. So is a DFA, an NFA, a Turing machine. Machine and automaton mean the same thing here. They all share the same bones: states, transitions, input, and rules for accepting or rejecting. What changes between them is what kind of memory they have and how powerful that memory makes them.
Programmers have taken inspiration from this theory, not just implementing automata directly, but borrowing the ideas of states and transitions as patterns in code.
Videos on YT Lydia
Pushdown Automata Chris Marriott
Pattern Matching and Recursion Reginald Braithwaite
A Brutal Look at Balanced Parentheses, Computing Machines, and Pushdown Automata Reginald Braithwaite
III. Coding Things Like States
So we have machines that accept or reject strings. But first, a simpler question: how do you write code whose behavior changes based on internal state? Take a color mixer with two buckets, red and blue. Here are four ways to wire it up.
a. Match
Branch on the current state. An enum labels each state, the class holds the data separately:
from enum import Enum, auto

class Color(Enum):
    RED = auto()
    BLUE = auto()

class Mixer:
    def __init__(self):
        self.state = Color.RED
        self.red = 0
        self.blue = 0

    def add(self, amount=25):
        match self.state:
            case Color.RED: self.red = min(255, self.red + amount)
            case Color.BLUE: self.blue = min(255, self.blue + amount)
        return self

    def switch(self):
        self.state = Color.BLUE if self.state == Color.RED else Color.RED
        return self

m = Mixer()
m.add().add().switch().add()  # [BLUE] r=50 b=25
The caller chains .add() and .switch(). The match block decides which value to bump. Both values persist across switches.
b. Transition table
The Mixer uses match: each state needs a case, and switch hard-codes the toggle. What if we put the machine in data instead? Two dicts: one for transitions, one for which attribute each phase controls. The methods never mention a specific color:
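A sketch of what that table-driven Mixer could look like. The names phase_next and phase_attr come from the description; the class body is my reconstruction of the shape:

```python
# The machine as data: the toggle and the bucket choice live in dicts.
phase_next = {'RED': 'BLUE', 'BLUE': 'RED'}  # which phase follows which
phase_attr = {'RED': 'red', 'BLUE': 'blue'}  # which bucket each phase fills

class Mixer:
    def __init__(self):
        self.phase = 'RED'
        self.red = 0
        self.blue = 0

    def add(self, amount=25):
        attr = phase_attr[self.phase]  # look up, don't branch
        setattr(self, attr, min(255, getattr(self, attr) + amount))
        return self

    def switch(self):
        self.phase = phase_next[self.phase]  # table, not an if/else
        return self
```

Mixer().add().add().switch().add() leaves red=50 and blue=25, just like the match version.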
No match anywhere. phase_next maps each phase to the next one. phase_attr maps each phase to the attribute it controls; add uses getattr/setattr to read and write it.
c. State design pattern
The previous patterns let you call anything in any order. The GoF State pattern can enforce a protocol: you must add colors in order (red, then green, then blue). Calling the current color keeps adding. Calling the next color advances. Going backwards is rejected:
class Red:
    def red(self, p, a): p.r = min(255, p.r + a)
    def green(self, p, a): p.g = min(255, p.g + a); p.state = Green()
    def blue(self, p, a): raise ValueError("add green first")

class Green:
    def red(self, p, a): raise ValueError("red phase done")
    def green(self, p, a): p.g = min(255, p.g + a)
    def blue(self, p, a): p.b = min(255, p.b + a); p.state = Blue()

class Blue:
    def red(self, p, a): raise ValueError("red phase done")
    def green(self, p, a): raise ValueError("green phase done")
    def blue(self, p, a): p.b = min(255, p.b + a)

class Palette:
    def __init__(self): self.r = self.g = self.b = 0; self.state = Red()
    def red(self, a=25): self.state.red(self, a); return self
    def green(self, a=25): self.state.green(self, a); return self
    def blue(self, a=25): self.state.blue(self, a); return self

p = Palette().red(50).red(20).green(30).blue(10)
# Palette().blue(30) → ValueError: add green first
Palette just delegates. Each state decides what's allowed: the current color adds, the next color advances, going backwards is rejected.
d. Coroutine
Where the generator is paused IS the state. The code structure enforces ordering (red, then green, then blue): you can't skip phases. send(amount) adds to the current channel. send(0) advances to the next phase:
from typing import Generator

def mix_protocol() -> Generator[str, int, None]:
    red = green = blue = 0
    amount = yield "phase: RED"
    while True:
        if amount == 0: break
        red = min(255, red + amount)
        amount = yield f"[RED] rgb({red},{green},{blue})"
    amount = yield "phase: GREEN"
    while True:
        if amount == 0: break
        green = min(255, green + amount)
        amount = yield f"[GREEN] rgb({red},{green},{blue})"
    amount = yield "phase: BLUE"
    while True:
        if amount == 0: break
        blue = min(255, blue + amount)
        amount = yield f"[BLUE] rgb({red},{green},{blue})"
    yield f"done: rgb({red},{green},{blue})"
No state variable, no enum, no table. The control flow is the state machine. And because the generator remembers where it's paused, you can feed it a flat stream of values — it routes each one to the right channel on its own:
def pour_stream(amounts):
    protocol = mix_protocol()
    next(protocol)
    result = ""
    for amount in amounts:
        result = protocol.send(amount)
    return result
pour_stream([50, 25, 0, 100, 0, 30, 0]) # done: rgb(75,100,30)
IV. Recognizers
We define a language by what recognizer is able to accept it. A regular language can be recognized by a DFA: we can construct a finite state machine that reads a string and decides whether it belongs to the language. The machine has a finite number of states, and for every input it halts with a yes or no.
Match (pattern, text) -> text | None
The term recognizer also lives inside the theory of Turing machines, where the definition loosens. A Turing machine recognizer must halt and accept on every input that is in the language. But on input that is not in the language, it is allowed to either halt and reject, or run forever. That running forever is the infinite loop case.
A decider is stronger. It must halt on every input. It accepts if the string is in the language and rejects if it is not. Every decider is a recognizer, but not every recognizer is a decider.
This project is more interested in recognizers in a broad sense, and especially in their DFA and PDA versions, which is what the rest of [recognizers] will build on.
So given a grammar and some source code, the recognizer built from that grammar tells you yes or no: the string belongs to the language or it does not. That is the whole job.
Now scope out to the bigger picture. If you can define a grammar for a behavior, like tool calling, you can use a recognizer as a reward signal to train an LLM. The model generates an action, the recognizer checks whether it is valid, and that yes/no feeds back as reward. RL or SFT, the grammar gives you a clean, automatic way to score whether the model got it right.
Recognizers also work at inference time. At each token step, the recognizer masks out invalid continuations so the LLM can only produce well-formed output. This is constrained decoding.
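A toy illustration of that idea, not a real decoding API: reuse a DFA transition table (here the binary-number machine again) and treat the symbols with a defined transition from the current state as the only legal next tokens. A real decoder would mask out everything else in the model's vocabulary at each step. The names DFA_TABLE, VOCAB, and allowed are all mine:

```python
# The binary-number DFA's transition table doubles as a token mask.
DFA_TABLE = {
    ('start', '0'): 'zero',
    ('start', '1'): 'oneOrMore',
    ('oneOrMore', '0'): 'oneOrMore',
    ('oneOrMore', '1'): 'oneOrMore',
}
VOCAB = ['0', '1', '2']  # a toy vocabulary

def allowed(state):
    """Symbols the recognizer permits next; everything else is masked."""
    return [t for t in VOCAB if (state, t) in DFA_TABLE]

allowed('start')  # ['0', '1'], so '2' is masked out
```

From the zero state nothing is allowed, which is exactly the no-leading-zeros rule enforced at generation time instead of at the end.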
The recognizer from this article is the same machinery running under both. That is where [recognizers] is heading.
Deciders, recognizers & Turing Machine sketches Professor Malik Magdon-Ismail