Language, Automata and Recognizers
The word automaton comes from the Greek automatos, "self-moving."
Foundations. This first entry to [recognizers] lays down the fundamentals the rest of the project will build on: definitions, basic intuitions, and notation, moving through the topics from languages to machines.
I. Language
An alphabet is a finite set of atomic symbols. Atomic means indivisible, the smallest unit to work with.
A = {a, b, c, ..., x, y, z}
B = {0, 1}
C = {a, b}
D = {dog, and, cats}
A string is a sequence of symbols from an alphabet. We write w for a string and ε for the empty string (some texts use λ instead). A string is also often called a word or a sentence.
elephants is a string over A
101010 is a string over B
abba is a string over C
dog and cats is a string over D
A language is a set of legal strings over some alphabet. A formal language is one where there is an unambiguous way of determining whether a string is or is not a member.
L = {0, 1, 10, 11, 100, 101} a language over B
Now what defines legal? That's the grammar. A grammar defines a language: the rules that specify how symbols of an alphabet can be joined to form valid strings. It helps the language designer clarify and communicate the structure.
Syntax is the structure of a language. Get the syntax wrong and you get no meaning, or the wrong meaning; meaning is the territory of semantics.
We encode grammars in different ways. Perhaps the most common one is Backus-Naur Form (BNF). Every rule has the structure:
Name ::= expansion
The symbol ::= means "may expand into" or "may be replaced with." A name is also called a non-terminal. In some texts every name is surrounded by angle brackets <>. An expansion is an expression containing terminal and non-terminal symbols, joined together by sequencing and choice. A terminal is a literal like "+" or "function", one of the base tokens of the language. Non-terminals represent pieces of the structure. Simply juxtaposing expressions indicates sequencing. A vertical bar | indicates choice.
Productions, or production rules, are the rules that make up the grammar. They translate a non-terminal to a sequence of one or more non-terminals or terminals. Here is a grammar for balanced parentheses:
balanced ::= "()"
| "()" balanced
| "(" balanced ")"
| "(" balanced ")" balanced
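To see how a grammar defines a language, we can mechanically expand its productions. A small sketch of that idea (the names PRODUCTIONS and generate are mine, not any standard API): starting from the start symbol, replace the leftmost non-terminal with each production until only terminals remain.

```python
from collections import deque

# The balanced-parentheses productions above, with "B" as the
# non-terminal balanced.
PRODUCTIONS = [
    ("(", ")"),
    ("(", ")", "B"),
    ("(", "B", ")"),
    ("(", "B", ")", "B"),
]

def generate(max_len=6):
    """Breadth-first leftmost expansion of the start symbol: every
    fully expanded form is a string of the language."""
    results = set()
    queue = deque([("B",)])
    while queue:
        form = queue.popleft()
        if len(form) > max_len:
            continue  # bound the search so it terminates
        if "B" not in form:
            results.add("".join(form))
            continue
        i = form.index("B")  # leftmost non-terminal
        for prod in PRODUCTIONS:
            queue.append(form[:i] + prod + form[i + 1:])
    return sorted(results)
```

generate() yields every balanced string of length at most six, for example "()", "(())", and "()()".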
By convention, terminals are lower case, non-terminals are upper case, and S is the start symbol:
T = {a, b} terminals
N = {S, A, B} non-terminals
S = start symbol
The language of languages Matt Might
Languages and Formal Grammars Prof Ross
II. Automata
Now that we have a basic view of language, grammar, and what some of that entails in computer science, we can go into automata, which are directly connected to languages. Chomsky is the one who formalized this connection, classifying languages by the type of machine needed to recognize them.
I like to say: a formal language is a language we can distinguish from others by defining a machine that either accepts its strings or rejects them. The grammar is the set of rules that makes the machine accept a string of the language, or reject it. Grammars are implemented as rules in the machine, and these machines are also known as automata.
I'm going to be a little sparse here and just write a bit about different parts of automata theory. But there are plenty of sources on this, even Chomsky himself talking about it. There is also a video connecting automata, grammars, and languages: a tight one, but you get the full view there.
An automaton is usually drawn as a directed graph. The circles are states (nodes). The arrows are transitions (directed edges). Directed means each arrow has a direction from a source state to a destination state, so you cannot assume you can travel both ways.
Labels belong on transitions. A transition label is the input symbol that must be read to follow that edge. Finite state means there are finitely many states. It does not mean the number of possible input strings is finite. The automaton can still process strings of any length, because you can traverse edges repeatedly in cycles.
One more essential piece to make it a working machine is the start state and the accepting (final) states. The start state is where you begin before reading input. Accepting states are marked specially (often with a double circle). A string is accepted if, after consuming the entire string, the machine can end in an accepting state. For a DFA there is exactly one path determined by the input. For an NFA there may be multiple possible paths, and the string is accepted if at least one path ends in an accepting state.
The automaton above is a DFA, a Deterministic Finite Automaton. Deterministic: for a given state and input symbol, there is exactly one transition; as you can see stepping through the diagram, each symbol leads to exactly one next state. An NFA, a Non-deterministic Finite Automaton, is the opposite: for a given state and input symbol, there can be multiple possible transitions. The machine conceptually explores all paths at once. You might walk through the machine on one path and not end up accepting, but as long as there is some route that accepts the string, we say the string is accepted.
Here is a base DFA machine implemented in Python, and a recognizer built on it. The DFA class handles the loop: consume a token, dispatch to the current state method, check if we halted or recognized. A recognizer subclasses it and defines the states.
END = object()  # sentinel appended after the last input token

class DFA:
    def __init__(self, internal='start'):
        self.internal = internal
        self.halted = False
        self.recognized = False

    def transitionTo(self, state):
        self.internal = state
        return self

    def recognize(self):
        self.recognized = True
        return self

    def consume(self, token):
        # dispatch to the method named after the current state
        return getattr(self, self.internal)(token)

    @classmethod
    def evaluate(cls, string):
        cur = cls()
        for ch in string:
            cur = cur.consume(ch)
            if cur is None or cur.halted:
                return False
        cur = cur.consume(END)
        return bool(cur and cur.recognized)

class Binary(DFA):
    def start(self, token):
        if token == '0':
            return self.transitionTo('zero')
        if token == '1':
            return self.transitionTo('oneOrMore')

    def zero(self, token):
        if token == END:
            return self.recognize()

    def oneOrMore(self, token):
        if token in ('0', '1'):
            return self.transitionTo('oneOrMore')
        if token == END:
            return self.recognize()
Binary accepts valid binary numbers. start branches on the first digit, zero only accepts a lone "0" (no leading zeros), and oneOrMore loops on any digit until END. You can find the full code and other examples at recognizers/foundations.py.
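The same machine can also be encoded as data instead of methods. A hedged sketch: the state-method dispatch above becomes a dict keyed on (state, symbol), where a missing entry means halt and reject (the names TABLE, ACCEPTING and evaluate here are illustrative, not part of the DFA class):

```python
# The binary-number DFA as a transition table:
# (state, symbol) -> next state. Missing entries mean reject.
TABLE = {
    ('start', '0'): 'zero',
    ('start', '1'): 'oneOrMore',
    ('oneOrMore', '0'): 'oneOrMore',
    ('oneOrMore', '1'): 'oneOrMore',
}
ACCEPTING = {'zero', 'oneOrMore'}  # states that accept at end of input

def evaluate(string):
    state = 'start'
    for ch in string:
        state = TABLE.get((state, ch))
        if state is None:
            return False  # no transition defined: reject
    return state in ACCEPTING
```

evaluate("101") is True while evaluate("01") is False: the zero state has no outgoing transitions, so a leading zero dead-ends.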
A DFA accepts a regular language, for example the set of valid binary numbers as we saw above. Regular languages have something cool called the regular operations: union, concatenation, and star. Applying these operations to regular languages still produces a regular language: regular languages are closed under them.
Union ∪
A = {mango, orange}
B = {green, orange}
A ∪ B = {mango, green, orange}
Concatenation ·
A = {mango, orange}
B = {green, orange}
A · B = {mangogreen, mangoorange, orangegreen, orangeorange}
Star * (Kleene star)
Concatenate zero or more strings from the language.
L = {a}
L* = {ε, a, aa, aaa, ...}
A = {mango, orange}
A* = {ε, mango, orange, mangomango, mangoorange, orangemango, orangeorange, ...}
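These operations are easy to play with on finite languages as Python sets. A quick sketch; the star function only approximates L*, which is infinite, by bounding the number of repetitions:

```python
A = {"mango", "orange"}
B = {"green", "orange"}

union = A | B                           # {'mango', 'green', 'orange'}
concat = {x + y for x in A for y in B}  # every pairing, 4 strings

def star(lang, reps=2):
    """Bounded approximation of Kleene star: all concatenations of at
    most `reps` strings from lang (the real L* is infinite)."""
    out = {""}       # ε: the zero-strings case
    level = {""}
    for _ in range(reps):
        level = {w + x for w in level for x in lang}
        out |= level
    return out
```

star(A) contains ε, the single strings, and every two-string concatenation like "mangoorange".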
A side tangent, and something I found interesting and think will be very useful for my use of automata in coding projects and in [recognizers]: the idea of a weighted automaton. A weighted automaton (or weighted acceptor) assigns a value to a string based on the transitions you take while reading it. Each transition has a weight, and as you traverse a path from the start state to some accepting state, you combine the transition weights to get a path score.
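A minimal sketch of the idea, with made-up states, symbols, and weights; real weighted automata fix a semiring for combining weights, but here we simply multiply floats along the path:

```python
# Illustrative weighted acceptor: each transition carries a weight,
# and a run multiplies the weights along its path.
TRANSITIONS = {
    # (state, symbol): (next state, weight)
    ('start', 'a'): ('s1', 0.5),
    ('s1', 'b'): ('s1', 0.9),
    ('s1', 'c'): ('end', 0.2),
}
ACCEPTING = {'end'}

def score(string):
    """Path weight if the string is accepted, else None."""
    state, weight = 'start', 1.0
    for ch in string:
        step = TRANSITIONS.get((state, ch))
        if step is None:
            return None           # no transition: reject
        state, w = step
        weight *= w
    return weight if state in ACCEPTING else None

score("abc")  # 0.5 * 0.9 * 0.2 ≈ 0.09
```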
And on transitions: a transition defines how a computational model moves from one state to another based on reading specific input symbols, acting as an edge in a transition diagram or a row in a transition table. It is the fundamental mechanism driving the machine's behaviour, allowing it to process input strings.
In addition to regular languages, there are other types. A context-free language is defined by a context-free grammar, where every rule has a single non-terminal on the left side: the non-terminal can be expanded no matter where it appears. Next up is the context-sensitive grammar, where a rule depends on what surrounds the non-terminal. A rule might say "only expand A to XYZ when it appears between B and C." The context constrains when the rule applies.
And to prove that a language is not regular, we can use the pumping lemma. This is what I am going to use in the next entry, where I am building a JSON PDA parser.
PDA, what does that stand for? Pushdown automaton. A pushdown automaton is like a finite state machine plus a stack, and the stack acts as memory. The machine can read the next input symbol and, at the same time, look at the top of the stack and decide what to do next. Its actions can include pushing a symbol onto the stack, popping the top symbol, or leaving the stack unchanged while it moves to a new state.
The stack lets a PDA handle nesting and matching, like balanced parentheses or strings of the form 0ⁿ1ⁿ, which a plain finite automaton cannot do because it has no unbounded memory. So a PDA is an FSM plus a stack. Each step depends on the next input symbol (or ε) and the symbol on top of the stack. You push when you need to remember something, and later you pop to match it. For 0ⁿ1ⁿ: push one marker per 0, then pop one marker per 1, and accept only if you end with the stack back to its start marker.
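That recipe can be sketched with an explicit stack (the function name is mine, and a plain counter would work too; the stack makes the PDA mechanics visible):

```python
def zeros_then_ones(s):
    """Recognize 0^n 1^n (n >= 1) the way a PDA would:
    push a marker for every 0, pop one for every 1."""
    stack = []
    seen_one = False
    for ch in s:
        if ch == '0':
            if seen_one:
                return False       # a 0 after a 1: wrong order
            stack.append('X')      # remember this 0
        elif ch == '1':
            seen_one = True
            if not stack:
                return False       # more 1s than 0s
            stack.pop()            # match one remembered 0
        else:
            return False           # symbol outside the alphabet
    return seen_one and not stack  # every 0 matched, at least one pair
```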
class BalancedParentheses(PDA):
    def start(self, token):
        # membership test on a tuple, not a string: END is not a str,
        # and `token in '([{'` would raise on it
        if token in ('(', '[', '{'): return self.push(token)
        if token == ')' and self.top() == '(': return self.pop()
        if token == ']' and self.top() == '[': return self.pop()
        if token == '}' and self.top() == '{': return self.pop()
        if token == END and self.hasEmptyStack(): return self.recognize()
Single state, all logic in the stack. See an opener? Push. See a closer? Check the top matches, pop. Reach END with an empty stack? Accept.
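The PDA base class itself is elided here; the full version lives in recognizers/foundations.py. A minimal sketch in the same style as the DFA class above, with the balanced-parentheses recognizer repeated so the example runs standalone:

```python
END = object()  # end-of-input sentinel, as in the DFA above

class PDA:
    def __init__(self):
        self.internal = 'start'
        self.stack = []
        self.recognized = False

    def push(self, symbol):
        self.stack.append(symbol)
        return self

    def pop(self):
        self.stack.pop()
        return self

    def top(self):
        return self.stack[-1] if self.stack else None

    def hasEmptyStack(self):
        return not self.stack

    def recognize(self):
        self.recognized = True
        return self

    def consume(self, token):
        # dispatch to the method named after the current state
        return getattr(self, self.internal)(token)

    @classmethod
    def evaluate(cls, string):
        cur = cls()
        for ch in string:
            cur = cur.consume(ch)
            if cur is None:
                return False  # no rule matched: reject
        cur = cur.consume(END)
        return bool(cur and cur.recognized)

# The recognizer from above, repeated so this sketch runs standalone.
class BalancedParentheses(PDA):
    def start(self, token):
        if token in ('(', '[', '{'): return self.push(token)
        if token == ')' and self.top() == '(': return self.pop()
        if token == ']' and self.top() == '[': return self.pop()
        if token == '}' and self.top() == '{': return self.pop()
        if token == END and self.hasEmptyStack(): return self.recognize()
```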
The pushdown automaton is something that will come up very frequently in [recognizers], and was also a big part of my master's thesis.
A PDA is a machine. So is a DFA, an NFA, a Turing machine. Machine and automaton mean the same thing here. They all share the same bones: states, transitions, input, and rules for accepting or rejecting. What changes between them is what kind of memory they have and how powerful that memory makes them.
Programmers have taken inspiration from this theory, not just implementing automata directly, but borrowing the ideas of states and transitions as patterns in code.
Videos on YT Lydia
Pushdown Automata Chris Marriott
Pattern Matching and Recursion Reginald Braithwaite
A Brutal Look at Balanced Parentheses, Computing Machines, and Pushdown Automata Reginald Braithwaite
III. Coding Things Like States
So we have machines that accept or reject strings. But first, a simpler question: how do you write code whose behavior changes based on internal state? Take a color mixer with two buckets, red and blue. Here are four ways to wire it up.
a. Match
Branch on the current state. An enum labels each state, the class holds the data separately:
from enum import Enum, auto

class Color(Enum):
    RED = auto()
    BLUE = auto()

class Mixer:
    def __init__(self):
        self.state = Color.RED
        self.red = 0
        self.blue = 0

    def add(self, amount=25):
        match self.state:
            case Color.RED: self.red = min(255, self.red + amount)
            case Color.BLUE: self.blue = min(255, self.blue + amount)
        return self

    def switch(self):
        self.state = Color.BLUE if self.state == Color.RED else Color.RED
        return self

m = Mixer()
m.add().add().switch().add()  # [BLUE] r=50 b=25
The caller chains .add() and .switch(). The match block decides which value to bump. Both values persist across switches.
b. Transition table
The Mixer uses match: each state needs a case, and switch hard-codes the toggle. What if we put the machine in data instead? Two dicts: one for transitions, one for which attribute each phase controls. The methods never mention a specific color:
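A sketch of what that table-driven Mixer could look like. The names phase_next and phase_attr come from the description; the class body is my reconstruction of the shape:

```python
# The machine as data: the toggle and the bucket choice live in dicts.
phase_next = {'RED': 'BLUE', 'BLUE': 'RED'}  # which phase follows which
phase_attr = {'RED': 'red', 'BLUE': 'blue'}  # which bucket each phase fills

class Mixer:
    def __init__(self):
        self.phase = 'RED'
        self.red = 0
        self.blue = 0

    def add(self, amount=25):
        attr = phase_attr[self.phase]  # look up, don't branch
        setattr(self, attr, min(255, getattr(self, attr) + amount))
        return self

    def switch(self):
        self.phase = phase_next[self.phase]  # table, not an if/else
        return self
```

Mixer().add().add().switch().add() leaves red=50 and blue=25, just like the match version.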
No match anywhere. phase_next maps each phase to the next one. phase_attr maps each phase to the attribute it controls; add uses getattr/setattr to read and write it.
c. State design pattern
The previous patterns let you call anything in any order. The GoF State pattern can enforce a protocol: you must add colors in order (red, then green, then blue). Calling the current color keeps adding. Calling the next color advances. Going backwards is rejected:
class Red:
    def red(self, p, a): p.r = min(255, p.r + a)
    def green(self, p, a): p.g = min(255, p.g + a); p.state = Green()
    def blue(self, p, a): raise ValueError("add green first")

class Green:
    def red(self, p, a): raise ValueError("red phase done")
    def green(self, p, a): p.g = min(255, p.g + a)
    def blue(self, p, a): p.b = min(255, p.b + a); p.state = Blue()

class Blue:
    def red(self, p, a): raise ValueError("red phase done")
    def green(self, p, a): raise ValueError("green phase done")
    def blue(self, p, a): p.b = min(255, p.b + a)

class Palette:
    def __init__(self): self.r = self.g = self.b = 0; self.state = Red()
    def red(self, a=25): self.state.red(self, a); return self
    def green(self, a=25): self.state.green(self, a); return self
    def blue(self, a=25): self.state.blue(self, a); return self

p = Palette().red(50).red(20).green(30).blue(10)
# Palette().blue(30) → ValueError: add green first
Palette just delegates. Each state decides what's allowed: the current color adds, the next color advances, going backwards is rejected.
d. Coroutine
Where the generator is paused IS the state. The code structure enforces ordering (red, then green, then blue): you can't skip phases. send(amount) adds to the current channel. send(0) advances to the next phase:
from typing import Generator

def mix_protocol() -> Generator[str, int, None]:
    red = green = blue = 0
    amount = yield "phase: RED"
    while True:
        if amount == 0: break
        red = min(255, red + amount)
        amount = yield f"[RED] rgb({red},{green},{blue})"
    amount = yield "phase: GREEN"
    while True:
        if amount == 0: break
        green = min(255, green + amount)
        amount = yield f"[GREEN] rgb({red},{green},{blue})"
    amount = yield "phase: BLUE"
    while True:
        if amount == 0: break
        blue = min(255, blue + amount)
        amount = yield f"[BLUE] rgb({red},{green},{blue})"
    yield f"done: rgb({red},{green},{blue})"
No state variable, no enum, no table. The control flow is the state machine. And because the generator remembers where it's paused, you can feed it a flat stream of values — it routes each one to the right channel on its own:
def pour_stream(amounts):
    protocol = mix_protocol()
    next(protocol)
    result = ""
    for amount in amounts:
        result = protocol.send(amount)
    return result
pour_stream([50, 25, 0, 100, 0, 30, 0]) # done: rgb(75,100,30)
IV. Recognizers
We define a language by what recognizer is able to accept it. A regular language can be recognized by a DFA: we can construct a finite state machine that reads a string and decides whether it belongs to the language. The machine has a finite number of states, and for every input it halts with a yes or no.
Match (pattern, text) -> text | None
The term recognizer also lives inside the theory of Turing machines, where the definition loosens. A Turing machine recognizer must halt and accept on every input that is in the language. But on input that is not in the language, it is allowed to either halt and reject, or run forever. That running forever is the infinite loop case.
A decider is stronger. It must halt on every input. It accepts if the string is in the language and rejects if it is not. Every decider is a recognizer, but not every recognizer is a decider.
This project is more interested in recognizers in a broad sense, and especially in their DFA and PDA versions, which is what the rest of [recognizers] will build on.
So given a grammar and some source code, the recognizer built from that grammar tells you yes or no: the string belongs to the language or it does not. That is the whole job.
Now scope out to the bigger picture. If you can define a grammar for a behavior, like tool calling, you can use a recognizer as a reward signal to train an LLM. The model generates an action, the recognizer checks whether it is valid, and that yes/no feeds back as reward. RL or SFT, the grammar gives you a clean, automatic way to score whether the model got it right.
Recognizers also work at inference time. At each token step, the recognizer masks out invalid continuations so the LLM can only produce well-formed output. This is constrained decoding.
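A toy illustration of that idea, not a real decoding API: reuse a DFA transition table (here the binary-number machine again) and treat the symbols with a defined transition from the current state as the only legal next tokens. A real decoder would mask out everything else in the model's vocabulary at each step. The names DFA_TABLE, VOCAB, and allowed are all mine:

```python
# The binary-number DFA's transition table doubles as a token mask.
DFA_TABLE = {
    ('start', '0'): 'zero',
    ('start', '1'): 'oneOrMore',
    ('oneOrMore', '0'): 'oneOrMore',
    ('oneOrMore', '1'): 'oneOrMore',
}
VOCAB = ['0', '1', '2']  # a toy vocabulary

def allowed(state):
    """Symbols the recognizer permits next; everything else is masked."""
    return [t for t in VOCAB if (state, t) in DFA_TABLE]

allowed('start')  # ['0', '1'], so '2' is masked out
```

From the zero state nothing is allowed, which is exactly the no-leading-zeros rule enforced at generation time instead of at the end.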
The recognizer from this article is the same machinery running under both. That is where [recognizers] is heading.
Deciders, recognizers & Turing Machine sketches Professor Malik Magdon-Ismail