Large Language Model Post-Training With Math, Some Illustrations, and in Code
This exemplar is my attempt to work through and understand reinforcement-learning post-training for large language models. We will go through some of the underlying math, look at a few illustrations, and study how these ideas appear in code through the RL trainer Verifiers-RL.
I Logits, Softmax, and Log-Probs
When an LLM generates text token by token (self-feeding, autoregressive), its neural network produces a score for every entry in its vocabulary at each step. These scores are called logits. The vocabulary here is the model's fixed table that maps token IDs to token strings. The model works with the IDs, and once one is selected, the tokenizer maps it back into the text token you see.
Autoregressive, self-feeding, means the completion factorizes as $p(y_1, \ldots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid x, y_{<t})$. The model reads the entire prefix (prompt + tokens generated so far) and outputs one logit per vocabulary token for the next position. Pick one token, append it to the prefix, repeat. Each next token is conditional on all previous tokens, including the ones just generated.
The vocabulary is a fixed lookup table used by the tokenizer. In one direction, you give it text and it returns token IDs. Those IDs are the numbers fed into the network. In the other direction, the model produces scores over all token IDs, one gets picked, and the tokenizer maps that ID back into text. The model only ever touches the numbers, not the text itself. Text exists on both sides of the model, but never inside the neural network.
The token ID itself is just an index, a row number. The model's first layer is an embedding table: a learned matrix of shape [vocab_size × d_model]. Token ID 4827 → go to row 4827 → pull out a vector of, say, 4096 learned weights. That vector, not the integer, is what flows into the transformer. The weights in this table start random and get shaped by training so that tokens appearing in similar contexts end up with similar vectors.
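A minimal sketch of that lookup (the vocabulary size, model width, and token ID below are made up for illustration, not tied to any particular model):

import torch

vocab_size, d_model = 50_000, 4096                    # illustrative sizes
embedding = torch.nn.Embedding(vocab_size, d_model)   # learned [vocab_size × d_model] table

token_id = torch.tensor([4827])                       # the integer index from the tokenizer
vector = embedding(token_id)                          # row 4827 of the table: shape [1, 4096]
# this vector of learned weights, not the integer 4827, is what flows into the transformer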
A higher score means the network favors that token more strongly as the next one, while a lower score means it favors it less. The token you finally see depends not only on these scores but also on the decoding rule used to choose from them. If the decoding rule is argmax, the token produced will simply be the one with the highest score.
The output of the neural network is just a vector of scores, one logit per vocabulary entry. The index $i$ in that vector corresponds to entry $i$ in the vocabulary table, which maps IDs to tokens.
Now, if we could directly adjust this scoreboard of logits, we could change what token comes out next. But that is not the same as changing the model itself. We can already change the visible output at decoding time by changing the decoding rule, for example by sampling instead of taking argmax. In that case the output changes, but the model has not. The policy, in the training sense, is still the same. RL post-training does not aim to change the decoder or manually alter logits after they are produced. It aims to change the weights so that the model produces different logits on its own.
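A small sketch of that distinction, with a made-up four-token logits vector (softmax, introduced just below, is used here only to get probabilities to sample from):

import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.75, -2.0, 0.5])    # illustrative scores, one per vocabulary entry
probs = F.softmax(logits, dim=-1)

greedy_id = torch.argmax(logits).item()                        # argmax decoding: always token 0 here
sampled_id = torch.multinomial(probs, num_samples=1).item()    # sampling: usually 0, sometimes not
# the decoding rule changed, the weights (and therefore the policy) did not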
But logits are still only raw scores: any real number from $-\infty$ to $+\infty$. On their own, they are not yet probabilities, and their scale is not especially interpretable. If one token has a logit of 12.4 and another 11.9, we know the model favors the first, but it is not immediately clear how much that difference matters probabilistically. What we need is a probability distribution: values that are positive and sum to 1. Softmax is the step that converts logits into exactly that.
Given a vector of logits $z$, the probability of token $i$ is the exponential of its logit divided by the sum of exponentials over all tokens in the vocabulary:
$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
The exponential, $e^{z_i}$, makes every score positive regardless of sign, while dividing by the total sum normalizes the result into a proper probability distribution. A token with a larger logit receives a larger share of the total probability mass, but that share is always determined relative to the full field of alternatives. Softmax therefore makes each token's probability depend on every other token, not just its own score.
Why the exponential? It is not the only function that could make values positive, but it has several especially useful properties: it preserves the ranking of the logits, smoothly amplifies differences between them, and leads to a simple, well-behaved normalization. This means that larger logits become disproportionately more likely, while still remaining part of a coherent probability distribution over the whole vocabulary. See Mutual Information, "Why Do Neural Networks Love the Softmax?".
Let’s build some intuition for softmax. The illustration below is inspired by Elliot Waite.
a The model assigns one logit to each candidate token. These are raw scores, not probabilities. A token farther to the right is more strongly favored, but the values are not yet easy to interpret on their own.
b To convert these scores into probabilities, we first pass them through the exponential function, $e^{z_i}$. This curve rises steeply to the right, so tokens with larger logits are mapped to disproportionately larger positive values.
c Now project each token up to the curve. The dashed lines show how a small gap between logits becomes a much larger gap in $e^{z_i}$ values. This is the stretching effect: even a modest advantage in logit space becomes more pronounced after exponentiation.
d Finally, divide each $e^{z_i}$ value by the total across all candidates. This turns the exponentiated scores into a probability distribution: every value lies between 0 and 1, and together they sum to exactly 1. The tokens are now competing for a fixed amount of probability mass.
Drag the sliders to see how increasing one token's score redistributes probability across all the others.
e Now hold the other logits fixed and vary only $z_4$. Its probability traces out an S-shaped curve: near the extremes, the probability barely moves, but in the middle, where the token is genuinely competitive, small changes in the logit produce large changes in probability. In the binary case this is exactly the sigmoid, $\sigma(x) = \frac{1}{1 + e^{-x}}$. More generally, it is the same logistic intuition appearing inside softmax.
f You have probably heard of the temperature parameter. Before applying softmax, we divide every logit by $T$: $p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$. At $T{=}1$ nothing changes. As $T$ increases, the distribution flattens toward uniform, giving lower-ranked tokens a real chance of being sampled. This is why high temperature produces more creative, surprising, and sometimes incoherent outputs. As $T \to 0$, the distribution sharpens toward argmax, concentrating nearly all probability on the top token, making the model deterministic and repetitive. Temperature controls how sharply or loosely the model follows its own preferences at decoding time.
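A small sketch of that knob, using a made-up four-token logits vector:

import torch
import torch.nn.functional as F

z = torch.tensor([3.0, 1.75, -2.0, 0.5])

for T in [0.1, 1.0, 2.0, 10.0]:
    p = F.softmax(z / T, dim=-1)     # divide every logit by T before softmax
    print(T, p.tolist())
# T = 0.1  → almost all mass on the top token (near-argmax)
# T = 1.0  → the unmodified softmax distribution
# T = 10.0 → close to uniform across the four tokens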
Temperature is an inference-time knob: it changes how we sample from the model's probabilities, but it does not change the model's weights or the policy it has learned. Not what we are interested in here.
Softmax turns a raw score vector into a competitive probability distribution. Once we have those probabilities, we can talk not just about which token was chosen, but how strongly the model preferred it. That is where log-probabilities enter. Taking the logarithm of those probabilities gives log-probs: numbers that are always between $-\infty$ and 0. A probability close to 1 becomes a log-prob close to 0. A tiny probability becomes a hugely negative log-prob. This is useful because the log stretches out differences at the low end: going from probability 0.001 to 0.002 is nearly invisible on a 0-to-1 scale, but in log space it is the difference between $-6.9$ and $-6.2$, a meaningful jump the gradient can act on. So the model gets strong signal when it improves on things it is bad at, and does not waste effort squeezing tiny gains where it is already confident. This is why training works in log space.
$\pi_\theta$ is the policy: the model's behavior as defined by its current weights $\theta$. It takes a state $s_t$ (the sequence so far) and returns a probability distribution over all actions (every token in the vocabulary). $\pi_\theta(a_t \mid s_t)$ is the softmax probability assigned to the token that was actually chosen at step $t$. Although the model produces probabilities for every token in the vocabulary, the policy-gradient loss only needs the log-prob of the sampled token. That one quantity is enough, because its gradient already depends on the whole softmax distribution.
Across policy-gradient-style methods, the central quantity is always the model's probability of the sampled action under the current policy. Sometimes it appears directly as a log-prob, $\log \pi_\theta(a_t \mid s_t)$, as in Policy Gradient, REINFORCE, and RLOO. Sometimes it appears indirectly through a probability ratio such as $\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, as in PPO and GRPO. But it is the same object underneath: how likely the model made the behavior it actually produced.
For LLMs, the action can be viewed at two levels: a single next token at one step, or an entire sampled completion made of many token actions. The probability of the full completion is the product of token probabilities, so its log-prob is the sum of token log-probs. The model is updated token by token, but the feedback often arrives sequence by sequence: a reward model or verifier scores the full completion, and that scalar signal is then used to push up or down the log-probs of the sampled tokens that produced it.
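A small sketch of that bookkeeping, with made-up per-token probabilities for a three-token completion:

import torch

# probabilities the model assigned to the tokens it actually sampled (illustrative values)
token_probs = torch.tensor([0.9, 0.4, 0.7])

seq_prob = token_probs.prod()           # product of probabilities: 0.252
seq_logprob = token_probs.log().sum()   # sum of log-probs: log(0.252) ≈ -1.38
# a verifier scores the whole completion once; that single scalar then scales
# the update applied to each of these per-token log-probs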
Now imagine you could manually adjust these log-probs. You look at the model's output, you see which token it picked and how confident it was, and you nudge the probability up or down. Push one log-prob up, softmax redistributes, and next time the model is more likely to pick that token. Push it down, and the model avoids it. You are manually reshaping the model's behavior one log-prob at a time.
But here is the thing to keep in mind: the log-prob is an output, not a parameter. You cannot reach into the network and change a log-prob directly. The log-prob comes from the weights. To make a log-prob go up or down, you have to change the weights that produced it. Gradient descent is the mechanism that figures out which weights to nudge and by how much so that the log-prob moves in the direction you want. That is what $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ means in the formulas: the gradient of the log-prob with respect to every weight in the network. It tells you the direction to push the weights so that this particular log-prob changes.
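To make that concrete, here is a minimal sketch with a tiny made-up linear policy (not the trainer's code): autograd computes $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for us, and a small step along that gradient raises the sampled token's log-prob.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(8, 5, requires_grad=True)    # toy "weights": 8-dim state → 5-token vocabulary
state = torch.randn(8)                       # the sequence-so-far, as a feature vector

logits = state @ W
logprob = F.log_softmax(logits, dim=-1)[2]   # log-prob of the token that was sampled, say ID 2

logprob.backward()                           # ∇_θ log π_θ(a|s): one entry per weight
with torch.no_grad():
    W += 0.01 * W.grad                       # nudge the weights along that gradient

new_logprob = F.log_softmax(state @ W, dim=-1)[2]
# new_logprob is now slightly higher than logprob: the output moved because the weights did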
In Verifiers-RL, the inference server (vLLM) generates a completion and returns the sampled token IDs together with their per-token log-probs. That is what the rollout looks like by the time it reaches the trainer: not logits, not raw probabilities, but token IDs, log-probs, masks, and rewards.
The trainer then runs the same tokens through its own (updated) copy of the model to get fresh log-probs. This is where softmax actually appears in code, inside a function called selective_log_softmax:
# the model forward pass produces logits
logits = model(input_ids, attention_mask).logits

# selective_log_softmax: logits → log-probs for the sampled tokens only
# this is log_softmax(z_i) = z_i - logsumexp(z)
trainer_logprobs = selective_log_softmax(logits, targets)
Concretely, take a toy vocabulary of four tokens with logits 3.0, 1.75, −2.0, and 0.5. Softmax turns those scores into probabilities, and the log turns the probabilities into log-probs:
import torch
import torch.nn.functional as F

z = torch.tensor([3.0, 1.75, -2.0, 0.5])
probs = F.softmax(z, dim=-1)           # → [0.725, 0.208, 0.005, 0.060]
log_probs = F.log_softmax(z, dim=-1)   # → [-0.322, -1.572, -5.322, -2.822]
The loss itself is then built from the difference between these two sets of log-probs, the ones from inference and the ones from the trainer's current weights:
# importance ratio: how much has the policy shifted?
log_importance_ratio = trainer_logprobs - inference_logprobs
importance_ratio = torch.exp(log_importance_ratio)

# GRPO loss: push up log-probs of good actions, push down bad ones
loss = (-importance_ratio * advantages)[keep_mask].sum()
That is the whole loop. The inference server returns token IDs and log-probs. The trainer recomputes log-probs under its current weights. The ratio between them, scaled by the advantage, is the gradient signal that updates the weights. Softmax never appears as an explicit step in the training loop. It is folded into selective_log_softmax, one line of code that turns logits into the log-probs the loss needs.
So the picture is: you see a log-prob you want to be higher, but you cannot touch it directly. You compute the gradient, which tells you how to adjust the weights, and the log-prob changes as a consequence. Now scale that by a signal that says whether the choice was good or bad, and you have training. That signal is called the advantage.
In Ab Extra we take a closer look at three functions that appear in the trainer code: selective_log_softmax, entropy_from_logits, and KL divergence, as well as the function and setup in the codebase where they are used.
Minsuk Heo, "Logit and softmax in deep learning"
Elliot Waite, "Softmax Function Explained In Depth with 3D Visuals"
Mutual Information, "Why Do Neural Networks Love the Softmax?"
Gabriel Furnieles, "Sigmoid and SoftMax Functions in 5 minutes"
II Advantage, Gradient, and Descent
Section I ended with log-probs: the quantity that tells us how likely the model thought a chosen token was. Section II is about how those log-probs become a learning signal that actually changes the model's weights.
The basic policy-gradient idea is simple. The policy assigns probabilities to actions. One action is sampled. The environment or verifier tells us whether that action led to a good or bad outcome. That feedback is turned into an advantage, which says whether the chosen action should become more likely or less likely in the future. The chosen action's log-prob is then combined with that advantage, backpropagation computes the gradients, and gradient descent applies the update.
In compact form, the update signal has the shape
$$\log \pi_\theta(a_t \mid s_t) \cdot A_t$$
More precisely, training uses the gradient of this quantity with respect to the weights. But as intuition, this already captures the idea: look at the action the policy actually took, look at whether it turned out better or worse than expected, and then move the weights so that similar actions become more or less likely in the future.
Advantage says what direction the update should go. If the advantage is positive, the chosen action should become more likely. If it is negative, it should become less likely. The gradient says how the weights would have to move to make that happen. Gradient descent is the step that actually applies that change.
The environment itself does not update the model directly. It only produces experience: states, sampled actions, and outcomes. The trainer is what turns that experience into learning. It computes an advantage from the outcome, combines it with the chosen action's log-prob, backpropagates through the network, and updates the weights. So before looking at PPO, GRPO, or any other variant, it helps to understand those two pieces separately: what the advantage is measuring, and why the gradient of the log-prob tells us how to change the weights.
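As a minimal sketch of that trainer step (a toy policy and a hand-picked advantage, not the Verifiers-RL code), the whole loop fits in a few lines:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
policy = torch.nn.Linear(8, 5)                      # toy policy: 8-dim state → 5-token vocabulary
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

state = torch.randn(8)
logits = policy(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                              # the environment sees this sampled action...
advantage = torch.tensor(1.0)                       # ...and (here, by fiat) calls it good

loss = -dist.log_prob(action) * advantage           # log π_θ(a|s) · A, negated for descent
optimizer.zero_grad()
loss.backward()                                     # gradient of the log-prob w.r.t. every weight
optimizer.step()                                    # gradient descent applies the update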
III Ab Extra
Three functions from the trainer that are worth looking at more closely. Each one is built from the same ingredients we covered in Section I: logits, softmax, and log-probs.
In the codebase, all three live in t2f_trainer/rl/trainer/utils.py and are called from t2f_trainer/rl/trainer/trainer.py inside get_logprobs() and compute_loss().
Called in trainer.py → get_logprobs(), defined in utils.py → selective_log_softmax()
selective_log_softmax gives you the log-prob of what was chosen. The model's forward pass produces a logit for every token in the vocabulary at every position, but the trainer only needs the log-prob of the sampled token, not the full softmax table. This is because the gradient of that one log-prob already depends on all the logits through the softmax denominator. Pushing one sampled action's log-prob up or down implicitly reshapes the whole distribution.
It gathers the logit of the sampled token ($z_i$), computes the log-sum-exp across the vocabulary (the normalizing constant), and subtracts. One log-prob per position, one line of math:
# gather the logit of the sampled token
selected_logits = torch.gather(logits, dim=-1, index=index.unsqueeze(-1)).squeeze(-1)

# compute the normalizing constant (logsumexp over vocabulary)
logsumexp_values = torch.stack([torch.logsumexp(lg, dim=-1) for lg in logits])

# log_softmax(z_i) = z_i - logsumexp(z)
per_token_logps = selected_logits - logsumexp_values
This is memory-efficient because it never builds a $[\text{batch} \times \text{seq} \times \text{vocab}]$ probability tensor. For each position, it only keeps the sampled token's logit and a single logsumexp scalar.
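As a quick sanity check (random toy tensors, not the trainer's shapes), the selective computation agrees with gathering from the full log-softmax table; it just never materializes that table:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq, vocab = 2, 4, 10
logits = torch.randn(batch, seq, vocab)
targets = torch.randint(vocab, (batch, seq))       # the sampled token IDs

# the memory-hungry way: full [batch × seq × vocab] log-softmax, then gather
full = F.log_softmax(logits, dim=-1)
gathered = full.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# the selective way: z_i - logsumexp(z), one value per position
selective = logits.gather(-1, targets.unsqueeze(-1)).squeeze(-1) - torch.logsumexp(logits, dim=-1)

assert torch.allclose(gathered, selective, atol=1e-6)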
Called in trainer.py → get_logprobs(), defined in utils.py → entropy_from_logits()
Entropy tells you how uncertain the model was overall. It measures how spread out the model's output distribution is at a given position; it does not tell you whether that confidence is deserved. If the model puts all probability on one token, entropy is near zero. If it spreads probability evenly across many tokens, entropy is high. The formula is Shannon entropy:
$$H = -\sum_i p_i \log p_i$$
This is computed from the same softmax output. In code, it works in chunks for memory efficiency:
logps = F.log_softmax(chunk, dim=-1)
chunk_entropy = -(torch.exp(logps) * logps).sum(-1)
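Two made-up four-token distributions make the extremes concrete: a near-argmax policy has entropy close to zero, while a policy with no preference at all sits at the maximum, $\log(\text{vocab size})$:

import torch
import torch.nn.functional as F

def entropy(logits):
    logps = F.log_softmax(logits, dim=-1)
    return -(torch.exp(logps) * logps).sum(-1)

peaked = torch.tensor([10.0, 0.0, 0.0, 0.0])   # nearly all mass on one token
flat = torch.tensor([1.0, 1.0, 1.0, 1.0])      # no preference at all

print(entropy(peaked))   # ≈ 0.0015, almost deterministic
print(entropy(flat))     # = log(4) ≈ 1.386, maximal for four tokens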
The trainer logs entropy as a monitoring metric. If entropy collapses during training, the model has become too deterministic and stopped exploring, which is a sign of instability. If entropy stays high, the model is not converging. Healthy training usually shows entropy gradually decreasing as the model becomes more confident about good actions.
Computed in trainer.py → compute_loss() as mismatch_kl
KL divergence tells you how far one policy has moved from another. Where entropy measures how spread out one distribution is, KL divergence measures how different two distributions are from each other. In RL post-training, it measures how far the current policy has drifted from a reference or earlier policy:
$$D_{\text{KL}}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}$$
Full KL divergence would require computing softmax over the entire vocabulary under both policies at every position — extremely expensive. The trainer does not do this. Instead, it uses a per-token approximation: it takes the log-prob of the sampled token under both the current and the inference policy, computes the ratio between them, and applies the approximation $e^r - r - 1$:
# importance ratio in log space — only for the sampled token
log_importance_ratio = trainer_logprobs - inference_logprobs

# per-token KL approximation (not full-distribution KL)
mismatch_kl = torch.exp(log_importance_ratio) - log_importance_ratio - 1
This is not the full distribution-level KL. It only asks: for this specific token that was chosen, how much did the probability change? It is a lightweight proxy — if even the sampled token's probability has shifted drastically, the full distribution has probably shifted a lot too. The approximation $e^r - r - 1$ is always non-negative, equals zero when the two policies agree, and penalizes large deviations in either direction regardless of sign. Together with importance ratio clipping (masking when the ratio falls outside $[0.125,\, 8.0]$), it acts as the leash that keeps the policy from drifting too far in a single update.
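A quick numeric check of those properties, using a handful of made-up log-ratios:

import torch

# log-ratios r = log π_new(a|s) - log π_old(a|s): unchanged, pushed up, pushed down
r = torch.tensor([0.0, 0.5, -0.5, 2.0, -2.0])
approx_kl = torch.exp(r) - r - 1

print(approx_kl)   # → [0.000, 0.149, 0.107, 4.389, 1.135]
# zero when the two policies agree on the sampled token, positive otherwise,
# and growing quickly for large shifts in either direction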