Large Language Model Post-Training With Math, Some Illustrations, and in Code
This exemplar is my attempt to work through and understand reinforcement-learning post-training for large language models. We will go through some of the underlying math, look at a few illustrations, and study how these ideas appear in code through the RL trainer Verifiers-RL.
I Logit, Softmax, and Logprobs
When an LLM generates text token by token, its neural network produces a score for every entry in its vocabulary at each step. Generation is self-feeding, or autoregressive: $p(y_1, \ldots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid x, y_{<t})$. The model reads the entire prefix (prompt plus tokens generated so far) and outputs one logit per vocabulary token for the next position. Pick one token, append it to the prefix, repeat. Each next token is conditioned on all previous tokens, including the ones just generated.
These scores are called logits. The vocabulary here is the model's fixed lookup table, used by the tokenizer in both directions. In one direction, you give it text and it returns token IDs; those IDs are the numbers fed into the network. In the other direction, the model produces scores over all token IDs, one gets picked, and the tokenizer maps that ID back into the text token you see. The model only ever touches the numbers, not the text itself. Text exists on both sides of the model, but never inside the neural network.
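The loop above can be sketched in a few lines of plain Python. Everything here is a toy stand-in: the "model" is just a function that returns one logit per vocabulary entry, and the four-token vocabulary is invented for illustration. A real LLM does the same thing with a neural network and a vocabulary of tens of thousands of tokens.

```python
# Toy sketch of the autoregressive loop. VOCAB and toy_model are invented
# stand-ins, not a real model or tokenizer.
VOCAB = ["<eos>", "the", "cat", "sat"]  # index = token ID

def toy_model(prefix_ids):
    """One logit per vocabulary ID, conditioned on the whole prefix."""
    last = prefix_ids[-1] if prefix_ids else 0
    # Arbitrary toy rule: strongly favor the next ID, cyclically.
    return [3.0 if i == (last + 1) % len(VOCAB) else 0.0
            for i in range(len(VOCAB))]

def generate(prompt_ids, max_steps=10):
    ids = list(prompt_ids)
    for _ in range(max_steps):
        logits = toy_model(ids)                                    # score every token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax decoding
        ids.append(next_id)                                        # feed it back in
        if VOCAB[next_id] == "<eos>":
            break
    return [VOCAB[i] for i in ids]
```

Note that `generate` never looks at text: it works purely on IDs, and only the final mapping back through `VOCAB` produces readable tokens.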
A higher score means the network favors that token more strongly as the next one, while a lower score means it favors it less. The token you finally see depends not only on these scores but also on the decoding rule used to choose from them. If the decoding rule is argmax, the token produced will simply be the one with the highest score.
The output of the neural network is just a vector of scores, the logits vector shown in the illustration. The index i in that vector corresponds to entry i in the vocabulary table, which maps IDs to tokens.
Now, if we could directly adjust this scoreboard of logits, we could change what token comes out next. But that is not the same as changing the model itself. We can already change the visible output at decoding time by changing the decoding rule, for example by sampling instead of taking argmax. In that case the output changes, but the model has not. The policy, in the training sense, is still the same. RL post-training does not aim to change the decoder or manually alter logits after they are produced. It aims to change the weights so that the model produces different logits on its own.
But logits are still only raw scores. By themselves, they do not tell us in a clear or interpretable way how strongly the model prefers one token over another. If one token has a logit of 12.4 and another 11.9, it is not obvious what that gap really means. Softmax is the step that turns this raw scoreboard into something readable: a probability distribution over the vocabulary.
Mathematically, the logit vector z passes through softmax to produce probabilities, and taking the logarithm of those probabilities gives logprobs. These logprobs are always between minus infinity and 0, and they are the quantities that appear in RL training formulas. They tell us how likely the model considered the tokens it produced, and they provide the terms through which training adjusts the weights.
Once the logits have been turned into probabilities, and then into logprobs, the model’s output is now in the form that post-training can work with.
Let's get some intuition for how softmax turns this raw scoreboard into a probability distribution over the vocabulary. Inspired by Elliot Waite.
1. The model produced a score for each candidate token. These raw scores are the logits, sitting on a number line. Further right means more likely, but the numbers are not probabilities yet.
2. Pass each logit through $e^x$. This makes every value positive and amplifies differences: a small gap in logit space becomes a much larger gap after exponentiation.
3. Each logit maps to a point on the $e^x$ curve. The dashed lines show the projection: go up to the curve, read off the $e^z$ value. The highest logit gets a disproportionately large value.
4. Divide each $e^z$ by the total. The result is a probability distribution: every value between 0 and 1, summing to exactly 1. That is softmax.
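The four steps above map directly to code. This is a minimal sketch; production implementations subtract the maximum logit before exponentiating so that $e^z$ never overflows, which leaves the result unchanged.

```python
import math

def softmax(logits):
    # Steps 2-3: pass each logit through e^x -- every value becomes
    # positive, and gaps between logits are amplified.
    exps = [math.exp(z) for z in logits]
    # Step 4: divide each e^z by the total so the values sum to 1.
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: raw scores on a number line.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
# A gap of 1.0 in logit space becomes a factor of e (~2.72) in probability.
```

The last comment is worth dwelling on: because every probability is $e^{z_i}$ divided by the same total, the ratio between two probabilities depends only on the difference between their logits.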
Temperature is worth noting here because it operates on the logits, before softmax, not after. You divide each logit by T, then apply softmax. A high T squashes the logits closer together so softmax produces a flatter, more uniform distribution. A low T stretches the gaps so softmax becomes spikier, concentrating probability on the top token. It has to happen before softmax because if you scaled after, the probabilities would no longer sum to 1.
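A small sketch of temperature, assuming the same plain-Python softmax as above (here with the standard max-subtraction for numerical stability):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_temperature(logits, T):
    # Divide each logit by T *before* softmax, then normalize as usual.
    return softmax([z / T for z in logits])

logits = [3.0, 1.0, 0.0]
flat  = softmax_with_temperature(logits, 2.0)  # high T: flatter distribution
spiky = softmax_with_temperature(logits, 0.5)  # low T: spikier distribution
```

Because the division happens before softmax, both `flat` and `spiky` still sum to exactly 1; only the shape of the distribution changes.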
Softmax is a way to transform any set of numbers into a probability distribution. What's a probability distribution? A probability is the chance of something happening or being observed, like "30% chance of rain" or ".00000001% chance of winning the lottery." A probability distribution is a set of probabilities that sum to 100% (or to 1 if not scaled up by 100). For example, "30% chance of rain" is not a probability distribution, but if we'd add to the set "70% chance of no-rain" then we'd have a probability distribution.
Logprobs are the natural logarithm of the softmax output. Softmax gives you a probability between 0 and 1, and taking ln of that gives you a number between minus infinity and 0. A probability close to 1 becomes a logprob close to 0. A tiny probability becomes a hugely negative logprob. This is useful because the log stretches out the differences at the low end. Going from probability 0.001 to 0.002 is nearly invisible on a 0-to-1 scale, but in log space it is the difference between -6.9 and -6.2, a meaningful jump the gradient can act on. At the top end the opposite happens: going from 0.95 to 0.96 barely moves the logprob. So the model gets strong signal when it improves on things it is bad at, and does not waste effort squeezing tiny gains where it is already confident. This is why training works in log space.
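The numbers in that paragraph check out directly. Note the asymmetry: the low-end move is ten times smaller in probability space, yet far larger in log space.

```python
import math

# Low end: 0.001 -> 0.002 is a move of +0.001 in probability space,
# but from about -6.9 to -6.2 in log space.
low_jump = math.log(0.002) - math.log(0.001)

# Top end: 0.95 -> 0.96 is a move of +0.01 in probability space,
# yet barely registers in log space.
top_jump = math.log(0.96) - math.log(0.95)
```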
Temperature and sampling strategy are two separate things. Temperature reshapes the distribution, it does not decide what gets picked. The sampling strategy decides how to pick from the distribution. Greedy (argmax) always picks the highest probability token. Top-k limits the selection to the k most probable tokens, then samples among those. Top-p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds p, then samples from that set. Multinomial sampling picks randomly according to the full probability weights. You can combine them: temperature first to control the shape, then a sampling strategy to pick from the result.
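The strategies above can be sketched over a plain list of probabilities. This is a simplified sketch: real decoders apply these filters in logit space over huge vocabularies, but the selection logic is the same.

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    # Argmax: always the single most probable token ID.
    return max(range(len(probs)), key=probs.__getitem__)

def top_k(probs, k, rng=random):
    # Keep the k most probable IDs, then sample among them by weight.
    ids = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return rng.choices(ids, weights=[probs[i] for i in ids])[0]

def top_p(probs, p, rng=random):
    # Smallest set of IDs whose cumulative probability reaches p.
    ids = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, total = [], 0.0
    for i in ids:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept])[0]

def multinomial(probs, rng=random):
    # Sample from the full distribution according to its weights.
    return rng.choices(range(len(probs)), weights=probs)[0]
```

Combining temperature with any of these is just function composition: reshape the logits first, softmax, then hand the resulting probabilities to the chosen sampler.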
$\log \pi_\theta(a_t \mid s_t)$: the logprob of action $a_t$ given state $s_t$ under policy $\pi_\theta$. This is the term that appears in Policy Gradient, REINFORCE, PPO, GRPO, and every other RL training loss.
$\pi_\theta$ is the policy, the model's behavior as defined by its current weights $\theta$. It takes a state $s_t$ (the sequence so far) and returns a probability distribution over all actions (every token in the vocabulary). $\pi_\theta(a_t \mid s_t)$ is the softmax probability assigned to the token that was actually chosen at step $t$. Since softmax is a coupled function where all probabilities must sum to 1, the formulas only need this one value. Adjusting the probability of the chosen token automatically redistributes probability across the entire vocabulary. That is why $\log \pi_\theta(a_t \mid s_t)$, a single logprob per token, is enough for RL to work with.
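Concretely, the quantity the RL losses consume is one number per generated token: the log of the softmax probability of the token that was actually sampled. A minimal sketch (real trainers gather this from a batched logits tensor, but the per-token math is identical), using the standard identity $\log \operatorname{softmax}(z)_a = z_a - \operatorname{logsumexp}(z)$:

```python
import math

def chosen_logprob(logits, action_id):
    # log softmax(z)[a] = z_a - logsumexp(z), computed stably by
    # subtracting the max logit before exponentiating.
    m = max(logits)
    logsumexp = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[action_id] - logsumexp
```

Two logits that are equal give each token probability 0.5, so their logprob is $\ln 0.5 \approx -0.693$; raising one logit raises its logprob toward 0 while pushing every other token's logprob further negative.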
Now imagine you could manually adjust these logprobs. You look at the model's output, you see which token it picked and how confident it was, and you nudge the probability up or down. Push one logprob up, softmax redistributes, and next time the model is more likely to pick that token. Push it down, and the model avoids it. You are manually reshaping the model's behavior one logprob at a time.
But here is the thing to keep in mind: the logprob is an output, not an input. You cannot reach into the network and change a logprob directly. The logprob comes from the weights. To make a logprob go up or down, you have to change the weights that produced it. Gradient descent is the mechanism that figures out which weights to nudge and by how much so that the logprob moves in the direction you want. That is what $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ means in the formulas: the gradient of the logprob with respect to every weight in the network. It tells you the direction to push the weights so that this particular logprob changes.
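At the final layer, this gradient has a well-known closed form that makes the redistribution visible: the gradient of $\log \pi(a)$ with respect to the logits is one-hot($a$) minus the softmax probabilities. (This is only the first link of the chain rule; backpropagation then carries it into every weight $\theta$.) A sketch:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_logprob_wrt_logits(logits, action_id):
    # d/dz_i of log softmax(z)[a] = 1[i == a] - softmax(z)[i]
    probs = softmax(logits)
    return [(1.0 if i == action_id else 0.0) - p
            for i, p in enumerate(probs)]
```

The components of this gradient always sum to zero: pushing the chosen token's logit up is exactly balanced by pushing every other token's logit down, which is the coupling described above.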
So the picture is: you see a logprob you want to be higher, but you cannot touch it directly. You compute the gradient, which tells you how to adjust the weights, and the logprob changes as a consequence. Now scale that by a signal that says whether the choice was good or bad, and you have training. That signal is called the advantage.
- Logit and softmax in deep learning, by Minsuk Heo
- Softmax Function Explained In Depth with 3D Visuals, by Elliot Waite
- Why Do Neural Networks Love the Softmax?, by Mutual Information
- Sigmoid and SoftMax Functions in 5 minutes, by Gabriel Furnieles
II Advantage, Gradient, and Descent
The key insight is: logprob × advantage = the update signal. Every algorithm is a variation on that. The differences are in how they compute the advantage and how they keep the updates stable.
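Under stated assumptions (a toy softmax policy over three actions, with the gradient applied directly to the logits rather than to real network weights), one step of that update signal looks like this:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, action_id, advantage, lr=1.0):
    # Per component: (gradient of log pi(a) w.r.t. the logits) * advantage.
    # A positive advantage pushes p(a) up; a negative one pushes it down.
    probs = softmax(logits)
    grad = [((1.0 if i == action_id else 0.0) - p) * advantage
            for i, p in enumerate(probs)]
    return [z + lr * g for z, g in zip(logits, grad)]

uniform = [0.0, 0.0, 0.0]            # policy starts out uniform
after_good = policy_gradient_step(uniform, 1, advantage=+1.0)
after_bad  = policy_gradient_step(uniform, 1, advantage=-1.0)
```

After a positive-advantage step the chosen action's probability rises above its uniform starting point of 1/3; after a negative-advantage step it falls below it. Everything that follows in this section refines how that advantage is computed.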