Code of the Day

How large language models work

LLMs predict the next token — that one mechanism, repeated at scale, produces everything you see.

Using AI · Beginner · 9 min read
By the end of this lesson you will be able to:
  • Explain what a token is and why models think in tokens, not words
  • Describe the training process at a conceptual level
  • Articulate why LLMs produce fluent text that can still be factually wrong
  • Explain what temperature controls and why it matters to you as a user

When you send a message to an AI chat interface and a reply streams back, word by word, you are watching one mechanism run over and over: predict the next token, add it to the sequence, repeat. That is genuinely all it is doing. The sophistication comes from doing it at enormous scale, with a model trained on a vast slice of human writing. Understanding that mechanism — not at a mathematical level, but at a conceptual one — will change how you read AI output.
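The predict, append, repeat loop described above can be sketched in a few lines. This is a toy illustration, not how any real model is implemented: the `NEXT_TOKEN_PROBS` table, its tokens, and its probabilities are all invented for this example, and a real model computes its probabilities from the whole sequence so far rather than looking up only the last token.

```python
import random

# Hypothetical lookup table standing in for a trained model.
# Each token maps to its possible next tokens and their probabilities.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate(rng, max_tokens=10):
    """Build a sequence one token at a time until <end> or the length limit."""
    sequence = ["<start>"]
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[sequence[-1]]   # 1. predict the next token
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        next_token = rng.choices(tokens, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        sequence.append(next_token)              # 2. add it to the sequence
    return sequence[1:]                          # 3. the loop repeats from step 1


print(" ".join(generate(random.Random(0))))
```

Each pass through the loop produces exactly one token, which is why replies in a chat interface appear piece by piece.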

Tokens, not words

LLMs do not process text as words. They process tokens: chunks of text, often a whole short word or a fragment of a longer one, chosen statistically when the model's tokenizer is built. The word "programming" might be one token. The word "extraordinarily" might be split into several. Common short words are usually single tokens; rare words and non-English text tend to be split into more pieces.
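To make the splitting concrete, here is a minimal sketch of a tokenizer. The vocabulary `VOCAB` is invented for this example, and greedy longest-match is used only because it is short to write; production tokenizers typically rely on learned merge rules such as byte-pair encoding and have vocabularies of tens of thousands of entries.

```python
# Hypothetical vocabulary, invented for illustration only.
VOCAB = {"programming", "extra", "ordin", "arily", "the", "is", " "}


def tokenize(text, vocab=VOCAB):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("programming"))      # ['programming']
print(tokenize("extraordinarily"))  # ['extra', 'ordin', 'arily']
```

With this made-up vocabulary, one word becomes one token and the other becomes three, so word count and token count diverge.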

Why does this matter to you?

  • Counting is unreliable. When you ask an AI to "write exactly 200 words," it is not counting words — it is generating tokens. The result will be approximate.
  • Typos and odd formatting can confuse things. An unusual string of characters may tokenise in an unexpected way, throwing off the model.
  • Context limits are in tokens. Most model limits you read about — "this model has a 128k context window" — are token counts, not word counts. A rough rule of thumb: one token is about three-quarters of a word in English.
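The rule of thumb above turns into quick arithmetic. This sketch assumes the three-quarters ratio holds, which is only approximate for English and can be well off for code or other languages; the function names are made up for this example.

```python
WORDS_PER_TOKEN = 0.75  # rough English rule of thumb from the text above


def estimate_words(token_count):
    """Approximate how many English words fit in a given number of tokens."""
    return round(token_count * WORDS_PER_TOKEN)


def estimate_tokens(word_count):
    """Approximate how many tokens a given number of English words will use."""
    return round(word_count / WORDS_PER_TOKEN)


print(estimate_words(128_000))  # 96000: a 128k-token window holds about 96k words
print(estimate_tokens(200))     # 267: a 200-word answer costs about 267 tokens
```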

Training: learning from prediction errors

The training process is conceptually simple. Take a huge corpus of text — books, web pages, code, conversations — and repeatedly ask the model to predict what comes next. Compare the model's prediction to what actually comes next. Adjust the model's internal weights slightly to make the correct continuation a little more likely. Repeat billions of times.
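One round of predict, compare, adjust can be sketched with a three-token vocabulary. This is a deliberately simplified illustration: it nudges the output scores (logits) directly, whereas a real model adjusts the internal weights that produce those scores. The vocabulary, the learning rate, and the function names are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, correct_index, learning_rate=0.5):
    """Shift the scores so the correct token becomes a little more likely.

    For cross-entropy loss, the gradient with respect to each logit is
    (predicted probability - 1) for the correct token and
    (predicted probability - 0) for every other token.
    """
    probs = softmax(logits)
    return [
        logit - learning_rate * (p - (1.0 if i == correct_index else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical example: the text actually continues with "mat".
vocab = ["mat", "moon", "sofa"]
logits = [0.0, 0.0, 0.0]  # the untrained model rates all three equally
correct = vocab.index("mat")

for step in range(5):
    print(step, round(softmax(logits)[correct], 3))
    logits = training_step(logits, correct)
```

The printed probability of "mat" rises a little on each step, which is the "adjust slightly, repeat" part of the description at miniature scale.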

The result is a model with billions, and in the largest systems hundreds of billions, of numerical weights that encode, implicitly, an enormous amount of statistical structure about how language works: grammar, facts, idioms, argument patterns, code syntax, and much more. The model did not memorise this structure; it learned it because that structure helped it predict the training data.

The weights are not a database of facts. They are a compressed statistical summary of patterns in training data. This is why the model can write a convincing paragraph about a topic it has never been explicitly "told" about — and why it can also write a convincing paragraph that is subtly or catastrophically wrong.

Why fluent text is not the same as correct text

Here is the key consequence of that training process: the model optimises for plausibility, not truth. During training, it is rewarded for predicting the text that actually follows in human writing, and human writing contains mistakes, outdated claims, and fiction alongside accurate statements. Nothing in the training objective marks which is which, so what the model learns is that fluent, typical continuations are rewarded.

This creates the failure mode called hallucination: the model generates confident-sounding, grammatically perfect text that asserts facts that are simply wrong. It is not lying; it has no concept of truth. It is doing exactly what it was trained to do — producing a plausible continuation — and sometimes the most plausible continuation happens to be false.
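A tiny frequency-based predictor shows how the most plausible continuation can be false. The three-sentence corpus below is invented for this example, and counting continuations is a crude stand-in for what a trained model does, but the failure has the same shape: the common answer wins, whether or not it is correct.

```python
from collections import Counter

# Invented mini-corpus: the wrong answer appears more often than the right one.
CORPUS = [
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is canberra",
]


def most_likely_continuation(prefix, corpus=CORPUS):
    """Return the word that most often follows the prefix in the corpus."""
    counts = Counter()
    for sentence in corpus:
        if sentence.startswith(prefix + " "):
            rest = sentence[len(prefix) + 1:].split()
            if rest:
                counts[rest[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Prints "sydney": the most frequent continuation here, but Canberra is the capital.
print(most_likely_continuation("the capital of australia is"))
```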

The practical implication is crucial: fluency is not evidence of correctness. A response that reads smoothly and sounds authoritative might contain invented citations, incorrect dates, non-existent library functions, or subtly wrong logic. You have to verify on a different axis entirely — by checking against sources, running code, or applying your own domain knowledge.

Temperature: controlling randomness

When the model predicts the next token, it does not always pick the single most probable one. It samples from a probability distribution over possible continuations. Temperature is the parameter that controls how that sampling works:

  • Low temperature (close to 0): the model almost always picks the most probable token. Output is nearly deterministic and predictable, but can feel repetitive.
  • High temperature (closer to 1 or above): the model samples more broadly. Output is more varied and sometimes more creative, but also more likely to drift into incoherence.
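The effect of temperature on the distribution can be seen numerically. In the usual formulation, each raw score is divided by the temperature before the scores are converted to probabilities; the three scores below are hypothetical, chosen only to make the pattern visible.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 nearly all of the probability lands on the top-scoring token; at 2.0 it is spread much more evenly, so lower-ranked tokens get picked far more often.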

Most chat interfaces expose this indirectly, if at all, through settings like "more creative" vs "more precise." You do not usually tune it directly, but knowing it exists explains why running the same prompt twice can give different answers, and why choosing a "more creative" setting can also make the output less accurate.

When you need reliable, repeatable output, such as a code function with specific behaviour or a structured summary, lower-temperature ("more precise") settings produce fewer surprises. Reserve high-temperature settings for genuinely open-ended brainstorming.
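Repeatability can be checked by drawing the same next token many times at two temperatures. This is a sketch under invented assumptions: the candidate tokens, their scores, and the `sample_token` helper are hypothetical, and a real interface may not let you set the temperature or the random seed at all.

```python
import math
import random


def sample_token(tokens, logits, temperature, rng):
    """Draw one token, with scores divided by temperature before weighting."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical candidates for the next token after "The sky is".
tokens = ["blue", "grey", "green"]
logits = [2.0, 1.0, 0.5]

rng = random.Random(42)
for t in (0.05, 1.5):
    picks = {sample_token(tokens, logits, t, rng) for _ in range(200)}
    print(t, sorted(picks))
```

At the very low temperature the 200 draws collapse onto a single token, while at the higher temperature several different tokens appear, which is the difference between repeatable output and varied output.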

Check your understanding

  1. What is a token in the context of a large language model?
  2. True or false: a fluent, confident-sounding AI response is more likely to be factually correct than a hesitant one.
  3. What does a higher temperature setting do to LLM output?

Where to go next

You know what tokens are, how models learn, and why fluency does not equal accuracy. Next: what AI is actually good at — and not so good at — which maps those mechanisms directly onto everyday tasks so you know where to lean on AI and where to double-check it.
