Over the past few months I’ve gone down a deep rabbit hole trying to understand the current AI hype cycle in auditing.
Everywhere you look, AI is being inserted into professional workflows.
Some people believe AI will soon be performing large parts of the audit.
As someone building software used by auditors, I wanted to understand something simple:
How do these systems actually work?
So I spent time reading research papers from AI labs, listening to lectures from professors who study these models, and trying to understand the underlying mechanics. I went straight to the source: no marketing materials from the companies selling you the AI. Remember, these companies spend billions of dollars a year on marketing to convince you that AI solves every problem.
What I learned surprised me. In fact, it completely changed how I think about AI.
The mental model I had before was wrong. I assumed these systems were performing some kind of structured reasoning, something similar to how a professional would solve a problem.
That isn’t really what’s happening. At their core, large language models are doing something much simpler and much stranger. They generate language by repeatedly sampling probabilities.
Or put more simply:
They are rolling a set of extremely sophisticated weighted dice.
And that raises a question every audit partner should probably ask.
Are we comfortable rolling the dice on audit evidence?
Once you understand that idea, the strengths—and limitations—of modern AI become much clearer.
Imagine a long table covered with dice.
But instead of numbers, each side of the die contains accounting or audit terms.

(Photo credit – AI 🙂)
One die might look like this:
audit | evidence | procedures | sampling | testing | documentation
Another might look like this:
cash | bank | reconciliation | statement | deposit | balance
Another might look like this:
revenue | invoice | cutoff | recognition | contract | customer
Each die represents a topic or context.
Now imagine two important things.
First, each die is weighted: some words are far more likely to come up than others.
Second, which die gets rolled depends on the context.
For example, when auditors are discussing cash, the word reconciliation is far more likely to appear than inventory.
So what happens inside the model?
Each time you ask a question, the system is essentially rolling one of these weighted dice to generate the next word.
If the probabilities are well-trained, the output will often look correct.
But it is still—at its core—a probabilistic process.
It is still a roll of the dice.
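The dice metaphor is literal enough to run. Here is a minimal sketch in Python; the faces and weights are invented for illustration (a real model has tens of thousands of "faces" per roll):

```python
import random

# One toy "cash die": each face is a word, each weight its probability.
# Faces and weights are invented for illustration only.
faces   = ["cash", "bank", "reconciliation", "statement", "deposit", "balance"]
weights = [0.30, 0.25, 0.20, 0.10, 0.08, 0.07]

random.seed(42)                                   # fix the seed so the roll repeats
word = random.choices(faces, weights=weights)[0]  # roll the weighted die
print(word)
```

Change the seed and you get a different word. The weights only say which words are likely, not which one you will actually get.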
Suppose you ask an AI tool:
“Auditors should verify that cash…”
The model looks at the context and selects a probability distribution that resembles accounting language.
Conceptually, it picks something like the cash die:
cash | bank | reconciliation | statement | balance | deposit
It then “rolls” the die to generate the next word.
Maybe the sentence becomes:
“Auditors should verify that cash balances reconcile to the bank statement.”
Now the context has changed.
The sentence now includes words like:
auditors | verify | cash | balances | reconcile
So the model selects another probability distribution and rolls again.
Maybe the sentence continues:
“…and investigate any reconciling items identified.”
This process repeats one word at a time until the response is finished.
Conceptually it looks like this:
Prompt
↓
Choose probability distribution
↓
Generate next word
↓
Add word to context
↓
Repeat
Each word is another probabilistic draw.
That’s the roll of the dice.
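That loop is short enough to sketch directly. Everything here (the vocabularies, the weights, the stopping rule) is invented to illustrate the mechanism, not taken from any real model:

```python
import random

# Toy context-dependent dice: the last word chooses which die to roll next.
# All words and weights are invented for illustration.
DICE = {
    "cash":      (["balances", "deposits"],        [0.7, 0.3]),
    "balances":  (["reconcile", "agree", "match"], [0.6, 0.3, 0.1]),
    "reconcile": (["to"],        [1.0]),
    "to":        (["the"],       [1.0]),
    "the":       (["bank"],      [1.0]),
    "bank":      (["statement"], [1.0]),
}

def generate(prompt, max_steps=10, seed=1):
    random.seed(seed)
    words = prompt.split()
    for _ in range(max_steps):
        if words[-1] not in DICE:                  # no die for this context: stop
            break
        faces, weights = DICE[words[-1]]           # choose probability distribution
        words.append(random.choices(faces, weights=weights)[0])  # generate next word
    return " ".join(words)                         # each added word changed the context

print(generate("verify that cash"))
```

Run it with a different seed and the sentence can end differently. That variability is the whole point.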
Modern language models are trained on enormous datasets—essentially a large portion of the internet.
During training, the model learns statistical relationships between words and phrases.
At runtime, it repeatedly solves one problem:
“Given everything written so far, what word is most likely to come next?”
This process is called next-token prediction.
The system is not retrieving answers from a database.
It is generating the most statistically likely continuation of language.
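Next-token prediction can be demonstrated with the simplest possible "model": counting which word follows which in a tiny made-up corpus. Real models learn far richer patterns over vastly more text, but the runtime question is the same:

```python
from collections import Counter, defaultdict

# A tiny invented "training corpus" standing in for a large slice of the internet.
corpus = [
    "auditors verify cash balances",
    "auditors verify bank reconciliations",
    "auditors verify cash deposits",
]

# "Training": count which word follows which (a bigram model).
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1

# "Runtime": given the context, which word is most likely to come next?
def predict_next(word):
    return follows[word].most_common(1)[0][0]

print(predict_next("verify"))  # "cash" (it follows "verify" twice, "bank" once)
```

There is no database of answers here, just statistics about which words tend to follow which.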
Surprisingly, a remarkable range of language tasks can be approximated this way.
But this also explains some of the strange behavior people observe when using AI.
Sometimes it performs brilliantly.
Sometimes it fails at tasks that seem simple.
Researchers often refer to this as jagged intelligence.
A model might solve complex reasoning tasks while struggling with something basic like counting letters in a word.
Why?
Because the model is fundamentally matching patterns in language, not executing formal logic.
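The contrast is easy to demonstrate. Counting letters, the kind of task that can trip up a pattern-matching model, is a one-line piece of deterministic logic that is exact on every run:

```python
# Deterministic logic: the answer is computed, not predicted.
word = "reconciliation"
print(word.count("i"))  # exactly 3, every single time
```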
Many AI tools now show their work.
They produce step-by-step explanations that look like reasoning:
Step 1: Identify the relevant balances
Step 2: Compare to the bank statement
Step 3: Investigate reconciling items
Step 4: Conclude reconciliation accuracy
To a reader, this looks exactly like thinking.
This is often presented as evidence that AI is “reasoning.”
But the research community increasingly understands something important:
These explanations often aren’t the reasoning process.
They are generated explanations after the answer is already produced.
In other words, the model can produce a convincing explanation even when the underlying answer came from statistical pattern matching.
This is why some researchers describe the process as rationalization rather than reasoning.
The explanation may sound logical.
But it is not necessarily the true cause of the answer.
That distinction may not matter for casual use. But it is critical for professions that rely on traceable reasoning and defensible conclusions.
Understanding how these systems work helps clarify where they are useful.
AI performs well at tasks that involve language and interpretation.
But many core audit tasks require something very different.
Audit work frequently demands outputs that are exact, repeatable, and traceable, such as calculations, sample selections, and reconciliations.
These tasks depend on deterministic logic, not probabilistic generation.
In other words, auditors can experiment with probabilistic tools but the parts of the audit that support the opinion still need deterministic systems.
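As a sketch of what "deterministic" means in practice (the function, figures, and field names here are all invented for illustration), a reconciliation check computed in code gives the same, traceable answer on every run:

```python
# A deterministic bank reconciliation check. All figures are invented.
def reconcile(book_balance, bank_balance, outstanding_checks, deposits_in_transit):
    # Adjust the bank balance for timing differences, then compare to the books.
    adjusted_bank = bank_balance - sum(outstanding_checks) + sum(deposits_in_transit)
    difference = round(book_balance - adjusted_bank, 2)
    return adjusted_bank, difference

adjusted, diff = reconcile(
    book_balance=10_250.00,
    bank_balance=10_900.00,
    outstanding_checks=[400.00, 350.00],
    deposits_in_transit=[100.00],
)
print(adjusted, diff)
```

Run it a thousand times and the answer never changes. That repeatability is what audit evidence requires, and what a probabilistic draw cannot promise.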
One mistake people make in the current AI discussion is assuming the future is purely AI-driven.
A more realistic architecture for professional systems is layered: deterministic systems handle the calculations and evidence that support the opinion, while probabilistic AI sits on top for language and interpretation.
Each layer solves a different type of problem.
Understanding the role of each is how firms will use AI safely and effectively.
One of the clearest explanations of these ideas comes from this lecture:
https://www.youtube.com/watch?v=ShusuVq32hc
It walks through the training and generation mechanics step by step.
If you're curious how these systems actually work, it’s worth the watch.
Large language models are remarkable tools.
But at their core, they are probability engines trained to generate language.
Most of the time, the probability distribution points toward a good answer.
Sometimes it doesn’t.
That’s the key idea that changed how I think about AI:
Every response is, in some sense, another roll of the dice.
Understanding that doesn’t make AI less useful.
But it does help us use it more intelligently.