Over the past few months I’ve gone down a deep rabbit hole trying to understand the current AI hype cycle in auditing.
Everywhere you look, AI is being inserted into professional workflows.
Some people believe AI will soon be performing large parts of the audit.
As someone building software used by auditors, I wanted to understand something simple:
How do these systems actually work?
So I spent time reading research papers from AI labs, listening to lectures from professors who study these models, and trying to understand the underlying mechanics. I went straight to the source: no marketing materials from the companies selling you the AI. Remember, these companies spend billions of dollars a year on marketing to convince you that AI solves every problem.
What I learned surprised me. In fact, it completely changed how I think about AI.
The mental model I had before was wrong. I assumed these systems were performing some kind of structured reasoning, something similar to how a professional would solve a problem.
That isn’t really what’s happening. At their core, large language models are doing something much simpler and much stranger. They generate language by repeatedly sampling probabilities.
Or put more simply:
They are rolling a set of extremely sophisticated weighted dice.
And that raises a question every audit partner should probably ask.
Are we comfortable rolling the dice on audit evidence?
Once you understand that idea, the strengths—and limitations—of modern AI become much clearer.
Imagine a long table covered with dice.
But instead of numbers, each side of the die contains accounting or audit terms.

(Photo credit – AI 🙂)
One die might look like this:
audit | evidence | procedures | sampling | testing | documentation
Another might look like this:
cash | bank | reconciliation | statement | deposit | balance
Another might look like this:
revenue | invoice | cutoff | recognition | contract | customer
Each die represents a topic or context.
Now imagine two important things.
First, each die is weighted: some words are far more likely to come up than others.
Second, which die gets rolled depends on the context.
For example, when auditors are discussing cash, the word reconciliation is far more likely to appear than inventory.
So what happens inside the model?
Each time you ask a question, the system is essentially rolling one of these weighted dice to generate the next word.
If the probabilities are well-trained, the output will often look correct.
But it is still—at its core—a probabilistic process.
It is still a roll of the dice.
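The dice metaphor is literal enough to run. Here is a minimal sketch in Python; the faces and weights are invented for illustration (a real model has tens of thousands of "faces" per roll):

```python
import random

# One toy "cash die": each face is a word, each weight its probability.
# Faces and weights are invented for illustration only.
faces   = ["cash", "bank", "reconciliation", "statement", "deposit", "balance"]
weights = [0.30, 0.25, 0.20, 0.10, 0.08, 0.07]

random.seed(42)                                   # fix the seed so the roll repeats
word = random.choices(faces, weights=weights)[0]  # roll the weighted die
print(word)
```

Change the seed and you get a different word. The weights only say which words are likely, not which one you will actually get.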
Suppose you ask an AI tool:
“Auditors should verify that cash…”
The model looks at the context and selects a probability distribution that resembles accounting language.
Conceptually, it picks something like the cash die:
cash | bank | reconciliation | statement | balance | deposit
It then “rolls” the die to generate the next word.
Maybe the sentence becomes:
“Auditors should verify that cash balances reconcile to the bank statement.”
Now the context has changed.
The sentence now includes words like:
auditors | verify | cash | balances | reconcile
So the model selects another probability distribution and rolls again.
Maybe the sentence continues:
“…and investigate any reconciling items identified.”
This process repeats one word at a time until the response is finished.
Conceptually it looks like this:
Prompt
↓
Choose probability distribution
↓
Generate next word
↓
Add word to context
↓
Repeat
Each word is another probabilistic draw.
That’s the roll of the dice.
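That loop is short enough to sketch directly. Everything here (the vocabularies, the weights, the stopping rule) is invented to illustrate the mechanism, not taken from any real model:

```python
import random

# Toy context-dependent dice: the last word chooses which die to roll next.
# All words and weights are invented for illustration.
DICE = {
    "cash":      (["balances", "deposits"],        [0.7, 0.3]),
    "balances":  (["reconcile", "agree", "match"], [0.6, 0.3, 0.1]),
    "reconcile": (["to"],        [1.0]),
    "to":        (["the"],       [1.0]),
    "the":       (["bank"],      [1.0]),
    "bank":      (["statement"], [1.0]),
}

def generate(prompt, max_steps=10, seed=1):
    random.seed(seed)
    words = prompt.split()
    for _ in range(max_steps):
        if words[-1] not in DICE:                  # no die for this context: stop
            break
        faces, weights = DICE[words[-1]]           # choose probability distribution
        words.append(random.choices(faces, weights=weights)[0])  # generate next word
    return " ".join(words)                         # each added word changed the context

print(generate("verify that cash"))
```

Run it with a different seed and the sentence can end differently. That variability is the whole point.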
Modern language models are trained on enormous datasets—essentially a large portion of the internet.
During training, the model learns statistical relationships between words and phrases.
At runtime, it repeatedly solves one problem:
“Given everything written so far, what word is most likely to come next?”
This process is called next-token prediction.
The system is not retrieving answers from a database.
It is generating the most statistically likely continuation of language.
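Next-token prediction can be demonstrated with the simplest possible "model": counting which word follows which in a tiny made-up corpus. Real models learn far richer patterns over vastly more text, but the runtime question is the same:

```python
from collections import Counter, defaultdict

# A tiny invented "training corpus" standing in for a large slice of the internet.
corpus = [
    "auditors verify cash balances",
    "auditors verify bank reconciliations",
    "auditors verify cash deposits",
]

# "Training": count which word follows which (a bigram model).
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1

# "Runtime": given the context, which word is most likely to come next?
def predict_next(word):
    return follows[word].most_common(1)[0][0]

print(predict_next("verify"))  # "cash" (it follows "verify" twice, "bank" once)
```

There is no database of answers here, just statistics about which words tend to follow which.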
Surprisingly, a remarkable range of language tasks can be approximated this way.
But this also explains some of the strange behavior people observe when using AI.
Sometimes it performs brilliantly.
Sometimes it fails at tasks that seem simple.
Researchers often refer to this as jagged intelligence.
A model might solve complex reasoning tasks while struggling with something basic like counting letters in a word.
Why?
Because the model is fundamentally matching patterns in language, not executing formal logic.
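The contrast is easy to demonstrate. Counting letters, the kind of task that can trip up a pattern-matching model, is a one-line piece of deterministic logic that is exact on every run:

```python
# Deterministic logic: the answer is computed, not predicted.
word = "reconciliation"
print(word.count("i"))  # exactly 3, every single time
```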
Many AI tools now show their work.
They produce step-by-step explanations that look like reasoning:
Step 1: Identify the relevant balances
Step 2: Compare to the bank statement
Step 3: Investigate reconciling items
Step 4: Conclude reconciliation accuracy
To a reader, this looks exactly like thinking.
This is often presented as evidence that AI is “reasoning.”
But the research community increasingly understands something important:
These explanations often aren’t the reasoning process.
They are generated explanations after the answer is already produced.
In other words, the model can produce a convincing explanation even when the underlying answer came from statistical pattern matching.
This is why some researchers describe the process as rationalization rather than reasoning.
The explanation may sound logical.
But it is not necessarily the true cause of the answer.
That distinction may not matter for casual use. But it is critical for professions that rely on traceable reasoning and defensible conclusions.
Understanding how these systems work helps clarify where they are useful.
AI performs well at tasks that involve language and interpretation.
But many core audit tasks require something very different.
Audit work frequently demands outputs that are exact, repeatable, and traceable, such as calculations, sample selections, and reconciliations.
These tasks depend on deterministic logic, not probabilistic generation.
In other words, auditors can experiment with probabilistic tools but the parts of the audit that support the opinion still need deterministic systems.
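As a sketch of what "deterministic" means in practice (the function, figures, and field names here are all invented for illustration), a reconciliation check computed in code gives the same, traceable answer on every run:

```python
# A deterministic bank reconciliation check. All figures are invented.
def reconcile(book_balance, bank_balance, outstanding_checks, deposits_in_transit):
    # Adjust the bank balance for timing differences, then compare to the books.
    adjusted_bank = bank_balance - sum(outstanding_checks) + sum(deposits_in_transit)
    difference = round(book_balance - adjusted_bank, 2)
    return adjusted_bank, difference

adjusted, diff = reconcile(
    book_balance=10_250.00,
    bank_balance=10_900.00,
    outstanding_checks=[400.00, 350.00],
    deposits_in_transit=[100.00],
)
print(adjusted, diff)
```

Run it a thousand times and the answer never changes. That repeatability is what audit evidence requires, and what a probabilistic draw cannot promise.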
One mistake people make in the current AI discussion is assuming the future is purely AI-driven.
A more realistic architecture for professional systems is layered: deterministic systems handle the calculations and evidence that support the opinion, while probabilistic AI sits on top for language and interpretation.
Each layer solves a different type of problem.
Understanding the role of each is how firms will use AI safely and effectively.
One of the clearest explanations of these ideas comes from this lecture:
https://www.youtube.com/watch?v=ShusuVq32hc
It walks through the training and generation mechanics step by step.
If you're curious how these systems actually work, it’s worth the watch.
Large language models are remarkable tools.
But at their core, they are probability engines trained to generate language.
Most of the time, the probability distribution points toward a good answer.
Sometimes it doesn’t.
That’s the key idea that changed how I think about AI:
Every response is, in some sense, another roll of the dice.
Understanding that doesn’t make AI less useful.
But it does help us use it more intelligently.