Think of a vending machine that occasionally dispenses two items or nothing at all. Brilliant when it works, maddening when it doesn’t, and experienced users know exactly which button combination gets the best result. LLMs are a bit like that. Extraordinary tools when you understand their quirks. Frustrating black boxes when you don’t.

Whilst LLMs are great at what they’ve been trained on, they can trip up when asked to do big jobs on new data. Why? They have memory limitations: every model has a finite context window, a hard cap on how much text it can pay attention to at once.

Imagine you are the author of a 500-page novel. You ask the AI to review it. It comes back and says it’s a good piece of work, 85/100, but the middle section needs tightening to be great. “OK, be more specific, which chapters?” It reads them again and says actually, I was wrong, the middle sections are already tight. Press it further and it starts citing plot twists that never happened. Or it mistakes your hero for the villain. This is called a “hallucination”: the model making things up.

So what happened? It never told you it hadn’t read the whole book. In reality it probably read the first 50 pages, the last 50 pages, and a selection in between, then inferred the rest. This is called “lost in the middle”: models naturally pay more attention to the beginning and end of what they’re given and lose focus on what’s buried in between. It delivered its review regardless, and you’d never know unless you probed.

Hallucination isn’t just a memory problem. Models can sometimes hallucinate even on short, simple jobs. It’s not purely about length. But long, unstructured data makes it a lot worse. Hundreds of pages of PDFs are particularly challenging. The trick is to break the document into smaller, organised chunks rather than feeding it everything at once.
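To make the chunking idea concrete, here’s a minimal Python sketch. The chunk size and overlap are arbitrary illustrative numbers, not recommendations from any particular tool or vendor:

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split a long document into overlapping chunks that each fit
    comfortably within a model's context window. Sizes are illustrative."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap so context carries across boundaries
    return chunks
```

Each chunk can then be reviewed on its own, and the per-chunk results combined in a final pass, so no single request asks the model to hold the whole book in mind at once.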

Power User Workarounds

People who push these tools hard develop a feel for their limits, a bit like learning to play an instrument. The trick is usually to break complex tasks down into smaller jobs the model can handle more reliably. Some tools take this further with planning modes that create to-do lists and spin up sub-agents to handle individual pieces, so the model doesn’t lose track of what it’s doing mid-project.

This Matters Especially in Coding

Coding is one of the most prominent use cases for LLMs right now, and it’s where the technology has advanced fastest. Agentic coding tools can now plan, write, test, and iterate on code with minimal human intervention. But bigger, more complex codebases have historically been where things fell apart. The more code a model had to hold in mind, the more mistakes crept in. That’s improved significantly over the past year, and the pace of progress continues to accelerate.

But best practice still matters. A detailed, granular spec gets you a far better result than a vague high-level brief. Break the work into discrete components that you develop in separate sessions. Set out your data model rather than letting the model guess, and settle on an architecture suited to the scale of the project before hitting go.

A Note for the IRRBB Community

If you’re analysing data or running calculations, LLMs aren’t built for numerical precision. They can and do make calculation errors, and with large datasets you’ll hit the same memory limitations we’ve talked about.

The smarter approach is to get the LLM to build you a tool to do the analysis. You’re not using the AI as a calculator. You’re using it to build you a calculator. Tools like Cursor, Codex, Claude Code, and Claude Cowork can build you a working Python application from a detailed specification. It’s like having your own development team. The Python tool handles the data and the maths reliably. The LLM just builds it. But specify it properly. Leave it vague and you’ll get a basic first draft rather than what you actually need.

At the moment, tools like these are likely restricted to IT exploratory sandboxes inside most institutions. But in the longer term, there is incredible value in putting them in the hands of domain experts who understand the problems that need solving.
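As a toy sketch of the kind of deterministic tool an LLM can write for you: the cashflows and discount rate below are invented purely for illustration, and a real IRRBB tool would be specified in far more detail.

```python
def present_value(cashflows, rate):
    """Discount a list of (year, amount) cashflows at a flat annual rate.
    Plain arithmetic: the answer is exact and repeatable, which is the
    point of having code do the maths instead of the LLM."""
    return sum(amount / (1 + rate) ** year for year, amount in cashflows)

# Illustrative only: a 3-year bond paying 5% annual coupons, discounted at 4%.
bond = [(1, 5.0), (2, 5.0), (3, 105.0)]
print(round(present_value(bond, 0.04), 2))
```

Ask the LLM to do this sum in its head and you might get a slightly different answer each time. The script gives the same one every run, on any size of dataset.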

For smaller personal projects, the Claude Excel add-in is pretty good. Inside most banks, third-party plugins are unlikely to be approved, but for working with non-sensitive data at home it’s a useful option. For anything complex, though, the Python tool approach sidesteps both the data size problem and the maths problem in one go.

This is a good example of the broader principle. Don’t push the LLM into territory it’s weak in. Use it to build tools that play to its strengths.

What’s Available Inside Banks

There are two broad types of model. Frontier models (Claude, Gemini, ChatGPT) are the most capable, and the ones this post has largely been describing. Then there are smaller local models that can be hosted on-premise; they’re cheaper to run, more secure, and don’t require data to leave the building. For processing proprietary data, these will generally be the preferred choice inside financial institutions.

What’s available is also dependent on governance. Banks typically have significant approval processes around technology. This means by the time a model gets through the governance cycle, a newer, more capable version has often been released. In practice, many people inside banks are working with models that are a generation or two behind the frontier, and wondering why they’re getting unreliable results. The limitations described in this post are amplified on older, less capable models, so if things feel clunkier than you expected, that’s probably why.

Frontier models are, however, great for research and development. Use them on public data, non-sensitive use cases, or synthetic data. Build tooling to replace your spreadsheets. If you’re exploring what’s possible, prototyping, or trying to understand where the technology is heading, the frontier models are ideal. Just don’t feed them anything you wouldn’t want leaving the building.

It’s worth knowing which model you’re actually using. The gap between a frontier model and an approved internal model can be significant.

The Bottom Line

These tools are still capable of wild hallucinations. But if you understand their limitations and work within their quirks, the outcomes can be mind-blowing. And with every new model release, the need for these workarounds seems to diminish slightly, which generates a lot of excitement among those at the edge of the technology. Some well-known failures have already been resolved in frontier models. LLMs famously couldn’t count the letters in “strawberry”, struggled with basic arithmetic on larger numbers, and fumbled numerical comparisons like whether 9.11 is greater than 9.9. The latest models handle all of these reliably. The pace of improvement is part of the story.
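Those stumbles were never hard problems for ordinary code, which is the whole point of the tool-building approach described above: a few lines of Python settle each of them deterministically.

```python
# Deterministic answers to the classic LLM stumbles.
print("strawberry".count("r"))  # 3
print(9.11 > 9.9)               # False: numeric comparison, not a version number
print(12345 * 6789)             # exact integer arithmetic, no approximation
```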


Stephen Harvey is the founder of irrbb.com and Neuro-XI. For more on the operational challenges these tools might help address, see All Model, No Plumbing.