Too Dangerous to Release: When LLMs Meet Invoices, or Why SaaS Will Survive

May 2026

Models tested: Llama-3.1:8B · Qwen3:8B · Gemma-4:31B · QwQ:32B · Llama-3.3:70B

Quick Brief

  • Experiment: Five open-weight language models were tested on two hundred synthetic invoices with cent-perfect ground truth. Forty invoices contained deliberate arithmetic errors.
  • Why it matters: Invoice processing is the use case every AI pitch deck opens with. If a language model and a prompt can replace a €300-per-seat SaaS subscription, the business case writes itself. The numbers are either right or wrong, and the distance between right and wrong can be measured to the cent.
  • Key finding: Building it yourself gets you 80% of the way there. The last 20% is what the enterprise software licence actually pays for – and it is not your core business. The models can read invoices almost perfectly – the best extracted the stated total correctly 95% of the time – but no model corrected more than half of the invoices where the total itself was wrong.

Context

Last month, Anthropic declared its new flagship model, Claude Mythos, too dangerous for public release. Declaring your own language model too dangerous to be released is a tradition in the industry, dating back to at least GPT-2.

Quite what constitutes “dangerous,” however, remains unclear. Unreliable performance on basic arithmetic? Difficulty converting a PDF into usable text? The notorious inability to count the number of “r”s in the word “strawberry”? We would argue: yes!

In other words, we suspect the models too dangerous to be released have already been released. They were simply not marketed that way. And the danger, when it arrives, will not look like danger at all.

Think less Skynet and more triple-A-rated mortgage-backed securities: models producing convenient answers, accepted uncritically by their operators, and used as the basis for downstream decisions. In the case of the financial models, the numbers added up to a world no one had ever lived in. When reality finally corrected the error, the losses were not just theoretical.

Language models have finally democratised this problem. The triple-A rating now ships in every API call.

“I could do that.”

Like the five stages of grief, some patterns of human psychology are so universal they deserve their own framework. The Self-Delusion Cycle is a drama in four acts.

Phase I: Observational Superiority. The conviction, arrived at effortlessly and without evidence, that the thing being observed is simple. The less you know about how it works, the stronger the conviction. Expertise is, at this stage, a handicap – it introduces doubt, which slows down the confident formation of opinions.

Phase II: Euphoria. A rabbit hole opens. Twelve browser tabs. Several YouTube tutorials by, in hindsight, questionable authorities. A growing suspicion that everyone who has been doing this professionally has been overcomplicating things. Opinions form – strong, unsolicited, and unearned.

Phase III: Plateau. The firm opinion meets a not-so-edgy edge case. A second rabbit hole opens, less enjoyable than the first. Contradictions appear, and are quietly suppressed.

Phase IV: Strategic Deferral. The contradictions pile up. The rabbit holes stop being fun. The enthusiasm that carried Phase II has quietly left the building. The project is not abandoned – it is reprioritised. The prototype joins a distinguished archive of things that could absolutely be finished, but other things have become more important.

When it comes to enterprise software, boardroom executives have collectively entered Phase I. Needless to say, we joined them. What follows is what happened when we kept going.

The Fascinating World of Invoices

Invoice processing. A market so large and so boring that you only need to capture 0.01% of it to thoroughly revolutionise the industry. Everybody knows what an invoice is. Every company has a workflow for them, manual or automated. Read a document, extract the numbers, check the maths. A use case so obvious it barely qualifies as an experiment.

It also happens to be the perfect test case. Unlike classification or fine-tuning, there is nothing to argue about. Five units at €1,819.56 is either €9,097.80 or it is not. The number is right or wrong, and the distance between right and wrong can be measured to the cent.
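Scoring at that resolution is straightforward to mechanise. A minimal sketch of a cent-exact comparison, using Python's decimal module to avoid floating-point drift (the function names are ours, not the harness's):

```python
from decimal import Decimal

def cents(amount: str) -> int:
    """Convert a euro amount given as a decimal string to integer cents."""
    return int(Decimal(amount) * 100)

def exact_match(expected: str, reported: str) -> bool:
    """True only when the two amounts agree to the cent."""
    return cents(expected) == cents(reported)

# Five units at EUR 1,819.56 each:
assert Decimal("1819.56") * 5 == Decimal("9097.80")
```

Comparing integer cents sidesteps the classic binary-float problem, where a naive floating-point comparison can misscore an amount that is actually correct.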

The problem is that real invoices are confidential. No company hands its accounts payable folder to a stranger with a laptop. So we built our own. Two hundred synthetic invoices, generated from scratch, each with a cent-perfect ground truth and a known set of failure modes baked in.

[Figure: sample invoice INV-2026-0075 from Dunder Mifflin Europe GmbH, billing for photocopier toner and AI Strategy workshops. The stated total is off by €3,159. Complaints to the Assistant to the Regional Manager.]

The things that make real invoices difficult are mundane, which is precisely why we built the corpus around them. Three number formats: English (1,000.00), German (1.000,00), Swiss (1’000.00), because European accounting has its dialects. Four ways to phrase VAT. Five ways to phrase a discount. Credit notes with negative totals. Reverse-charge mechanisms. Mixed VAT rates.
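Handling those dialects deterministically is itself fiddly. Here is a sketch of a normaliser for the three formats, assuming amounts arrive as plain strings; genuinely ambiguous cases (a lone "1.000") are deliberately left out:

```python
import re
from decimal import Decimal

def normalise_amount(raw: str) -> Decimal:
    """Parse English (1,000.00), German (1.000,00) and Swiss (1'000.00)
    formatted amounts into a Decimal."""
    s = raw.strip().replace("\u2019", "'")  # typographic apostrophe
    s = s.replace("'", "")                  # Swiss thousands separator
    if "," in s and "." in s:
        # Whichever separator occurs last is the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")   # German
        else:
            s = s.replace(",", "")                     # English
    elif "," in s:
        # A comma followed by exactly two trailing digits is a decimal comma.
        s = s.replace(",", ".") if re.search(r",\d{2}$", s) else s.replace(",", "")
    return Decimal(s)
```

A German comma mistaken for a thousands separator is exactly how an amount gains or loses three orders of magnitude.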

On top of that, forty invoices contained deliberate arithmetic errors, to see whether the model checks the maths or just nods along. Real-world error rates are lower – but nobody stress-tests their accountant by only sending them correct invoices.

Every invoice is a plain text Markdown file. No PDFs, no scans, no OCR. The model receives the invoice as language – the one thing it is supposed to be good at. If it fails, it is not because a scanner mangled a pixel. It is because the model is failing at the one thing it was built to do.

Each model got two prompts. The first – autopilot – asks the model to do everything: read the invoice, do the maths, report a total. The second – hybrid – asks it only to read, and passes the extracted numbers to a Python script that handles the arithmetic. The obvious assumption: let the model read, let code calculate. Safer, surely.
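The hybrid split can be sketched roughly as follows – the field names and schema are our illustration, not the exact prompt we used:

```python
from decimal import Decimal

def compute_total(fields: dict) -> Decimal:
    """Deterministic arithmetic on whatever the model extracted.
    Expects unit prices, quantities, a discount percentage and a VAT rate,
    all as decimal strings."""
    net = sum(
        Decimal(p) * Decimal(q)
        for p, q in zip(fields["unit_prices"], fields["quantities"])
    )
    net -= net * Decimal(fields["discount_pct"]) / 100
    gross = net * (1 + Decimal(fields["vat_rate"]) / 100)
    return gross.quantize(Decimal("0.01"))
```

The arithmetic itself is now exact; the open question is whether the model can fill the dictionary correctly in the first place.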

Phase I – Nine Out of Ten

An invoice is unstructured text with numbers in it. Reading unstructured text is, at least nominally, what language models do.

The evergreen of open-source AI is Meta’s Llama 3.1 8B Instruct. Eight billion parameters, served locally with Ollama, the model that shows up in every tutorial, every conference demo, every breathless LinkedIn post. Free to download, free to run, free to point at two hundred invoices. No cloud, no bill, no permission from IT.

The only constraint worth noting: Meta’s licence requires permission once your product exceeds seven hundred million monthly active users – a threshold we felt comfortable we would not breach on our first afternoon, but definitely one to keep in mind going forward.

This is the Phase I version of enterprise AI: a free model, consumer hardware, and the conviction that the expensive infrastructure everyone else uses is mostly a racket. Two hundred invoices. Twenty-five seconds each. The expensive hardware might buy you speed, but the payment term on an invoice is thirty days.

Open the CSV. Sort by exact_match. Top of the column: 172 out of 200.

                                   Llama 8B (plain)
Produced an answer                 99%
Read stated total correctly*       86%
  (of 200 invoices)
Flagged a broken invoice*          18%
  (of 40 with deliberate errors)
Corrected a broken invoice*        0%
  (of 40 with deliberate errors)
Time per invoice                   25s
Hardware                           MacBook Air
Running cost                       Free

* Of invoices where the model produced an answer.

Eighty-six per cent. An afternoon of work, a free model, no training data, no fine-tuning, no domain expertise – and the model read the stated total correctly on nearly nine invoices out of ten. Well, that was easy. We are the Steve Jobs of invoicing.

Phase II beckoned. At some point the question is no longer whether to sell to SAP – it is whether to buy them.

On closer inspection, however, came a quieter realisation. The autopilot test was rigged in the model’s favour – the invoice states its total in plain text, at the bottom, in bold. The model could simply read it and report it back. An open-book exam with the answer on every page.

And that is exactly what it did. On the 160 invoices where the stated total was correct, Llama read it back with 86% accuracy. On the 40 where the total was deliberately wrong, it read the wrong number back just as reliably – 82%. It did not check the arithmetic. It copied what it saw.

The hybrid pipeline, where the model extracts ten individual fields and hands them to code, fared even worse. One misread unit price, one wrong discount, and the script computes a confidently wrong total. Good maths on bad inputs. Of forty deliberately broken invoices, the model flagged seven – worse than a coin flip.

Still, 86% read correctly. Nobody expects a prototype to be perfect – that is, by definition, what the next iteration is for.

Phase II – Surely, Thinking Helps

The next iteration already exists. Reasoning models – models that think before they answer, show their working, pause and reconsider. If the problem is arithmetic, a model built to reason should be the fix.

The go-to choice in the open-source reasoning space is Alibaba’s Qwen3-8B. Eight billion parameters, same weight class as Llama, built-in thinking mode. Thinking, as it turns out, takes time. On the laptop, that meant four minutes per invoice – potentially twenty-seven hours for the full run.

A hardware upgrade, then – but not the whole circus. A single Nvidia H100 rented from Scaleway at €2.73 per hour, served with vLLM. Thirty-two seconds per invoice. The results were downloaded and analysed. Then analysed again. Forty-three per cent of responses were unparseable.

On 43% of invoices, the model opened a <think> tag and never closed it. No JSON. No answer. Just reasoning that trailed off into nothing. Two independent runs confirmed: the model reasoned itself into silence.
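Scoring a model that sometimes never stops thinking means the harness has to tolerate unterminated tags. A sketch of the kind of defensive parsing this forces (illustrative, not our exact harness code):

```python
import json
import re

def parse_response(raw: str):
    """Strip <think> blocks – closed or abandoned – then try to pull the
    first JSON object out of what remains. None means no usable answer."""
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    if "<think>" in text:            # opened, never closed: drop the rest
        text = text.split("<think>", 1)[0]
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

An unclosed think tag simply yields None – which is how 43% of Qwen’s responses were scored.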

Of the responses that did come back, however:

                                   Llama 8B      Qwen 8B
                                   (plain)       (reasoning)
Produced an answer                 99%           57%
Read stated total correctly*       86%           58%
  (of 200 invoices)
Flagged a broken invoice*          18%           45%
  (of 40 with deliberate errors)
Corrected a broken invoice*        0%            5%
  (of 40 with deliberate errors)
Time per invoice                   25s           32s
Hardware                           MacBook Air   1× H100
Running cost                       Free          €2.73/hr

* Of invoices where the model produced an answer.

Even on the 57% of invoices where it produced an answer, it read the stated total correctly just 58% of the time – worse than Llama’s 86%. The reasoning model thinks harder but reads worse. A model that is more accurate on the occasions it functions is not a more accurate model. It is the talented colleague who does not show up to work.

On invoices where VAT was already included in the stated price, Qwen got the total right 10% of the time. The model reasoned about whether to add or subtract VAT – and almost always chose wrong. Thinking, it appeared, was not the same as understanding.

Phase III – Surely, Bigger Helps

Two models in, one lesson learned: reasoning is expensive, slow, and only sometimes helpful. If reasoning does not help, maybe raw capacity will. The H100 was already running.

Strictly speaking, the H100 was unnecessary. Google’s Gemma 4 31B – thirty-one billion parameters, four times Llama’s capacity – fits in twenty gigabytes. A capable laptop could handle it. But we had a rented GPU, and shutting it down felt like admitting something.

No reasoning mode. No thinking tags. No drama. Nineteen seconds per invoice – the fastest yet. Parse rate: 100%. Every single invoice returned valid JSON. Just answers. Google, it appeared, had recovered from its ChatGPT moment. Whether the invoice industry would recover from us remained to be seen.

The leaderboard was beginning to tell a story:

                                   Llama 8B      Qwen 8B       Gemma 31B
                                   (plain)       (reasoning)   (plain)
Produced an answer                 99%           57%           100%
Read stated total correctly*       86%           58%           95%
  (of 200 invoices)
Flagged a broken invoice*          18%           45%           82%
  (of 40 with deliberate errors)
Corrected a broken invoice*        0%            5%            18%
  (of 40 with deliberate errors)
Time per invoice                   25s           32s           19s
Hardware                           MacBook Air   1× H100       1× H100
Running cost                       Free          €2.73/hr      €2.73/hr

* Of invoices where the model produced an answer.

It read the stated total correctly on 95% of invoices – the highest of any model. On broken invoices, it flagged 82% as inconsistent – but only corrected 18%. It could tell something was wrong. It could not tell you what was right. The boring model that just works, except when the invoice does not.

The jump from Llama to Gemma – four times the parameters, a rented GPU – had improved reading from 86% to 95%. But correction went from 0% to 18%. The gap between reading an invoice and checking one was beginning to look permanent. Phase III was no longer approaching. It was here.

Phase III, Continued – Surely, Both Help

The reasoning hypothesis did not die easily. Reasoning did not work at 8B – fine, the model was too small. Surely, at thirty-two billion parameters, a purpose-built mathematical reasoning model would close the gap. The candidate was Alibaba’s QwQ-32B – same weight class as Gemma, same hardware, but this one thinks.

Once again, thinking takes time. In this case, one hundred and three seconds per invoice – eleven and a half hours for the full run. A problem for the pilot phase, not the prototype.

Eighty-one per cent produced an answer – an improvement on Qwen’s 57%, but still nowhere near a plain model’s 99%. What made matters worse: it read the stated total correctly on 58% of invoices – same as Qwen 8B, a model one quarter its size, from a different architecture entirely, despite sharing a vendor.

Two minutes of reasoning per invoice, at five times the cost, producing worse results than the model that answered in nineteen seconds. Twenty-two invoices came back at €0.00 – the model reasoned its way to zero on invoices worth six figures.

                                   Llama 8B      Qwen 8B       Gemma 31B     QwQ 32B
                                   (plain)       (reasoning)   (plain)       (reasoning)
Produced an answer                 99%           57%           100%          81%
Read stated total correctly*       86%           58%           95%           58%
  (of 200 invoices)
Flagged a broken invoice*          18%           45%           82%           38%
  (of 40 with deliberate errors)
Corrected a broken invoice*        0%            5%            18%           0%
  (of 40 with deliberate errors)
Time per invoice                   25s           32s           19s           103s
Hardware                           MacBook Air   1× H100       1× H100       1× H100
Running cost                       Free          €2.73/hr      €2.73/hr      €2.73/hr

* Of invoices where the model produced an answer.

The contradictions were piling up, and they were getting harder to suppress. Phase III, by now, was undeniable. The smaller reasoning model read the stated total correctly on 58% of invoices. So did the bigger one. Four times the parameters, five times the inference time, identical reading accuracy.

We had engineered the prompts, fixed the parsers, adapted the code. All of it amounted to the equivalent of adding “make no mistakes” to a prompt – and the model made mistakes anyway. The thinking model thought for five times longer and got worse results. The boring model won everything. None of this was in the plan.

Phase IV – “Just One More Model, Bro”

The experiment was supposed to be over. But the thought persists – it always does – that maybe you just have not thrown enough parameters at the problem.

From a previous experiment, we still had a quota for an eight-GPU cluster at €23 per hour. The obvious candidate – the model we had already benchmarked in a previous article – was Meta’s Llama 3.3 70B. Nine times larger than the model we started with. Same setup. Same prompts. Same invoices.

Three seconds per invoice. Perhaps more silicon really was the answer after all.

Then we opened the results. Reading accuracy: 70%. Worse than Gemma’s 95%. Nine times the parameters. Eight times the running cost. Except in one respect: Llama corrected 48% of broken invoices, the highest correction rate of any model. It read worse but checked more. A trade-off nobody had asked for.

Five models, four architectures, parameter counts spanning an order of magnitude, hardware bills ranging from free to €23 per hour.

We pulled up the final table:

                                   Llama 8B      Qwen 8B       Gemma 31B     QwQ 32B       Llama 70B
                                   (plain)       (reasoning)   (plain)       (reasoning)   (plain)
Produced an answer                 99%           57%           100%          81%           99%
Read stated total correctly*       86%           58%           95%           58%           70%
  (of 200 invoices)
Flagged a broken invoice*          18%           45%           82%           38%           75%
  (of 40 with deliberate errors)
Corrected a broken invoice*        0%            5%            18%           0%            48%
  (of 40 with deliberate errors)
Time per invoice                   25s           32s           19s           103s          3s
Hardware                           MacBook Air   1× H100       1× H100       1× H100       8× H100
Running cost                       Free          €2.73/hr      €2.73/hr      €2.73/hr      €23/hr

* Of invoices where the model produced an answer.

The answer was unambiguous, and it was not the one we wanted. The best reader (Gemma, 95%) corrected fewer than one in five broken invoices. The best corrector (Llama 70B, 48%) could not read the stated total on three invoices out of ten. No model did both well. €23 per hour and eight GPUs did not buy a better reader than a single GPU at €2.73 – just a better checker that read worse. Not good enough for any production workflow, not even close.

The local dream was over – even the laptop model needed cloud hardware to run a reasoning variant. We could probably get there – with fine-tuning, validation layers, human review – but not like this, and not today.

Three patterns had emerged, none of them the ones we expected. Reasoning does not help – at either size, the thinking model performed worse than the plain one. Size does not help past a point – at least in this case – 70 billion parameters on eight GPUs lost to 31 billion on one. And the hybrid approach – let the model read, let code calculate – only works if the model can read every field correctly. One misread unit price, and the code computes a confidently wrong total.

What This Suggests

Building it is not impossible. Dozens of vendors have done it – and claim 99% accuracy doing so. What is not visible from the outside is everything they had to build around the model: validation layers, exception routing, human review, and the kind of operational infrastructure that nobody maintains unless their core product depends on it.

The gap is what the licence pays for. Validation rules that catch the misread total before it posts. Audit trails. Liability. Compliance. And someone to call at 11pm on the last day of the month, when 170 invoices have come back wrong and month-end close does not balance. The “I can build this myself” project reads the invoice. It does not check it. That gap is the product.
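Some of those validation rules are not exotic. The first line of defence can be as simple as refusing to auto-post anything where the stated and recomputed totals disagree – a sketch, with the routing labels being our invention:

```python
from decimal import Decimal

def route_invoice(stated: Decimal, recomputed: Decimal,
                  tolerance: Decimal = Decimal("0.01")) -> str:
    """Auto-post only when the numbers agree to the cent; everything else
    goes to a human queue instead of being silently trusted."""
    if abs(stated - recomputed) <= tolerance:
        return "auto_post"
    return "human_review"
```

The rule is trivial; staffing the human queue it feeds, month after month, is the part the licence pays for.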

In complex environments, complexity compounds. A discount expressed as a percentage breaks the extraction pipeline. A German comma drops three orders of magnitude. A reasoning model returns zero on a six-figure invoice. Each edge case is mundane on its own. Together, they are the reason enterprise software costs what it does – and why it probably will not go away.

So What

In 2008, no individual mortgage was the problem. The problem was that thousands of them were rated by models nobody had stress-tested, packaged into instruments nobody fully understood, and sold to investors who trusted the label. The models were confident. The ratings were wrong. The losses compounded.

Language models will not cause a financial crisis. But the architecture of the failure is the same: a confident output, accepted without verification, compounding through every downstream system that trusts the output before it. The risk does not scale with the technology. It scales with adoption. Every invoice tool, contract reader, and reconciliation bot that goes into production without validation layers and human review is another node in a system that nobody is stress-testing.

The companies most exposed are the ones least equipped to notice: small enterprises who looked at the €300-per-seat licence, looked at the API, and made the decision to build instead of buy. They will not have the operational infrastructure to catch the errors. They will have the confidence of a model that never hesitates.

The models too dangerous to be released were never the ones that got withheld. They are the ones that produce plausible output, used for downstream decisions, unchecked.

As for our invoice processor – Phase IV had arrived. The project was not abandoned – it was reprioritised. Other things had become more important. Apparently quantisation was the next big thing. We started reading. A browser tab opened.