Technical Notes
Experiments, system design explorations, and technical observations from building AI systems.
-
Too Dangerous to Release: When LLMs Meet Invoices, or Why SaaS Will Survive
May 2026 · #06
Five open-weight language models – from 8B to 70B parameters, four architectures, hardware bills from free to €23 per hour – were tested on two hundred synthetic invoices with cent-perfect ground truth.
→ Read full note
The best model read the stated total correctly 95% of the time – but when the invoice itself was wrong, no model corrected more than half. Reasoning models performed worse than plain models at every size. Neither scale nor thinking closed the gap.
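With cent-perfect ground truth, "reading the stated total" and "correcting a wrong invoice" can be scored as separate behaviours. A minimal sketch of that distinction, assuming each invoice reduces to a list of line-item amounts plus a stated total — the function name and outcome labels are illustrative, not the harness used in the note:

```python
from decimal import Decimal

def check_invoice(line_items, stated_total, extracted_total):
    """Score one extraction against a cent-perfect ground truth.

    Uses Decimal throughout: float arithmetic cannot represent
    amounts like 0.10 exactly, which matters at cent precision.

    Outcomes (illustrative labels):
      'correct'      - invoice is consistent and the model matched it
      'copied_error' - invoice is internally wrong and the model
                       copied the wrong stated total verbatim
      'corrected'    - invoice is wrong and the model recomputed
                       the true sum instead of copying
      'read_error'   - extraction matches neither figure
    """
    true_sum = sum(Decimal(x) for x in line_items)
    stated = Decimal(stated_total)
    extracted = Decimal(extracted_total)

    if stated == true_sum:
        return "correct" if extracted == stated else "read_error"
    if extracted == stated:
        return "copied_error"
    if extracted == true_sum:
        return "corrected"
    return "read_error"
```

The point of separating 'copied_error' from 'corrected' is exactly the headline finding: a model can score 95% on reading totals while never noticing the invoice itself is wrong.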
Building it yourself gets you 80% of the way there. The last 20% is what the enterprise software licence actually pays for – and it is not your core business.
-
Pimp My LM: A Fine-Tuning Tale of Bling and Basic
April 2026 · #05
Three language models were fine-tuned on 64,000 EU regulations to classify legislation into 21 thematic domains. Two BERT models ran on a free GPU. One Llama 8B ran on eight Nvidia H100s.
→ Read full note
All three scored essentially the same – the best results in the series so far. The difference was the bill: less than €10 for the small ones, €83 for the large one. An order of magnitude more expensive for the same result.
Also, the European Parliament’s official classifier for EU regulations does not work. We built three that do.
-
Who Even Needs Nvidia? Classifying EU Laws Without a GPU
April 2026 · #04
The AI industry has spent considerable effort establishing that text classification is a job for large language models. The models are impressive, the hardware is expensive, and the results – it turns out – are not.
→ Read full note
We tested TF-IDF – a method old enough to vote, with no neural network and no understanding of language whatsoever – against four LLMs on 890 EU regulations. It outperformed the best of them by 24 percentage points.
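For readers who have forgotten quite how little machinery TF-IDF needs: a from-scratch sketch in pure Python, classifying by 1-nearest-neighbour cosine similarity over toy tokenised documents. The note's actual pipeline surely used a proper library and classifier; this is illustration only.

```python
import math
from collections import Counter

def fit_tfidf(token_lists):
    """Weight each doc's term counts by inverse document frequency.
    Returns (doc_vectors, idf), vectors as sparse {term: weight} dicts."""
    n = len(token_lists)
    df = Counter(t for doc in token_lists for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}  # terms in every doc get weight 0
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()}
            for doc in token_lists]
    return vecs, idf

def cosine(a, b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(tokens, train_vecs, train_labels, idf):
    """Label a new doc with its nearest training doc in TF-IDF space."""
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}
    best = max(range(len(train_vecs)), key=lambda i: cosine(q, train_vecs[i]))
    return train_labels[best]
```

No embeddings, no GPU, no understanding of language – just term counts, discounted by how common each term is across the corpus.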
The method that counts words beat the method that supposedly understands them. Which raises an uncomfortable question about what we have all been paying for.
-
Regulation Radar: What Four LLMs Made of 890 EU Laws
March 2026 · #03
The EU published 890 pieces of binding legislation in six months – over five million words of regulations, decisions, and directives. We pointed four language models at the pile and checked their classifications against the human librarians who have been tagging EU law since 1995.
→ Read full note
Not one regulation was classified identically by all four models. A 70B reasoning model that pauses to think before answering outperformed a 141B legal specialist trained specifically on law. The biggest model was not the best, and the most confident were not the most correct.
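Checking where models reach consensus, and whether consensus tracks the human reference, takes only a few lines. A hedged sketch, assuming per-model label lists aligned with the reference labels (names are illustrative, not the note's evaluation code):

```python
def consensus_vs_reference(predictions, reference):
    """predictions: one label list per model, each aligned with reference.

    Returns (share of docs where all models give the same label,
             share of those unanimous docs where the consensus
             matches the human reference).
    """
    per_doc = list(zip(*predictions))  # regroup: one label tuple per doc
    if not per_doc:
        return 0.0, 0.0
    unanimous = [(labels[0], ref)
                 for labels, ref in zip(per_doc, reference)
                 if len(set(labels)) == 1]
    rate = len(unanimous) / len(per_doc)
    if not unanimous:
        return rate, 0.0
    accuracy = sum(lab == ref for lab, ref in unanimous) / len(unanimous)
    return rate, accuracy
```

A unanimity rate near zero with high conditional accuracy is exactly the pattern described here: agreement is rare, but informative when it happens.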
But when the models did agree with each other, they tended to agree with the humans too – which turned out to be the more interesting finding.
-
Better Call Saul(LM): Do Bigger Models Actually Agree on Ambiguous Documents?
March 2026 · #02
Six open-source LLMs were asked to classify four ambiguous legal documents as contract or not contract. Same prompt. Same temperature. Same documents.
→ Read full note
One model disagreed with itself on successive runs. The bigger models did not agree more. They disagreed differently.
Scaling up did not resolve the disagreement, but a legal-specialist model came closest to getting it right.
-
When LLMs Disagree: Testing Local Models on Contract Classification
February 2026 · #01
The nice thing about document classification is that it sounds simple.
→ Read full note
The less nice thing is that it becomes considerably less simple the moment you hand the same legal document to three different open-source LLMs running on a laptop – and get two different answers back.
All three models were confident. One said yes, two said no.
The interesting part isn't which one was right. It's what this kind of disagreement tells you about how AI systems actually need to be built.
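One way that lesson feeds into system design: treat cross-model disagreement as a routing signal, auto-accepting unanimous answers and escalating splits to a human. A minimal sketch – the threshold and naming are illustrative, not a prescription:

```python
from collections import Counter

def classify_with_escalation(votes, min_agreement=1.0):
    """votes: labels from independent models for one document.

    Returns (label, route): the majority label plus 'auto' when the
    winning share meets min_agreement, else 'review' to flag the
    document for a human. With the default threshold of 1.0, any
    disagreement at all triggers review.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    share = n / len(votes)
    route = "auto" if share >= min_agreement else "review"
    return label, route
```

For the one-yes-two-no split described above, `classify_with_escalation(["yes", "no", "no"])` returns the majority answer but routes the document to review – the disagreement itself is the useful output.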