Technical Notes
Experiments, system design explorations, and technical observations from building AI systems.
-
Too Dangerous to Release: When LLMs Meet Invoices, or Why SaaS Will Survive
May 2026 · #06
Five open-weight language models – from 8B to 70B parameters, four architectures, hardware bills from free to €23 per hour – were tested on two hundred synthetic invoices with cent-perfect ground truth.
→ Read full note
The best model read the stated total correctly 95% of the time – but when the invoice itself was wrong, no model corrected more than half. Reasoning models performed worse than plain models at every size. Neither scale nor thinking closed the gap.
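With cent-perfect ground truth, "reading the stated total" and "correcting a wrong invoice" can be scored as separate behaviours. A minimal sketch of that distinction, assuming each invoice reduces to a list of line-item amounts plus a stated total — the function name and outcome labels are illustrative, not the harness used in the note:

```python
from decimal import Decimal

def check_invoice(line_items, stated_total, extracted_total):
    """Score one extraction against a cent-perfect ground truth.

    Uses Decimal throughout: float arithmetic cannot represent
    amounts like 0.10 exactly, which matters at cent precision.

    Outcomes (illustrative labels):
      'correct'      - invoice is consistent and the model matched it
      'copied_error' - invoice is internally wrong and the model
                       copied the wrong stated total verbatim
      'corrected'    - invoice is wrong and the model recomputed
                       the true sum instead of copying
      'read_error'   - extraction matches neither figure
    """
    true_sum = sum(Decimal(x) for x in line_items)
    stated = Decimal(stated_total)
    extracted = Decimal(extracted_total)

    if stated == true_sum:
        return "correct" if extracted == stated else "read_error"
    if extracted == stated:
        return "copied_error"
    if extracted == true_sum:
        return "corrected"
    return "read_error"
```

The point of separating 'copied_error' from 'corrected' is exactly the headline finding: a model can score 95% on reading totals while never noticing the invoice itself is wrong.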
Building it yourself gets you 80% of the way there. The last 20% is what the enterprise software licence actually pays for – and it is not your core business.
-
Pimp My LM: A Fine-Tuning Tale of Bling and Basic
April 2026 · #05
Three language models were fine-tuned on 64,000 EU regulations to classify legislation into 21 thematic domains. Two BERT models ran on a free GPU. One Llama 8B ran on eight Nvidia H100s.
→ Read full note
All three scored essentially the same – the best results in the series so far. The difference was the bill: less than €10 for the small ones, €83 for the large one. An order of magnitude more expensive for the same result.
Also, the European Parliament’s official classifier for EU regulations does not work. We built three that do.
-
Who Even Needs Nvidia? Classifying EU Laws Without a GPU
April 2026 · #04
The AI industry has spent considerable effort establishing that text classification is a job for large language models. The models are impressive, the hardware is expensive, and the results – it turns out – are not.
→ Read full note
We tested TF-IDF – a method old enough to vote, with no neural network and no understanding of language whatsoever – against four LLMs on 890 EU regulations. It outperformed the best of them by 24 percentage points.
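For readers who have forgotten quite how little machinery TF-IDF needs: a from-scratch sketch in pure Python, classifying by 1-nearest-neighbour cosine similarity over toy tokenised documents. The note's actual pipeline surely used a proper library and classifier; this is illustration only.

```python
import math
from collections import Counter

def fit_tfidf(token_lists):
    """Weight each doc's term counts by inverse document frequency.
    Returns (doc_vectors, idf), vectors as sparse {term: weight} dicts."""
    n = len(token_lists)
    df = Counter(t for doc in token_lists for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}  # terms in every doc get weight 0
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()}
            for doc in token_lists]
    return vecs, idf

def cosine(a, b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(tokens, train_vecs, train_labels, idf):
    """Label a new doc with its nearest training doc in TF-IDF space."""
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}
    best = max(range(len(train_vecs)), key=lambda i: cosine(q, train_vecs[i]))
    return train_labels[best]
```

No embeddings, no GPU, no understanding of language – just term counts, discounted by how common each term is across the corpus.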
The method that counts words beat the method that supposedly understands them. Which raises an uncomfortable question about what we have all been paying for.
-
Regulation Radar: What Four LLMs Made of 890 EU Laws
March 2026 · #03
The EU published 890 pieces of binding legislation in six months – over five million words of regulations, decisions, and directives. We pointed four language models at the pile and checked their classifications against the human librarians who have been tagging EU law since 1995.
→ Read full note
Not one regulation was classified identically by all four models. A 70B reasoning model that pauses to think before answering outperformed a 141B legal specialist trained specifically on law. The biggest model was not the best, and the most confident were not the most correct.
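Checking where models reach consensus, and whether consensus tracks the human reference, takes only a few lines. A hedged sketch, assuming per-model label lists aligned with the reference labels (names are illustrative, not the note's evaluation code):

```python
def consensus_vs_reference(predictions, reference):
    """predictions: one label list per model, each aligned with reference.

    Returns (share of docs where all models give the same label,
             share of those unanimous docs where the consensus
             matches the human reference).
    """
    per_doc = list(zip(*predictions))  # regroup: one label tuple per doc
    if not per_doc:
        return 0.0, 0.0
    unanimous = [(labels[0], ref)
                 for labels, ref in zip(per_doc, reference)
                 if len(set(labels)) == 1]
    rate = len(unanimous) / len(per_doc)
    if not unanimous:
        return rate, 0.0
    accuracy = sum(lab == ref for lab, ref in unanimous) / len(unanimous)
    return rate, accuracy
```

A unanimity rate near zero with high conditional accuracy is exactly the pattern described here: agreement is rare, but informative when it happens.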
But when the models did agree with each other, they tended to agree with the humans too – which turned out to be the more interesting finding.
-
Better Call Saul(LM): Do Bigger Models Actually Agree on Ambiguous Documents?
March 2026 · #02
Six open-source LLMs were asked to classify four ambiguous legal documents as contract or not contract. Same prompt. Same temperature. Same documents.
→ Read full note
One model disagreed with itself on successive runs. The bigger models did not agree more. They disagreed differently.
Scaling up did not resolve the disagreement, but a legal-specialist model came closest to getting it right.
-
When LLMs Disagree: Testing Local Models on Contract Classification
February 2026 · #01
The nice thing about document classification is that it sounds simple.
→ Read full note
The less nice thing is that it becomes considerably less simple the moment you hand the same legal document to three different open-source LLMs running on a laptop – and get two different answers back.
All three models were confident. One said yes, two said no.
The interesting part isn't which one was right. It's what this kind of disagreement tells you about how AI systems actually need to be built.
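One way that lesson feeds into system design: treat cross-model disagreement as a routing signal, auto-accepting unanimous answers and escalating splits to a human. A minimal sketch – the threshold and naming are illustrative, not a prescription:

```python
from collections import Counter

def classify_with_escalation(votes, min_agreement=1.0):
    """votes: labels from independent models for one document.

    Returns (label, route): the majority label plus 'auto' when the
    winning share meets min_agreement, else 'review' to flag the
    document for a human. With the default threshold of 1.0, any
    disagreement at all triggers review.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    share = n / len(votes)
    route = "auto" if share >= min_agreement else "review"
    return label, route
```

For the one-yes-two-no split described above, `classify_with_escalation(["yes", "no", "no"])` returns the majority answer but routes the document to review – the disagreement itself is the useful output.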