Technical Notes
Experiments, system design explorations, and technical observations from building AI systems.
-
Moneyball AI: Running Agents on a Raspberry Pi
July 2026 · #09
Nvidia – which sells the big chips – says agents mostly need small models, and last time we came away half-convinced. The open question was how small you can actually go. So we drafted a bench of sub-billion-parameter models, wired them to the Austrian company registry through an MCP server, and set the cheapest loose on a Raspberry Pi.
→ Read full note
A half-billion-parameter model ran the whole pipeline – picked the tools, chained them, produced the right file, counts identical to a control script – and broke at exactly one step: writing the answer down.
Everything worked but the confirmation message.
-
On Agents: Sufficiently Powerful, Necessarily Economical, Occasionally Correct
July 2026 · #08
Nvidia, vendor of the very large chips that run very large models, has declared that AI agents mostly need small ones. The declaration is a position paper: arguments, references, no experiment.
→ Read full note
So we ran one. A frontier model and an 8-billion-parameter laptop model drove the same agent loop against the Austrian company registry. The flagship: flawless, at a price. The featherweight: free, willing, and able to understand every task – yet defeated by the simplest mission we could write, before acing a harder one.
Nothing failed at thinking. Everything failed in the plumbing.
-
LLM Limbo: Quantising Gemma 4 to Bits and Pieces
June 2026 · #07
Quantisation is rounding. Round a model's weights to fewer bits and one of two things happens: the same model now fits on smaller, cheaper hardware, or the high-end GPU has room to spare for many more parallel jobs. Either way, the bill drops.
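The rounding idea can be sketched in a few lines. This is a minimal illustration, assuming symmetric uniform quantisation with one shared scale per list of weights; it is not necessarily the scheme used on Gemma 4 in the note, and the function names and sample weights are invented for the example.

```python
def quantise(weights, bits):
    """Round each weight to a signed integer code that fits in `bits` bits.

    Assumption: symmetric uniform quantisation with a single scale factor.
    Requires bits >= 2 so that at least one non-zero level exists.
    """
    levels = 2 ** (bits - 1) - 1                       # e.g. 7 at 4 bits
    scale = (max(abs(w) for w in weights) / levels) or 1.0  # avoid dividing by 0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.82, -0.31, 0.05, -0.77, 0.40]  # hypothetical sample weights
    for bits in (8, 4, 2):
        codes, scale = quantise(weights, bits)
        restored = dequantise(codes, scale)
        worst = max(abs(w - r) for w, r in zip(weights, restored))
        print(f"{bits} bits: codes={codes} worst rounding error={worst:.4f}")
```

Fewer bits means fewer available levels, so each weight lands further from its original value. Real quantisers typically use per-block scales and smarter rounding, but the trade-off has the same shape.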
→ Read full note
Google's Gemma 4 – a new open-weight model that reads text and also processes pictures and audio – was put through progressively lower precisions across three sizes, from a 31-billion-parameter flagship down to a 2.3-billion-parameter phone-sized featherweight, to find where compression starts to break things.
Text reading held all the way down to three bits per weight. Vision broke first; at two bits, the model produced random tokens and nothing more. Compression is gentle on easy tasks and ruinous on hard ones – the trick is matching the model to the job, not the other way round.
-
Too Dangerous to Release: When LLMs Meet Invoices, or Why SaaS Will Survive
May 2026 · #06
Five open-weight language models – from 8B to 70B parameters, four architectures, hardware bills from free to €23 per hour – were tested on two hundred synthetic invoices with cent-perfect ground truth.
→ Read full note
The best model read the stated total correctly 95% of the time – but when the invoice itself was wrong, no model corrected more than half. Reasoning models performed worse than plain models at every size. Neither scale nor thinking closed the gap.
Building it yourself gets you 80% of the way there. The last 20% is what the enterprise software licence actually pays for – and it is not your core business.
-
Pimp My LM: A Fine-Tuning Tale of Bling and Basic
April 2026 · #05
Three language models were fine-tuned on 64,000 EU regulations to classify legislation into 21 thematic domains. Two BERT models ran on a free GPU. One Llama 8B ran on eight Nvidia H100s.
→ Read full note
All three scored essentially the same – the best results in the series so far. The difference was the bill: less than €10 for the small ones, €83 for the large one. An order of magnitude more expensive for the same result.
Also, the European Parliament’s official classifier for EU regulations does not work. We built three that do.
-
Who Even Needs Nvidia? Classifying EU Laws Without a GPU
April 2026 · #04
The AI industry has spent considerable effort establishing that text classification is a job for large language models. The models are impressive, the hardware is expensive, and the results – it turns out – are not.
→ Read full note
We tested TF-IDF – a method old enough to vote, with no neural network and no understanding of language whatsoever – against four LLMs on 890 EU regulations. It outperformed the best of them by 24 percentage points.
The method that counts words beat the method that supposedly understands them. Which raises an uncomfortable question about what we have all been paying for.
-
Regulation Radar: What Four LLMs Made of 890 EU Laws
March 2026 · #03
The EU published 890 pieces of binding legislation in six months – over five million words of regulations, decisions, and directives. We pointed four language models at the pile and checked their classifications against the human librarians who have been tagging EU law since 1995.
→ Read full note
Not one regulation was classified identically by all four models. A 70B reasoning model that pauses to think before answering outperformed a 141B legal specialist trained specifically on law. The biggest model was not the best, and the most confident were not the most correct.
But when the models did agree with each other, they tended to agree with the humans too – which turned out to be the more interesting finding.
-
Better Call Saul(LM): Do Bigger Models Actually Agree on Ambiguous Documents?
March 2026 · #02
Six open-source LLMs were asked to classify four ambiguous legal documents as contract or not contract. Same prompt. Same temperature. Same documents.
→ Read full note
One model disagreed with itself on successive runs. The bigger models did not agree more. They disagreed differently.
Scaling up did not resolve the disagreement, but a legal-specialist model came closest to getting it right.
-
When LLMs Disagree: Testing Local Models on Contract Classification
February 2026 · #01
The nice thing about document classification is that it sounds simple.
→ Read full note
The less nice thing is that it becomes considerably less simple the moment you hand the same legal document to three different open-source LLMs running on a laptop – and get two different answers back.
All three models were confident. One said yes, two said no.
The interesting part isn't which one was right. It's what this kind of disagreement tells you about how AI systems actually need to be built.