On Agents: Sufficiently Powerful, Necessarily Economical, Occasionally Correct

July 2026

Qwen3:8BFable-5MCPOllama

Quick Brief

  • Experiment: Nvidia claims small language models are the future of agentic AI – in a position paper, without an experiment. So we ran one: a cloud LLM and a local SLM drove the same agent loop against the Austrian company registry, on missions for which a script had already computed the answers.
  • Why it matters: Anyone building an agent has to pick a model, and the smaller ones are cheaper. The question that matters is whether anything breaks when you rely on them.
  • Key finding: The big model did everything right, every time. The small one understood every task just as well, and stumbled only on the mechanical part: getting the answer out. Small models can do the work, but only with more architecture and a tighter workflow wrapped around them.

Context

The model wars are over. Long live the agent wars.

Not long ago, your choice of model was practically a confession of faith. Then the models caught up with one another and became – cliché alert – a commodity. The fog of war on the model front has lifted, and what it reveals is… that the front has merely shifted. To the agentic front.

The route there ran through the early AI workplace – in retrospect, the scene of a remarkable allocation of talent. Expensively educated professionals spent their days copying information into ChatGPT and pasting the output somewhere else – where another expensively educated professional would promptly copy it into ChatGPT again.

At some point, someone noticed that the humans in this workflow were contributing little beyond Ctrl-C and Ctrl-V. If one model's output can simply be passed to another model, why keep the middlemen? Silicon Valley, which has never met a middleman it didn't want to cut out, responded with characteristic enthusiasm. That enthusiasm is now a venture-backed product category: agents.

Businesses are sold, Argentina is sold even harder: its Senate is reviewing a new corporate category in Argentine law, the non-human corporation – entities operated by AI agents or robots, limited liability included. Human shareholders may participate, but are not required.

Unsurprisingly, the AI industry's arms supplier has views on all of this too. Surprisingly, not the ones you would expect. Small Language Models are the Future of Agentic AI, declared Nvidia's research arm in June 2025. SLMs, the paper argues, are "sufficiently powerful, inherently more suitable, and necessarily more economical" for most of what agents do. Not bigger models, smaller ones. From the people selling the chips.

Header of Nvidia's arXiv paper: Small Language Models are the Future of Agentic AI

Your dealer thinks you should cut back.

Since their paper explicitly calls for contributions and critique, we decided to do our part. One last job for the middlemen, then.

Theatre of Operations

An agent is the continuation of a language model by other means. Those other means being: a loop.

In fairness to the hype, the model receives not only a prompt but also a menu of tools it may call as it sees fit. Based on the task, it decides which calls to make – and whatever a tool returns is fed straight back into the model as fresh input. This repeats, call after call, until the model considers the job from the original prompt done, and stops.

Inside the loop, however, still sits a language model – large or small – and it does what language models do: it answers. To test those answers properly, you need questions whose answers are objectively right or wrong, gradable by a script rather than by impression. "Summarise this nicely" cannot be scored. What is needed is something verifiable – right or wrong, nothing in between.

Our go-to dataset, EU regulations, would not do this time: laws can be classified, but hardly graded to the cent. We needed numbers – balance sheets, ratios, birth years – wrapped in just enough words for natural-language questions to bite. Enter the Austrian company registry.

Austria keeps a meticulous inventory of its companies, the Firmenbuch, and every Kapitalgesellschaft – every company whose owners enjoy limited liability – must file its annual accounts there, by law. Better still: Brussels has declared company registers "high-value datasets" and ordered them opened up. The registry now answers machine queries free of charge.

For our experiment, we do not need the whole register. We kept the active GmbHs with at least three consecutive years of accounts, the latest filed in 2025: in total 33,282 companies, converted from the ministry's XML into JSON and loaded into a cloud database, with an MCP server bolted on top so that a language model can access it. Each company is one document: the balance sheet, the income statement where the law requires one, the ratios we compute from both, and the people who sign – role and birth year.

Should you wish to interrogate the Austrian economy yourself: be our guest. Agents get the MCP endpoint, humans get buttons. Free of charge, for as long as our Claude token budget holds.

Screenshot of the Firmenbuch goes Agentic web interface

Designed for humans, inspired by agents.

The theatre is set. Time to inspect the arsenal.

The Arsenal

A good soldier is not judged on enthusiasm but on whether the hill got taken. The same goes for financial analysts, with the hills traded in for spreadsheets: what counts is not the confidence of the delivery, but whether the numbers are right. The agents in this experiment will be judged like analysts.

So the test is an analyst's working day, compressed: questions of escalating difficulty – look up one number, then filter and rank. Each one comes with an answer key, computed in advance.

The answer key comes from a plain Python script. No loop, no tools, no opinions – it skips the MCP entirely (MCPs are for clankers) and asks the database directly, in the database's own language. It costs nothing, answers in under a second, and has no concept of ambiguity. Its one weakness: somebody has to write a new one for every question. That somebody, of course, is the middleman.

First assignment, deliberately the simplest we could write: name the 50 oldest Geschäftsführer in the database. No filters, no ratios, one sort – the registry records each signer's birth year, the script orders by it and takes the top 50. Done, in 0.69 seconds.

A few data points from the list. The ages run from 88 to 100, median 90.5; at the top, born 1926 and still the registered signatory, FN 065005x. (Companies appear here by Firmenbuch number only. Whether GDPR requires this, we do not know – we obey in advance.) One wrinkle for the grading: a crowd of 88-year-olds ties at the bottom of the list, so the grader does not demand one exact set of FN numbers – it checks that all 39 aged 89-plus are present, that every remaining seat holds a genuine 88-year-old, and that the count is 50.

Now the other side. An agent needs three things. A system prompt, fixing its behaviour: answer only from the data, companies by number only, admit what you could not verify. A user prompt: the question. And the ability to call tools – which, in our case, means one thing: an MCP server.

An MCP server is really just an interpreter between two parties that cannot talk to each other. Models speak only text; databases speak only queries. MCP turns the model's worded request into a real query, and the rows back into text the model can read. That is the whole job – and the whole point: build the interpreter once, and anything that writes text can use the database. Flagship or featherweight. We are about to send both.

The agent loop: model, MCP server as interpreter, registry behind

The whole trick, drawn, slightly simplified. The model thinks, the MCP translates, the loop repeats – and the bill grows at every arrow.

So much for the theory. Time to put it into practice.

First Engagement

The flagship goes first: Claude Fable 5 – the dumbed-down but publicly available edition of the model deemed too dangerous to be released.

(Update: yes, we really did run this in the seventy-two hours Fable 5 was public.)

It did not fumble once. One tool call per run – search by GF age, descending, limit 50: exactly the query an analyst would have written – then the answer. We asked three times; all 39 of the 89-plus club present every time, every remaining seat a genuine 88-year-old, and the three lists identical, company for company. The brief asked every model for caveats; the flagship actually filled the box, flagging the silently excluded companies without a recorded age and the tie at the bottom of the list. The simplest task of all, passed with flying colours – in 19 seconds and at $0.35 per run.

That, however, was never Nvidia's claim. Their claim concerns the featherweights. So the same mission went to Qwen3:8B, an 8-billion-parameter open-weight model, running locally on the MacBook Air – the consumer-device deployment Nvidia's paper has in mind.

The first local run took nine minutes, against the flagship's nineteen seconds. Expected. Then the answer came back, and we checked how many of the fifty companies had made it in.

None. Zero out of fifty.

The tool call was identical. The data was identical. The culprit was a default. Ollama, the program serving local models, rations a model's working memory, and the 13,000-token delivery flushed the rules and the question clean out of it. The model, left holding data with no brief, confidently delivered something anyway.

First lesson of the campaign: out-of-the-box gear is not field-ready. One setting later the problem was gone – though fifty slim rows had just filled a third of this model's head, and real databases serve heavier meals. Nvidia's prospectus is silent on portion control. We have a thought or two on it – more on that later.

First, the rematch. This time the data fit. The database had done all the sorting, the MCP had handed over the final, correct list – the model's entire job was reading it back, line by line. Which went fine for ten lines. Then the needle stuck:

..."137249m","281973t","216248m","238144m",
"137249m","281973t","216248m","238144m",
"137249m","281973t","216248m","238144m", ...

Row eleven, forever.

A language model, you see, never copies – it re-writes, predicting every next word from what it sees. The flagship has capacity to spare for bookkeeping: row fourteen, row fifteen, row sixteen. The featherweight has no such luxury; halfway down a list of look-alike codes it loses its place and grabs the nearest pattern, which is the line it just wrote. Not a thinking problem – there was barely anything to think. A writing problem.

Before anyone nominates us for a Turing Award: the stutter has been in the books since 2019, as neural text degeneration. The obvious workaround – let the server write the list into a file and have the model merely hand over the link – is a change to the tools, not to the model.

The first mission's books: the flagship answered three times – correctly and identically each time, in 19 seconds, at $0.35 a run. The featherweight answered five times in all – once on factory settings, four times with the memory fixed – never correctly, never the same way twice; every run after the fix started right and derailed at a different line. Free of charge, admittedly – but at up to 27 minutes a run. Necessarily more economical, as promised. Sufficiently powerful is another matter.

Lesson learned, workaround filed – and one question left open: what does the featherweight do when the writing is within its powers? Mission two was designed to find out.

Second Engagement

The laundromat has acquired almost mythical status in finance-bro folklore, displacing the Goldman internship as the be-all and end-all of financial ambition. The appeal of the asset class of choice rests on recurring cash, minimal inventory, little labour, and customers who not only do the work themselves but pay for the privilege.

Regrettably, the Austrian company registry has no category for laundromats. So we went for the next best thing: a machinery builder in the industrial heartland, run by a Geschäftsführer of advanced years. Mission two, then: "Which machinery companies in Oberösterreich have a Geschäftsführer aged 70 or older?"

Is this a pigeon meme: finance bro mistakes a database query for a laundromat

Fuzzy matching.

For the experiment, the question is convenient twice over: three filters to compose instead of one sort – and an answer short enough that the featherweight's pen should survive it.

The flagship went first. It asked, the MCP answered, and there was the list: five companies, the same five sitting on our answer key – three runs, three identical answers, thirteen seconds and $0.17 apiece. No surprises expected; none granted. The five directors carry 373 years between them, and one of their companies turns €34m of revenue into a €77,000 loss. So much for the laundromat dream.

Then the chosen one – the weight class Nvidia's paper nominates for exactly this work. It took its time, four minutes of thinking out loud per run, at the usual price of nothing – but this time it finished: five codes is a list an 8B can write down. Its five matched the answer key exactly, and matched them again on the second run, and on the third. Given a question its pen survives, the featherweight is reliable.

So, for once, both analysts agreed. The difference sat in the small print underneath. Both had been asked for caveats – the brief demands them. The flagship's box held the two ways the answer could still be quietly wrong, among them the 11,841 companies sitting in "unknown", invisible to any sector search. The featherweight's box was empty. One analyst does what you asked; the other tells you what you should have asked. The $0.17 buys the second.

Two missions completed, out of the infinitely many we could have flown. Time for the after-action review.

What This Suggests

"Sufficiently powerful" survives – provided the answer is short. In our total of 14 runs, both models chose the right tool with the right parameters at the first attempt, every time. The featherweight failed the long list on every attempt, and got the short answer right every time. Same task, different length – and length made the whole difference. Small models do not fail at thinking. They fail at writing it down.

"Inherently more suitable" holds only if the tools suit the model. The featherweight's memory failure was pure configuration: the runtime's default working memory was simply too small for the data, and it failed silently – no error, no warning. The stutter has known countermeasures too – a repetition guard we left untested, and the simpler cure of not making a model photocopy at all: keep results short, and for long lists have the server save a file and let the model return just the link. Small models need tools that carry the heavy text for them.

"Necessarily more economical" depends on how much you run it. The small model costs almost nothing per answer – ours ran on already-sunk costs. But someone has to build the workflow around it, size the memory, and notice when it fails silently. That human time is the real bill: run the agent a hundred times an hour and it vanishes into the volume; run it once a day and you've paid an engineer to save thirty-five cents. The cheap model isn't free – it just moves the cost from the invoice to the payroll.

The triad is missing its fourth line: the expensive model ships an audit trail. Both models got the answer right, every time. Only the flagship also told you where it might be wrong – the missing ages, the 11,841 unclassified companies – caveats that read like an auditor's working notes. When the right answer is cheap, that warning is what you're actually paying for.

So What?

The models, it turns out, were the least interesting part. Both of them – the large one and the small – translated plain-language questions into correct database queries fourteen times out of fourteen; the intelligence was a given. What separated success from failure was the workflow around the model – how much data flows in, how long the answer must be, which default somebody forgot to change.

Which is why, having set out to test it, we end up with nothing that challenges the general path Nvidia's paper outlines. Small models may well be the future of agentic AI – provided their tools stay simple, their inputs stay small, and the architecture around them does the heavy lifting. Those are engineering details, not objections. The future, as so often, ships with some assembly required.

One thing worth noting: everything we tested had a provably right answer. That is precisely the work agents will take over first – the lookups, the filters, the lists. Whether they can do the work that has no answer key, this correspondence cannot say. The nearest we came was the caveats themselves: we asked each model where its own answer might mislead – the one question with no right answer – and the flagship had plenty to say while the featherweight had nothing.

As for everyone whose working day is lookups, filters and lists: time to finally buy that laundromat.