LLM Limbo: Quantising Gemma 4 to Bits and Pieces

June 2026

Gemma-4 · FP8 · AWQ · GPTQ · HQQ

Quick Brief

Experiment: We took quantised versions of Google’s Gemma 4 and ran them through two tasks – reading invoices as text, and as images – then pushed quantisation further to see how far Gemma 4 can be crushed before it breaks.
Why it matters: Inference is where AI gets expensive. Quantisation – running models in lower precision so they fit on smaller, cheaper hardware – is the standard way to bring the GPU bill down. The pitch (shrink the model by 75%, lose nothing) is appealing and therefore needs testing.
Key finding: Compression had no measurable effect on simple tasks like reading text down to four bits per weight, and the smallest model still read 91 per cent of invoices at three. Complex tasks – reading from images – degraded much faster. The economic upside is not just faster inference but cheaper kit: the same model either fits on smaller, cheaper hardware, or packs many more parallel jobs onto a single H100.

Context

It has become a cliché to say AI is going commodity. Clichés earn their reputation for a reason.

A commodity is what you get when every supplier’s product is basically the same, the price drifts toward marginal cost, customers stop caring who made it, and competition migrates from the shop floor to the back office. The textbook example is electricity: a kilowatt-hour is a kilowatt-hour, nobody pays extra for artisanal electricity, and the customer remembers only the bill.

Language models are heading the same way. The quality gap between providers shrinks each quarter, and at the same time agentic pipelines are breaking the work into many small chained calls, which means no single call has to be brilliant when fifty more follow it. Both pressures push in the same direction: what providers compete on becomes cost per call, latency, and throughput. Welcome to the meter.

Every utility, eventually, has its efficiency moment. For electricity it was the LED bulb: same brightness, a tenth of the watts, identical fitting. The room stayed exactly as bright; electricity consumption quietly dropped by a factor of ten.

For language models, the equivalent has been sitting in plain sight for years. It is called rounding.

Matrix Multiplications All the Way Down

At its core, AI is matrix multiplication. Strip away the chatbot interface, the helpful personality, the carefully tuned refusals and the trillion words of training data, and at the bottom you find one operation, done a lot.

From the outside it feels like magic. Up close, it is high-school linear algebra: the input gets encoded as a list of numbers, multiplied by a matrix of trained weights, fed into the next matrix, and so on for anywhere from a few dozen layers to several hundred. Done at industrial scale, this pedestrian operation has shown an unreasonable effectiveness at imitating the written word.
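A minimal sketch of that "one operation, done a lot" in plain Python. The weights and the layer count are made up for illustration, and real networks put attention, normalisation and non-linearities between the multiplications, all of which are left out here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def forward(layers, vector):
    """Feed the vector through each weight matrix in turn."""
    for weights in layers:
        vector = matvec(weights, vector)
    return vector

# Two toy 2x2 layers; a real model has dozens to hundreds, each far larger.
layers = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[0.5, 0.0], [1.0, 1.0]],
]
output = forward(layers, [1.0, 1.0])
```

Every number in `layers` is a weight, and every weight is something the chip has to fetch from memory before it can multiply.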

Tweet by @forloopcodes: 'i still can't believe this is your virtual girlfriend' over a 3x3 matrix multiplication

High-school linear algebra, in production.

The funny thing about modern AI chips is that they are not really constrained by maths any more. An H100 can multiply numbers together at frankly irresponsible speeds. The problem is getting the numbers to the chip quickly enough.

Picture a kitchen where the chefs can chop a thousand carrots a second, but the runners delivering carrots arrive one at a time. The chefs are fast; the runners are the bottleneck, so the chefs spend most of their day with knives in hand, waiting for the next carrot to arrive. This limitation is known as the memory wall. The H100, currently the de facto industry standard, shifts roughly three trillion bytes a second across its memory bus – a great deal of carrot, still not enough to keep the chefs busy.

Quantisation helps because the chunks of carrot being delivered get smaller: each runner now carries more per trip. A 1024-bit memory bus can carry sixty-four 16-bit weights per cycle, twice as many 8-bit ones, or four times as many 4-bit ones. Same lane, smaller parcels, more throughput.
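The parcel arithmetic can be checked in a few lines. This is a back-of-the-envelope sketch: the bus width and bandwidth are the figures quoted in the text, the weight sizes are the ones reported later in the article, and the streaming time assumes every weight is read exactly once per pass with no batching or caching, which is a simplification.

```python
BUS_WIDTH_BITS = 1024          # bus width quoted in the text
BANDWIDTH_BYTES_PER_S = 3e12   # "roughly three trillion bytes a second"

def weights_per_cycle(bits_per_weight):
    """How many weights of a given width fit across the bus at once."""
    return BUS_WIDTH_BITS // bits_per_weight

def seconds_to_stream(weights_gib):
    """Lower bound on the time to read every weight once from memory."""
    return weights_gib * 2**30 / BANDWIDTH_BYTES_PER_S

per_cycle = {bits: weights_per_cycle(bits) for bits in (16, 8, 4)}
bf16_pass = seconds_to_stream(58.9)   # 31B at full precision
awq_pass = seconds_to_stream(20.3)    # 31B at four bits
```

Under those assumptions one pass over the full-precision weights takes about 21 milliseconds and one pass over the four-bit build about 7. The chefs have not changed; the delivery has.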

Rounding numbers is straightforward. Rounding them without breaking the network is a small academic field of its own. The flavour we care about here is weight quantisation – rounding the model’s parameters after training is done. The field splits into a zoo of acronyms (GPTQ, AWQ, HQQ, K-quants, and a dozen more) – each introduced as it shows up.
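For a feel of what "rounding the parameters" means, here is a minimal sketch of the plainest variant: round-to-nearest with one shared scale per group of weights. It is not GPTQ, AWQ or HQQ, each of which adds its own refinements on top, and the four example weights are invented.

```python
def quantise_group(weights, bits=4):
    """Round one group of weights to signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for four bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantise_group(codes, scale):
    """Turn the stored integers back into approximate weights."""
    return [c * scale for c in codes]

weights = [0.70, -0.33, 0.10, 0.02]
codes, scale = quantise_group(weights)
restored = dequantise_group(codes, scale)
```

Only `codes` and one `scale` per group need to be stored. The smallest weight in this example rounds to zero, which is the sort of loss the fancier methods try to spend more wisely.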

A motivated AI engineer could quantise a model from scratch. The good ones don’t re-invent the wheel. They go to the wheel shop – HuggingFace, in this trade – and on the subject of quantised Gemma 4, it is well stocked. Where it isn’t, we’ll make the wheel ourselves. We will cross that bridge later.

Setting the Bar

David Hasselhoff is unlikely to feature prominently in the history of artificial intelligence. He should. Decades before ChatGPT, he was already holding extended conversations with KITT, the artificially intelligent Pontiac Trans Am from Knight Rider. Hasselhoff was, by some margin, the more replaceable component of the partnership.

He also recorded what may retrospectively qualify as the first mainstream song about weight quantisation: the 1991 single “Do the Limbo Dance”. Admittedly this requires a slightly aggressive reading of the lyrics. But once seen, it cannot be unseen: confronted with a low bar held by two strangers, the Hoff argues that the only correct response is to bend backwards and pass under it. The lower the bar, the better the dancer.

David Hasselhoff, Do the Limbo Dance single cover

The Hoff, visibly thrilled at the prospect of activation-aware per-channel weight quantisation with group-wise scaling factors.

The dancer, when it comes to quantisation, is the model. The bar is the number of bits assigned to each weight. “How low can you go?” turns out to be one of the central operational questions in modern AI: every reduction in precision makes models cheaper to run, easier to fit into memory and faster to serve. Lower the bar too far and the dancer hits the floor – or the model stops working, which in this metaphor amounts to the same thing.

The wheel shop – HuggingFace, again – sells more than wheels. It also stocks datasets, and on this particular shelf we knew what to look for, because we had stocked it ourselves: the 200 synthetic invoices from a previous experiment, each with a known correct total, forty deliberately broken. Each invoice goes in twice – once as plain text, once as an image. Two questions every time. What is the total? Do the numbers add up?

The prompt was intentionally austere: no chain-of-thought, no system preamble, just the document and the question. The point was to take everything else out of the equation and watch what compression alone does to an almost insultingly simple task.
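Scoring a run like this needs very little code. The sketch below assumes each reply should be a JSON object with a numeric `total` field; that field name, the tolerance and the sample replies are hypothetical, not taken from the original harness. It covers only the first question, the total.

```python
import json

def score(responses, expected_totals, tolerance=0.005):
    """Return (parse_rate, accuracy) for a list of raw model replies."""
    parsed = correct = 0
    for raw, expected in zip(responses, expected_totals):
        try:
            total = float(json.loads(raw)["total"])
        except (ValueError, KeyError, TypeError):
            continue  # unparseable replies count as wrong
        parsed += 1
        if abs(total - expected) <= tolerance:
            correct += 1
    n = len(expected_totals)
    return parsed / n, correct / n

replies = ['{"total": 1234.50}', '{"total": 99.0}', 'not json at all']
truth = [1234.50, -99.0, 10.0]   # second invoice is a refund
parse_rate, accuracy = score(replies, truth)
```

Keeping parse rate separate from accuracy distinguishes a model that reads the wrong number from one that has stopped producing structured output altogether.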

The Dancers

Our model of choice is Google’s new kid on the block: Gemma 4. The family comes in four sizes – a 31B flagship, a 26B Mixture-of-Experts sibling, a 4.5B laptop model, and a 2.3B phone-sized featherweight. We left the MoE for another day and stuck to the three dense models.

All members of the family ship with senses bolted on – a vision encoder and an audio tower, on top of what was, until recently, just a next-word predictor. Open-weight phone-sized models that read pictures and process audio are a recent development; not long ago this lived only behind closed APIs. More on the eyes shortly.

Back at the wheel shop, the 31B aisle was well stocked: BF16 from Google, FP8 from Red Hat, AWQ Q4 from QuantTrio. One of each into the trolley. The 4.5B aisle had a community-maintained 4-bit GPTQ build from an individual contributor; that one in too. The 2.3B model had nothing quantised on the shelf at all – we would have to make that wheel ourselves, in a moment.

By 2026 standards the 31B is modest – frontier closed-source models are an order of magnitude bigger – but 59 GiB of weights does not fit on a 16 GB MacBook Air. So we had to resort to our Neocloud of choice, Scaleway, to rent a single Nvidia H100 for €2.73 an hour (served via vLLM in an official Docker image, because pip refused to cooperate). The bar is set. The heavyweight walks up to it first.

Round One: The Heavyweight Bends

Welcome to the warm-up round. A 31-billion-parameter language model is about to read a total off a synthetic invoice – not exactly frontier reasoning, somewhere between asking what is two plus two and what colour is the sky. This is a sanity check, before we start crushing things.

INV-2026-0042 from Umbrella Corporation. In unrelated news, Milla Jovovich, formerly in the killing-zombies line of work, has lately moved into open-source AI.

The bar is set at sixteen bits – full height, no bending required. The model arrives at full precision (BF16, the format Google shipped it in). Two hundred invoices in, two hundred correct totals out. 100 per cent, at 1.10 seconds per invoice. A spectator at the back points out the model is not actually quantised yet, and returns to their drink.

The bar drops to eight bits. To clear it, the model is compressed using FP8 – every weight now stored in half the digits it had before. Uniform rounding, no favourites. 100 per cent, now at 0.77 seconds per invoice. Polite applause.

Bar drops to four bits. The method this time is AWQ – same idea as FP8, but with a twist: a brief calibration pass picks out the weights that matter most and rounds those less harshly than the rest. 100 per cent, in 0.60 seconds. Sanity check complete. Warm-up over.

The quantisation literature is full of elegant curves showing graceful accuracy degradation as precision falls. Ours resembled a ruler laid flat on the page. On a task this simple, fewer bits bought faster inference and nothing else.

The seats are filling up. The next contender approaches the bar – same model, same precision, but this time it has to see the invoice rather than read it. That a language model can see at all is the interesting bit now.

Round Two: The Models Have Eyes

One of the things that makes Gemma 4 special is that it can also see and listen. A language model, as the name suggests, only processes text – pictures have to be converted into something it can read first.

The trick is to bolt on a translator. A vision encoder – a separate neural network – chops the image into patches and converts each into a vector of numbers; a small adapter hands those vectors to the language model in the format it uses for words. The model never sees a picture; it reads a description of the picture, written by an encoder with opinions of its own about what it just looked at. The whole pipeline now depends on that encoder being reliable. If it misreads a digit, the language model has no way to know. The definition of garbage in, garbage out.
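The "chops the image into patches" step, as a minimal sketch in plain Python. The 4x4 image and the 2x2 patch size are illustrative only; Gemma 4's real patch size is not stated here, and a real encoder goes on to push each flattened patch through learned weights, which this sketch omits.

```python
def patchify(image, patch):
    """Split a 2D grid of pixel values into flat, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([
                image[top + dy][left + dx]
                for dy in range(patch) for dx in range(patch)
            ])
    return patches

# A 4x4 "image" cut into four 2x2 patches, row by row.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)
```

A digit that straddles two patches arrives at the language model as two separate vectors, which is one plausible way for small details to go missing.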

Back to the limbo floor. Same 31B, same H100, same prompt – invoices arriving as PNGs instead of plain text. The bar is back at full height.

A new contestant approaches. The dancer reaches the bar. And – oh, what’s this? The dancer touches the bar! A first! The dancer keeps going, clears it: 90.5 per cent at 1.15 seconds per invoice. One hundred and eighty-one read correctly, nineteen missed, before we have touched a single bit.

The replays roll on what went wrong: thousands separators going missing, negative signs falling off, refunds posting as charges. The encoder is fumbling small details; the language model passes whatever it gets along.

Bar drops to eight bits, same FP8 build as before. The score goes up to 91.5 per cent – surprising, but well within noise range. Polite, almost suspicious applause.
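A rough check on "within noise range", treating each invoice as an independent coin flip. That is a simplifying assumption, since both builds saw the same 200 images, and the count of 183 is derived from the reported 91.5 per cent.

```python
from math import sqrt

def standard_error(correct, n):
    """Binomial standard error of an accuracy measured on n items."""
    p = correct / n
    return sqrt(p * (1 - p) / n)

se_bf16 = standard_error(181, 200)   # 90.5 per cent on images
gap = 183 / 200 - 181 / 200          # FP8 minus BF16
```

The standard error comes out at about two percentage points, so a one-point difference between the two builds is smaller than the wobble expected from the sample size alone.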

Bar drops to four bits. The crowd leans in.

0 per cent. The dancer reaches the bar; the bar comes crashing down.

The four-bit AWQ build, it turns out, does not actually support Gemma 4’s vision encoder. Feed it an image and the multimodal pipeline quietly returns nothing useful. The kind of failure a benchmark catches in an afternoon and a production deployment finds three weeks after launch.

Halfway through the card, the scoreboard looks like this:

Model	Bits	Method	What it does	Text	Vision
Gemma 4 31B	16	BF16	Full precision – the model as Google shipped it	100%	90.5%
Gemma 4 31B	8	FP8	Half the bits per weight, pre-rounded by Red Hat	100%	91.5%
Gemma 4 31B	4	AWQ	Quarter of the bits, with a calibration pass to protect the important weights	100%	0%

Three precisions, two modalities, the same H100 and the same prompt. The text column has not moved. The vision column has either held or collapsed.

Time to shrink the dancer.

Round Three: The Mid-Card

A smaller dancer takes the floor. The E4B – 4.5 billion parameters, Google’s “laptop class,” roughly a seventh the size of the 31B. Same architecture, same vision encoder, just less of everything.

Reminder, for anyone tuning in halfway through. A smaller model and a quantised one are two different ways of shrinking. The smaller model has fewer weights; the quantised model has the same weights stored in fewer bits. Both reduce the freight on the memory bus; only the smaller model also reduces the model’s capacity to think.
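The two ways of shrinking, as arithmetic. A sketch only: it multiplies parameter count by bits per weight and ignores scaling factors, unquantised layers and the encoders, so real builds come out somewhat larger than these estimates.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate weight storage in GiB from parameter count and precision."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

big_full = weight_gib(31, 16)      # every weight, full precision
big_quant = weight_gib(31, 4)      # same weights, fewer bits
small_full = weight_gib(4.5, 16)   # fewer weights, full precision
```

Either route takes the 31B's roughly 58 GiB down to something far smaller; only the first keeps all 31 billion weights.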

Two precisions in the bag for the E4B – only sixteen and four bits, nothing in between on the shelf. The four-bit build is GPTQ, from an individual contributor on HuggingFace. GPTQ and AWQ are the same family – both round each weight down to one of sixteen values, both use a small calibration set. The differences between them are largely academic, but for the curious: AWQ identifies the most important weights up front and protects them. GPTQ rounds the model one layer at a time, then quietly adjusts the next layer to make up for whatever the previous layer’s rounding got wrong.

Bar at sixteen bits, text. 99.5 per cent, at 0.21 seconds per invoice. Down to four bits, still 99.0 per cent at the same speed. A model roughly seven times smaller than the heavyweight, quantised to a quarter of its disk size, reading invoices almost indistinguishably from the heavyweight at full precision. The announcer suppresses a yawn.

Invoices switch to pictures. Sixteen bits: 60.5 per cent. A thirty-point gap to the heavyweight on the same images. Four bits: 38.5 per cent. A further twenty-two points off the table for three-quarters of the bits removed. The Q4 build did not destroy the vision pipeline the way the AWQ 31B did. It just made it cumulatively, dispiritingly worse.

Two dancers down. The scoreboard:

Model	Bits	Method	What it does	Text	Vision
Gemma 4 31B	16	BF16	Full precision – the model as Google shipped it	100%	90.5%
Gemma 4 31B	8	FP8	Half the bits per weight, pre-rounded by Red Hat	100%	91.5%
Gemma 4 31B	4	AWQ	Quarter of the bits, with a calibration pass to protect the important weights	100%	0%
Gemma 4 E4B (4.5B)	16	BF16	Full precision	99.5%	60.5%
Gemma 4 E4B (4.5B)	4	GPTQ	Quarter of the bits, rounded layer by layer	99.0%	38.5%

A pattern is crystallising. Compression is gentle on easy tasks and ruinous on complex ones.

Round Four: Off the Shelves

By now the announcer has a question for the room: how far can we shrink a model before it loses the ability to read at all? Three rounds in, none of the off-the-shelf precisions has broken text. To find the floor, we need to keep going.

The smallest dancer takes the floor: the E2B, 2.3 billion parameters, designed for a phone or a low-end laptop GPU. The wheel shop stocked only the full-precision BF16. Nothing quantised. So this round is short.

Sixteen bits, text: 100 per cent, at 0.18 seconds per invoice. Indistinguishable from the heavyweight fourteen times its size. Even the featherweight clears the bar without breaking stride.

Sixteen bits, picture: 12 per cent. Twenty-four invoices out of two hundred. At 2.3-billion-parameter capacity the encoder cannot see properly. Vision appears to be the first capability to leave when you shrink the model. Below a certain size you stop getting a multimodal model and start getting a language model with a hallucinating intern bolted to the side.

The bar cannot go any lower in this room. To find where text reading actually breaks, we head back to the garage and weld a smaller wheel ourselves.

Finding the Floor

Garage doors close. The sanctioned event is over. The rules in here are whatever we can make work.

Before the crushing, the modding. The eyes and ears are dead weight by now – the vision encoder couldn’t read invoices at full precision (12 per cent), the audio tower was never part of the experiment, and both turn out to break the quantisation tool’s memory budget. So we open the chassis and rip them out: 1,411 tensors removed, 600 kept. A text-only language model: 2.3 billion parameters, no eyes, no ears.
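Ripping out the eyes and ears amounts to filtering the checkpoint by tensor name. A minimal sketch, with a plain dictionary standing in for the checkpoint; the prefixes and tensor names are hypothetical, and the real Gemma 4 checkpoint may well name things differently.

```python
# Hypothetical name prefixes, for illustration only.
DROP_PREFIXES = ("vision_tower.", "audio_tower.")

def strip_towers(state_dict, drop_prefixes=DROP_PREFIXES):
    """Split a {tensor_name: tensor} mapping into kept and removed entries."""
    kept, removed = {}, {}
    for name, tensor in state_dict.items():
        target = removed if name.startswith(drop_prefixes) else kept
        target[name] = tensor
    return kept, removed

toy = {
    "vision_tower.blocks.0.weight": [0.1],
    "audio_tower.conv.weight": [0.2],
    "language_model.layers.0.mlp.weight": [0.3],
}
kept, removed = strip_towers(toy)
```

Counting `kept` and `removed` afterwards gives the same kind of tally as the 600 and 1,411 above.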

It took longer than anticipated. Several tools failed – AutoGPTQ wouldn’t compile, GPTQModel crashed on Gemma 4’s odd attention shapes, and so forth. HQQ on the stripped E2B was the one that finally worked. Patience, as it turns out, is the AI engineer’s first virtue.

We picked HQQ: same family as AWQ and GPTQ, but calibration-free. Where the others use a small dataset to learn which weights deserve protection, HQQ skips that step and picks the rounding from a mathematical formula that looks only at the spread of values inside each weight matrix. Quicker, but blinder about which weights actually matter.
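To illustrate what "calibration-free" looks like, here is a sketch of the simplest such scheme: spread the available levels evenly between the smallest and largest weight, with no data involved. This is plain min-max rounding, not HQQ's actual optimisation; it shares only the property of looking at nothing but the weights. The eight weights are invented.

```python
def minmax_quantise(weights, bits):
    """Map weights onto 2**bits evenly spaced levels between min and max."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((w - lo) / step) * step for w in weights]

def mean_abs_error(weights, bits):
    """Average distance between each weight and its rounded version."""
    restored = minmax_quantise(weights, bits)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

weights = [-0.9, -0.4, -0.15, 0.0, 0.1, 0.2, 0.45, 1.0]
err3 = mean_abs_error(weights, 3)   # eight levels
err2 = mean_abs_error(weights, 2)   # four levels
```

Dropping one bit halves the number of levels, and on this toy group the average rounding error more than triples. A toy like this cannot show how a whole model responds, only why each bit matters more the fewer there are.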

And… it worked! At three bits per weight, the stripped E2B fits in 5.95 GB of VRAM, generates valid JSON, and reads 91 per cent of invoices correctly. Not a hundred, but functional. The entire language model – 2.3 billion parameters of compressed weights – fits in the GPU memory of a six-year-old gaming laptop.

As far as we can tell, no one else has published a Gemma 4 this small that still reads. So, not without pride, we uploaded ours to HuggingFace. For anyone who eventually wants Google’s flagship multimodal architecture running on a six-year-old gaming laptop.

The crowd is fired up now. Can the dancer go lower? We set the bar at two bits per weight. We send the dancer through. And – OH.

Two bits. 5.68 GB of VRAM. Parse rate, 0 per cent. Random tokens. No JSON, no coherent English, no structure of any kind. Same architecture, same dataset, same prompt, with one bit per weight less – and the lights went out. Not “somewhat worse.” Not “noisy.” Off. The boundary between functional and dead is exactly one bit wide.

The dancer is on the floor. The bar is being held by no one. We resisted the temptation to upload the 2-bit version to HuggingFace as well, on the grounds that a model returning noise on every input would serve no one. So, with the floor located, on to what it all means.

The Verdict

The scoreboard, with the dust settled:

Model	Bits	Method	What it does	Text	Vision
Gemma 4 31B	16	BF16	Full precision	100%	90.5%
Gemma 4 31B	8	FP8	Half the bits per weight	100%	91.5%
Gemma 4 31B	4	AWQ	Quarter bits, calibrated	100%	0%
Gemma 4 E4B (4.5B)	16	BF16	Full precision	99.5%	60.5%
Gemma 4 E4B (4.5B)	4	GPTQ	Quarter bits, layer-wise	99.0%	38.5%
Gemma 4 E2B (2.3B)	16	BF16	Full precision	100%	12%
Gemma 4 E2B (2.3B)	3	HQQ	Three bits, calibration-free, eyes and ears removed	91%	–
Gemma 4 E2B (2.3B)	2	HQQ	Two bits – dead	0%	–

Same 200 invoices, same H100, same prompt.

Read it bottom to top: two bits is the cliff, three bits the practical floor for text. At four bits and above, text reading is bulletproof across every model size; vision is a different story, holding for some builds and collapsing for others.

So what does quantisation change? Not the answers on a task this simple – reading a total off an invoice barely moved across precisions. Output quality on harder tasks: not tested here. What it changes is the economics of the underlying hardware – in two directions at once: the same model now fits on smaller, cheaper kit, or packs many more parallel jobs onto the same high-end chip.

Model	Precision	Model Weights	Free Memory	Parallel Sessions
Gemma 4 31B	BF16	58.9 GiB	8.2 GiB	1.2×
Gemma 4 31B	Q8 (FP8)	31.5 GiB	35.7 GiB	5.2×
Gemma 4 31B	Q4 (AWQ)	20.3 GiB	46.8 GiB	6.8×
Gemma 4 E4B (4.5B)	Q4 (GPTQ)	9.8 GiB	57.4 GiB	130.9×

At full precision the 31B fills 58.9 of the H100’s 80 gigabytes – barely room for one session. Compressed to four bits, the same model fills 20.3, leaving 46.8 for everything else. The H100 itself has not become faster. It has become a vastly larger room. Same chip, same model, same answers on this task – room for roughly seven parallel sessions where there was barely room for one. The compressed E4B pushes that past a hundred.
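The table can be read as simple arithmetic. The sketch below copies its rows and derives two things: the total memory each run had to divide up, and the free memory per reported session. Treating the session count as free memory divided by a fixed per-session cost is an inference from the numbers, not something the table states.

```python
# (build, weights GiB, free GiB, parallel sessions) copied from the table above.
ROWS = [
    ("31B BF16", 58.9, 8.2, 1.2),
    ("31B FP8", 31.5, 35.7, 5.2),
    ("31B AWQ", 20.3, 46.8, 6.8),
    ("E4B GPTQ", 9.8, 57.4, 130.9),
]

def implied_budget(weights_gib, free_gib):
    """Memory shared between the weights and everything else."""
    return weights_gib + free_gib

def implied_session_cost(free_gib, sessions):
    """Free memory divided by the reported session count."""
    return free_gib / sessions

budgets = [round(implied_budget(w, f), 1) for _, w, f, _ in ROWS]
costs = [round(implied_session_cost(f, s), 2) for _, _, f, s in ROWS]
```

Every row sums to about 67 GiB, and the three 31B rows all imply just under 7 GiB per session. On that reading the budget and the cost of a session are fixed, and the only thing quantisation moves is how much of the budget the weights eat.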

Or, read the other way: the same compressed model now also fits comfortably on far cheaper hardware that could not have run it at full precision in the first place.

Three acts, three models, one experiment. The shape of it, in three observations.

What This Suggests

On reading-style tasks, a heavily compressed model is probably fine. Across every size and every precision down to three bits, every model in the experiment still read the invoice. The output side – whether the model still writes as well, reasons as well, hallucinates no more often – is harder to measure objectively, and we did not test that here.

Hard tasks degrade fastest under compression. Reading a number off a page is easy. Reading it off a picture is harder. Reading a picture and checking the maths is harder still. Compression takes that hierarchy at its word: hard tasks fail first, easy tasks last.

A multimodal model is, primarily, still a text model. No Gemma 4 read images as well as it read text. Hardly a surprise: a language model is built for text; the vision encoder is the part bolted on later. If a task can be reduced to text, reduce it to text.

So What

The takeaway for anyone running models in production is simple. The task determines the kit, not the other way round.

An easy job – reading a number out of text, classifying, extracting a field – runs on a 2.3-billion-parameter model crushed to three bits, the kind that fits on a phone. A medium job – the same number, off a picture – needs a bigger model at full precision. A hard job – reasoning across documents, catching subtle errors – needs the biggest model on the rack and a human in the loop.

Quantisation does not change that hierarchy. It can save hardware in two directions: the same model now runs on cheaper kit, or the high-end chip has room to spare for many more parallel jobs. The point is not to shrink the big model. It is to send each task to the smallest model that still does it.
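"The smallest model that still does it" can be written down as a lookup over the scoreboard. A sketch under one loud assumption: cost is approximated as parameters times bits, which ignores measured memory use and latency. The accuracy figures are the ones from the scoreboard; the short build names are labels made up for this example.

```python
# (name, billions of parameters, bits, text %, vision %) from the scoreboard.
BUILDS = [
    ("31B BF16", 31, 16, 100.0, 90.5),
    ("31B FP8", 31, 8, 100.0, 91.5),
    ("31B AWQ", 31, 4, 100.0, 0.0),
    ("E4B BF16", 4.5, 16, 99.5, 60.5),
    ("E4B GPTQ", 4.5, 4, 99.0, 38.5),
    ("E2B BF16", 2.3, 16, 100.0, 12.0),
    ("E2B HQQ-3", 2.3, 3, 91.0, None),
    ("E2B HQQ-2", 2.3, 2, 0.0, None),
]

def smallest_build(task, min_accuracy):
    """Pick the cheapest build whose measured accuracy meets the threshold."""
    column = {"text": 3, "vision": 4}[task]
    eligible = [b for b in BUILDS
                if b[column] is not None and b[column] >= min_accuracy]
    # Cost proxy (an assumption): parameters x bits, i.e. raw weight bits.
    return min(eligible, key=lambda b: b[1] * b[2], default=None)

text_pick = smallest_build("text", 99.0)
vision_pick = smallest_build("vision", 90.0)
```

With these numbers a 99 per cent text requirement lands on the four-bit E4B, and a 90 per cent vision requirement lands on the FP8 31B. A threshold no build meets returns `None`, which is the cue for a human in the loop.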

The bar has dropped about as low as it can. The dancer is still standing at three bits per weight – two is too low, and there is nothing in between. So the answer to “How low can you go?” turns out to be exactly three bits. The Hoff smiles affirmatively and limbos off into the sunset.

Licensed CC BY 4.0. Quote it, excerpt it, build on it – credit the author and link back.
