Ask ChatGPT to estimate the carbs in your lunch. Now ask it again. And again. Five hundred times.
You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.
That’s the central finding of a study I’ve just published as a preprint, and it has direct implications for anyone using AI-powered carb counting in a diabetes app.
The study
I submitted 13 food photographs — real meals, photographed on a phone, the way you’d actually use them — to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro and Google Gemini 3.1 Pro Preview. Each photo was sent over 500 times to each model. Same prompt every time. Same photo. Same settings.
26,904 queries in total. All at the lowest randomness setting these models offer.
The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example.
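The measurement protocol is simple to sketch. The snippet below is a minimal illustration, not the study's actual harness: `repeated_query` and `fake_model` are hypothetical stand-ins for a real vision-API call made at the model's lowest randomness setting (e.g. temperature 0).

```python
import random
import statistics

def repeated_query(query_fn, image, prompt, n=500):
    """Send the same image and prompt n times; collect carb estimates (g)."""
    return [query_fn(image, prompt) for _ in range(n)]

def coefficient_of_variation(estimates):
    """CV = standard deviation / mean, expressed as a percentage."""
    return 100 * statistics.stdev(estimates) / statistics.mean(estimates)

# Toy stand-in for a model that is *not* deterministic even at its
# lowest randomness setting: estimates scatter around 100 g.
random.seed(0)
def fake_model(image, prompt):
    return random.gauss(100, 11)

estimates = repeated_query(fake_model, "paella.jpg", "Estimate the carbs.")
print(f"CV: {coefficient_of_variation(estimates):.1f}%")  # roughly 11%
```

A deterministic model would give a CV of exactly zero here; none of the four tested models does.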
The models disagree with themselves
Every model returned different carbohydrate estimates for the same photo across repeated queries. But the degree of disagreement varies enormously.

Each dot is one of the 13 test images. The violin shape shows the spread. Claude’s variation clusters below 5% for most images; the Gemini models regularly exceed 10-20%.
| Model | Median variation (CV) | Median insulin swing | Worst-case insulin swing |
|---|---|---|---|
| Claude Sonnet 4.6 | 2.4% | 0.9 U | 13.6 U |
| GPT-5.4 | 8.4% | 2.3 U | 16.6 U |
| Gemini 3.1 Pro | 10.3% | 2.9 U | 16.2 U |
| Gemini 2.5 Pro | 11.0% | 4.7 U | 42.9 U |
The worst case? The paella photo. Here’s what happened when I sent it to each model 500+ times:

Every dot is one query. Same photo. Same prompt. Same model. Gemini 2.5 Pro’s estimates span from 55g to 484g — a 429g range, equivalent to 42.9 units of insulin at a 1:10 ICR. Claude’s estimates cluster tightly by comparison.
42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.
This variation is invisible to you
When you take a photo in a diabetes app, you get one number back. You have absolutely no way to know whether you received a typical estimate or a tail-end outlier from a distribution you can’t see. For Claude, that single number is probably close to the model’s consensus. For Gemini 2.5 Pro, you could be anywhere on the map.
The cheese sandwich that defeats AI

Here’s one that should be easy. Two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). Reference value: 40g. Simple, unambiguous, packet-label accuracy.

Left: Three models independently converge on ~28g — consistently wrong by 12g. Right: GPT-5.4 estimates ~74g with high variability — wrong in the other direction by 34g. The red dashed line is the actual value.
Three of four models — Claude, Gemini 2.5 Pro and Gemini 3.1 Pro — independently converge on approximately 28g for a 40g meal. 510 queries from Claude, CV of 0.3%, and every single one is 12g below the actual value. The bread is right there in the photo. The carb value is on the packet.
This is the “precisely wrong” problem: high consistency doesn’t guarantee accuracy. A diabetes app user getting 28g every time would consistently underdose by ~1.2 units.
GPT-5.4 goes the other way: mean estimate 74g, nearly double the reference, and highly variable on top of it.
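In insulin terms, a fixed carb bias converts directly through the ICR. A quick illustration with the cheese-sandwich numbers (1:10 ICR assumed, as elsewhere in the post):

```python
def dose_error_units(estimate_g, reference_g, icr=10):
    """Signed insulin-dose error: positive = overdose, negative = underdose."""
    return (estimate_g - reference_g) / icr

# Cheese sandwich, reference value 40 g:
print(dose_error_units(28, 40))  # -1.2 -> a consistent 1.2 U underdose
print(dose_error_units(74, 40))  # 3.4  -> a 3.4 U overdose
```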
The models don’t always know what they’re looking at
I found food identification errors in 8 of the 13 test images:
- Bakewell tart: Claude called it a “Linzer torte” in 100% of 510 queries. GPT-5.4 called it a “jam tart” or “cake bar.” Only Gemini 3.1 Pro correctly named it (99.8%).
- Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries.
- Cheese sandwich: Gemini 3.1 Pro added non-existent “deli meat” in 17.4% of queries — hallucinating an ingredient that isn’t there. This could directly inflate carbohydrate estimates.
Some of these misidentifications have modest nutritional impact. Others could change the carbohydrate estimate substantially.
Where does your insulin dose actually land?
On the five images where I had the strongest reference values (packet labels and weighed portions), here’s how often each model’s individual queries would have pushed insulin doses into clinically dangerous territory:

Green is safe (<1U error). Yellow is moderate (1-2U). Orange is clinically significant (2-5U). Red is severe hypo risk (>5U). Claude is the only model with no queries in the orange or red zones.
Claude: 100% of queries in the safe or moderate zone. No single query would have caused more than a 2-unit insulin error.
GPT-5.4: 37% of queries would cause a clinically significant insulin error (>2U). That’s more than one in three queries landing in the danger zone.
Gemini 3.1 Pro Preview: 12% of queries would cause a clinically significant insulin error (>2U). Better than GPT-5.4.
Gemini 2.5 Pro: 12% of queries would cause a >5U error — the threshold associated with severe hypoglycaemia requiring third-party assistance.
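For readers who want to reproduce the banding, here is one way to express the zones in code. How the exact 2 U and 5 U boundaries are assigned is my choice (the milder band), since the figure legend leaves it open.

```python
from collections import Counter

def risk_zone(abs_error_units):
    """Map an absolute insulin-dose error (U) to the study's risk bands.
    Errors of exactly 2 or 5 U go to the milder band -- an assumption."""
    if abs_error_units < 1:
        return "safe"
    if abs_error_units <= 2:
        return "moderate"
    if abs_error_units <= 5:
        return "clinically significant"
    return "severe hypo risk"

# Hypothetical per-query errors for one image:
errors = [0.4, 0.8, 1.1, 2.6, 6.2]
print(Counter(risk_zone(e) for e in errors))
```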
Two types of risk
The study identifies two distinct failure modes:
Systematic bias (chronic risk). All four models overestimate carbs on average, meaning the dominant direction of error is toward too much insulin and hypoglycaemia. GPT-5.4 averages +1.2 units overdose per meal on strong-reference foods. Three meals a day, that’s 3.6 units of extra insulin per day.
Stochastic variability (acute risk). The within-image variation means a single unlucky query could produce a catastrophic outlier. Gemini 2.5 Pro’s worst single query on strong-reference data would have caused an 11.3 unit insulin overdose for a 34g meal. That’s a potential severe hypo.
“But the AI said it was confident”
The prompt I used asks each model to return a confidence score (0 to 1) for every food item it identifies. All four models dutifully returned confidence scores for 100% of items. Surely we can use those to filter out bad estimates?
No. We can’t.
| Model | Mean confidence | Correlation with actual accuracy |
|---|---|---|
| Claude Sonnet 4.6 | 0.80 | r = -0.01 (zero correlation) |
| GPT-5.4 | 0.78 | r = -0.17 (weak) |
| Gemini 3.1 Pro | 0.91 | r = -0.11 (weak) |
| Gemini 2.5 Pro | 0.91 | r = -0.14 (weak) |
Claude’s confidence has literally zero correlation with whether it’s right or wrong. It reports ~0.80 confidence whether it’s nailing the bakery cookie (MAE 1.9g) or getting the cheese sandwich wrong by 12g. Worse: when Claude reports high confidence (above 0.85), its estimates are actually less accurate (MAE 17.3g) than when it reports lower confidence (MAE 9.1g). The confidence score is worse than useless — it’s actively misleading.
Gemini 2.5 Pro reports confidence above 0.9 for 86% of all food items, and Gemini 3.1 Pro for 76%, regardless of whether the estimate is anywhere near correct. That’s not calibrated uncertainty. That’s a model saying “I’m very sure” about everything.
There is one faint signal: Claude’s mean confidence does vary slightly by image — the churros (its most misidentified food) get 0.65, while the crema catalana gets 0.92. But the range is so narrow and the calibration so poor that no diabetes app could meaningfully use these scores to protect users.
The bottom line: the only reliable uncertainty signal comes from querying multiple times and observing the spread. The model’s own confidence score is not a safety mechanism.
What this means for you
The DTN-UK stated earlier this year that generic LLMs must never be used as autonomous advisory calculators for insulin delivery. This data is the quantitative evidence base for that statement.
If you’re using AI carb counting in a diabetes app:
- Don’t trust it blindly. No model tested is safe for unsupervised insulin dosing. Not even Claude.
- Ask it more than once. Query 3-5 times and look at the spread. If the answers vary wildly, the model is uncertain, even if it doesn’t tell you that.
- Check what it thinks it sees. If the model identifies “chicken with stuffing” for your grilled fish, you want to know that BEFORE it calculates your carbs.
- Know your model. The four-fold difference in consistency is real. Claude Sonnet 4.6 is the safest single-query option today, but it can still be precisely wrong.
- An AI that gives you the same wrong answer 500 times is not more trustworthy than one that varies. Consistency is necessary but not sufficient.
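The “ask it more than once” advice can be wired into an app directly: query a few times, return the median, and flag a wide spread. A minimal sketch under assumptions: `query_fn` is a placeholder for whatever vision-API call the app makes, and the 15% spread threshold is illustrative, not validated.

```python
import statistics

def query_with_spread_check(query_fn, image, prompt, k=5, max_rel_spread=0.15):
    """Query k times; return the median estimate and a flag that is True
    when the relative spread suggests the model is uncertain."""
    estimates = [query_fn(image, prompt) for _ in range(k)]
    median = statistics.median(estimates)
    rel_spread = (max(estimates) - min(estimates)) / median
    return median, rel_spread > max_rel_spread

# Canned responses standing in for five real API calls:
canned = iter([55, 180, 120, 310, 95])
median, uncertain = query_with_spread_check(lambda img, p: next(canned),
                                            "paella.jpg", "Estimate the carbs.")
print(median, uncertain)  # 120 True -> wide spread, don't trust one number
```

The median damps single-query outliers, and the flag surfaces exactly the uncertainty that a one-shot answer hides.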
The preprint
The full paper — Reproducibility and accuracy of large language model vision APIs for carbohydrate estimation from food photographs — is available as a preprint PDF here and is being submitted to Diabetologia for peer review. The complete dataset (26,904 query results), analysis code, and test images are available in the study repository (access on request).
Supplementary data: – S1: Full prompt text used for all queries – S2: Distributional statistics for all 52 model-image combinations – S3: Per-image accuracy breakdown – S4: Food identification analysis
Interesting. I always give it info with my pic and it’s been accurate. I say, “starting to eat (brand name) chicken tenders and salad with no carb dressing, here’s my current bg reading, insulin onboard etc”. I usually get a small range, “estimate 20-25g of carbs”, and it breaks down absorption timing. You have to be smarter than the system and input at least the basic info. That’s my take on this. It’s a tool, used by a human. Don’t enter 480g of carbs into your pump if you’ve never eaten 480g of carbs before. Eyeroll.
I’m not going to disagree, but a lot of people don’t use it like that.
Should use the same images and ask people/diabetics to estimate the carb load. Compare the results to the accuracy of the AI models. Without that, this is unfortunately not worth much. You can’t conclude it’s unsafe without the alternative, which may be even less safe.
I think you’ve missed the point. We already know that humans are inaccurate at estimating carb counts. That has been studied many times.
What humans are generally better at is providing much less of a spread on a carb count.
With these models, it’s the variation within a single model that’s the danger. The paella example is a good one, with a huge range of offered responses.
As a user, you don’t see all the variation, so could end up anywhere on the spectrum of values provided, and therefore may have a significantly incorrect guess. But because it came from an LLM query, it’s blindly trusted rather than being questioned.