Ask ChatGPT to estimate the carbs in your lunch. Now ask it again. And again. Five hundred times.
You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.
That’s the central finding of a study I’ve just published as a preprint, and it has direct implications for anyone using AI-powered carb counting in a diabetes app.
The study
I submitted 13 food photographs — real meals, photographed on a phone, the way you’d actually use them — to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro and Google Gemini 3.1 Pro Preview. Each photo was sent over 500 times to each model. Same prompt every time. Same photo. Same settings.
26,904 queries in total. All at the lowest randomness setting these models offer.
The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example.
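The measurement protocol is simple to sketch. The snippet below is a minimal illustration, not the study's actual harness: `repeated_query` and `fake_model` are hypothetical stand-ins for a real vision-API call made at the model's lowest randomness setting (e.g. temperature 0).

```python
import random
import statistics

def repeated_query(query_fn, image, prompt, n=500):
    """Send the same image and prompt n times; collect carb estimates (g)."""
    return [query_fn(image, prompt) for _ in range(n)]

def coefficient_of_variation(estimates):
    """CV = standard deviation / mean, expressed as a percentage."""
    return 100 * statistics.stdev(estimates) / statistics.mean(estimates)

# Toy stand-in for a model that is *not* deterministic even at its
# lowest randomness setting: estimates scatter around 100 g.
random.seed(0)
def fake_model(image, prompt):
    return random.gauss(100, 11)

estimates = repeated_query(fake_model, "paella.jpg", "Estimate the carbs.")
print(f"CV: {coefficient_of_variation(estimates):.1f}%")  # roughly 11%
```

A deterministic model would give a CV of exactly zero here; none of the four tested models does.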
The models disagree with themselves
Every model returned different carbohydrate estimates for the same photo across repeated queries. But the degree of disagreement varies enormously.

Each dot is one of the 13 test images. The violin shape shows the spread. Claude’s variation clusters below 5% for most images; the Gemini models regularly exceed 10-20%.
| Model | Median variation (CV) | Median insulin swing | Worst-case insulin swing |
|---|---|---|---|
| Claude Sonnet 4.6 | 2.4% | 0.9 U | 13.6 U |
| GPT-5.4 | 8.4% | 2.3 U | 16.6 U |
| Gemini 3.1 Pro | 10.3% | 2.9 U | 16.2 U |
| Gemini 2.5 Pro | 11.0% | 4.7 U | 42.9 U |
The worst case? The paella photo. Here’s what happened when I sent it to each model 500+ times:

Every dot is one query. Same photo. Same prompt. Same model. Gemini 2.5 Pro’s estimates span from 55g to 484g — a 429g range, equivalent to 42.9 units of insulin at a 1:10 ICR. Claude’s estimates cluster tightly by comparison.
42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.
This variation is invisible to you
When you take a photo in a diabetes app, you get one number back. You have absolutely no way to know whether you received a typical estimate or a tail-end outlier from a distribution you can’t see. For Claude, that single number is probably close to the model’s consensus. For Gemini 2.5 Pro, you could be anywhere on the map.
The cheese sandwich that defeats AI

Here’s one that should be easy. Two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). Reference value: 40g. Simple, unambiguous, packet-label accuracy.

Left: Three models independently converge on ~28g — consistently wrong by 12g. Right: GPT-5.4 estimates ~74g with high variability — wrong in the other direction by 34g. The red dashed line is the actual value.
Three of four models — Claude, Gemini 2.5 Pro and Gemini 3.1 Pro — independently converge on approximately 28g for a 40g meal. 510 queries from Claude, CV of 0.3%, and every single one is 12g below the actual value. The bread is right there in the photo. The carb value is on the packet.
This is the “precisely wrong” problem: high consistency doesn’t guarantee accuracy. A diabetes app user getting 28g every time would consistently underdose by ~1.2 units.
GPT-5.4 goes the other way: mean estimate 74g, nearly double the reference, and highly variable on top of it.
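In insulin terms, a fixed carb bias converts directly through the ICR. A quick illustration with the cheese-sandwich numbers (1:10 ICR assumed, as elsewhere in the post):

```python
def dose_error_units(estimate_g, reference_g, icr=10):
    """Signed insulin-dose error: positive = overdose, negative = underdose."""
    return (estimate_g - reference_g) / icr

# Cheese sandwich, reference value 40 g:
print(dose_error_units(28, 40))  # -1.2 -> a consistent 1.2 U underdose
print(dose_error_units(74, 40))  # 3.4  -> a 3.4 U overdose
```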
The models don’t always know what they’re looking at
I found food identification errors in 8 of the 13 test images:
- Bakewell tart: Claude called it a “Linzer torte” in 100% of 510 queries. GPT-5.4 called it a “jam tart” or “cake bar.” Only Gemini 3.1 Pro correctly named it (99.8%).
- Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries.
- Cheese sandwich: Gemini 3.1 Pro added non-existent “deli meat” in 17.4% of queries — hallucinating an ingredient that isn’t there. This could directly inflate carbohydrate estimates.
Some of these misidentifications have modest nutritional impact. Others could change the carbohydrate estimate substantially.
Where does your insulin dose actually land?
On the five images where I had the strongest reference values (packet labels and weighed portions), here’s how often each model’s individual queries would have pushed insulin doses into clinically dangerous territory:

Green is safe (<1U error). Yellow is moderate (1-2U). Orange is clinically significant (2-5U). Red is severe hypo risk (>5U). Claude is the only model with no queries in the orange or red zones.
Claude: 100% of queries in the safe or moderate zone. No single query would have caused more than a 2-unit insulin error.
GPT-5.4: 37% of queries would cause a clinically significant insulin error (>2U). That’s more than one in three queries landing in the danger zone.
Gemini 3.1 Pro Preview: 12% of queries would cause a clinically significant insulin error (>2U). Better than GPT-5.4.
Gemini 2.5 Pro: 12% of queries would cause a >5U error — the threshold associated with severe hypoglycaemia requiring third-party assistance.
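For readers who want to reproduce the banding, here is one way to express the zones in code. How the exact 2 U and 5 U boundaries are assigned is my choice (the milder band), since the figure legend leaves it open.

```python
from collections import Counter

def risk_zone(abs_error_units):
    """Map an absolute insulin-dose error (U) to the study's risk bands.
    Errors of exactly 2 or 5 U go to the milder band -- an assumption."""
    if abs_error_units < 1:
        return "safe"
    if abs_error_units <= 2:
        return "moderate"
    if abs_error_units <= 5:
        return "clinically significant"
    return "severe hypo risk"

# Hypothetical per-query errors for one image:
errors = [0.4, 0.8, 1.1, 2.6, 6.2]
print(Counter(risk_zone(e) for e in errors))
```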
Two types of risk
The study identifies two distinct failure modes:
Systematic bias (chronic risk). All four models overestimate carbs on average, meaning the dominant direction of error is toward too much insulin and hypoglycaemia. GPT-5.4 averages +1.2 units overdose per meal on strong-reference foods. Three meals a day, that’s 3.6 units of extra insulin per day.
Stochastic variability (acute risk). The within-image variation means a single unlucky query could produce a catastrophic outlier. Gemini 2.5 Pro’s worst single query on strong-reference data would have caused an 11.3 unit insulin overdose for a 34g meal. That’s a potential severe hypo.
“But the AI said it was confident”
The prompt I used asks each model to return a confidence score (0 to 1) for every food item it identifies. All four models dutifully returned confidence scores for 100% of items. Surely we can use those to filter out bad estimates?
No. We can’t.
| Model | Mean confidence | Correlation with actual accuracy |
|---|---|---|
| Claude Sonnet 4.6 | 0.80 | r = -0.01 (zero correlation) |
| GPT-5.4 | 0.78 | r = -0.17 (weak) |
| Gemini 3.1 Pro | 0.91 | r = -0.11 (weak) |
| Gemini 2.5 Pro | 0.91 | r = -0.14 (weak) |
Claude’s confidence has literally zero correlation with whether it’s right or wrong. It reports ~0.80 confidence whether it’s nailing the bakery cookie (MAE 1.9g) or getting the cheese sandwich wrong by 12g. Worse: when Claude reports high confidence (above 0.85), its estimates are actually less accurate (MAE 17.3g) than when it reports lower confidence (MAE 9.1g). The confidence score is worse than useless — it’s actively misleading.
Gemini 2.5 Pro reports confidence above 0.9 for 86% of all food items, and Gemini 3.1 Pro for 76%, regardless of whether the estimate is anywhere near correct. That’s not calibrated uncertainty. That’s a model saying “I’m very sure” about everything.
There is one faint signal: Claude’s mean confidence does vary slightly by image — the churros (its most misidentified food) get 0.65, while the crema catalana gets 0.92. But the range is so narrow and the calibration so poor that no diabetes app could meaningfully use these scores to protect users.
The bottom line: the only reliable uncertainty signal comes from querying multiple times and observing the spread. The model’s own confidence score is not a safety mechanism.
What this means for you
The DTN-UK stated earlier this year that generic LLMs must never be used as autonomous advisory calculators for insulin delivery. This data is the quantitative evidence base for that statement.
If you’re using AI carb counting in a diabetes app:
- Don’t trust it blindly. No model tested is safe for unsupervised insulin dosing. Not even Claude.
- Ask it more than once. Query 3-5 times and look at the spread. If the answers vary wildly, the model is uncertain, even if it doesn’t tell you that.
- Check what it thinks it sees. If the model identifies “chicken with stuffing” for your grilled fish, you want to know that BEFORE it calculates your carbs.
- Know your model. The four-fold difference in consistency is real. Claude Sonnet 4.6 is the safest single-query option today, but it can still be precisely wrong.
- An AI that gives you the same wrong answer 500 times is not more trustworthy than one that varies. Consistency is necessary but not sufficient.
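The “ask it more than once” advice can be wired into an app directly: query a few times, return the median, and flag a wide spread. A minimal sketch under assumptions: `query_fn` is a placeholder for whatever vision-API call the app makes, and the 15% spread threshold is illustrative, not validated.

```python
import statistics

def query_with_spread_check(query_fn, image, prompt, k=5, max_rel_spread=0.15):
    """Query k times; return the median estimate and a flag that is True
    when the relative spread suggests the model is uncertain."""
    estimates = [query_fn(image, prompt) for _ in range(k)]
    median = statistics.median(estimates)
    rel_spread = (max(estimates) - min(estimates)) / median
    return median, rel_spread > max_rel_spread

# Canned responses standing in for five real API calls:
canned = iter([55, 180, 120, 310, 95])
median, uncertain = query_with_spread_check(lambda img, p: next(canned),
                                            "paella.jpg", "Estimate the carbs.")
print(median, uncertain)  # 120 True -> wide spread, don't trust one number
```

The median damps single-query outliers, and the flag surfaces exactly the uncertainty that a one-shot answer hides.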
The preprint
The full paper — Reproducibility and accuracy of large language model vision APIs for carbohydrate estimation from food photographs — is available as a preprint PDF here and is being submitted to Diabetologia for peer review. The complete dataset (26,904 query results), analysis code, and test images are available in the study repository (access on request).
Supplementary data: – S1: Full prompt text used for all queries – S2: Distributional statistics for all 52 model-image combinations – S3: Per-image accuracy breakdown – S4: Food identification analysis
Interesting. I always give it info with my pic and it’s been accurate. I say, “starting to eat (brand name) chicken tenders and salad with no carb dressing, here’s my current bg reading, insulin onboard etc”. I usually get a small range, “estimate 20-25g of carbs”, and it breaks down absorption timing. You have to be smarter than the system and input at least the basic info. That’s my take on this. It’s a tool, used by a human. Don’t enter 480g of carbs into your pump if you’ve never eaten 480g of carbs before. Eyeroll.
I’m not going to disagree, but a lot of people don’t use it like that.
Should use the same images and ask people/diabetics to estimate the carb load. Compare the results to the accuracy of the AI models. Without that, this is unfortunately not worth much. You can’t conclude it’s unsafe without the alternative, which may be even less safe.
I think you’ve missed the point. We already know that humans are inaccurate at estimating carb counts. That has been studied many times.
What humans are generally better at is providing much less of a spread on a carb count.
With these models, it’s the variation within a single model that’s the danger. The paella example is a good one, with a huge range of offered responses.
As a user, you don’t see all the variation, so could end up anywhere on the spectrum of values provided, and therefore may have a significantly incorrect guess. But because it came from an LLM query, it’s blindly trusted rather than being questioned.