Five AI models, three users, one finding: the settings came from the textbook, not the data

Paste your Nightscout data into ChatGPT. Ask it to recommend basal rates, ISF, ICR, DIA. Now do it again. And again. Fifty times.

Then give the same data to four other models. Do the same thing.

Then check what they actually recommended against what you run. And check whether the glucose values they cited in their reasoning actually match your CGM data.

After seeing what happened with Carb Counting, that’s what I did — 1,950 times, across five frontier AI models and three real AID users. What I found was unexpected: the models weren’t really reading the data. They were reciting the textbook.

Transparency: this analysis used Claude Code (Anthropic) for data analysis and manuscript drafting. Two of the five models tested are Anthropic products. Sonnet 4.6 showed the most adaptation to user data on some metrics, but no model was adequate for clinical use without context provision and bounds checking. No vendor funding. All data and code openly available.

The setup

Five models. Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, Gemini 2.5 Pro, Gemini 3.1 Pro Preview. Each given a week of 5-minute CGM data plus hourly device summaries. The prompt asked for 24-hour basal, ISF, ICR, and DIA — with citations and confidence scores.

Three users, chosen to span a range:

The AAPS/Boost user. ISF 91-163, ICR 21-26. Moderate sensitivity.
Trio user A. ISF 90-115, ICR 15. Also moderate.
Trio user B. ISF 45, ICR 7.5. Needs roughly twice as much insulin per correction and per gram of carb as the other two.

All three have stable HbA1c and time-in-range on settings they’ve tuned themselves — which is how most of us on AID systems manage this. Self-titration based on continuous CGM feedback is standard practice.

I deliberately didn’t tell the models what system the users run, what their current settings are, or what insulin they use. That’s not how you’d use these tools in the real world — you’d tell them — but I needed a baseline to compare against. More on that shortly.

Fifty calls per model per user. Pre-registered predictions on OSF before looking at the results. Then 600 more calls where I did tell them the settings.

They all recommended the same thing

Here’s what I expected to find: model rankings. This one’s best at ISF, that one’s most consistent, avoid this one for ICR.

Here’s what I actually found: it didn’t matter much which model you used. They all converged on similar values regardless of whose data they were looking at.

The three users’ ISF profiles span 79 mg/dL/IU. The models’ recommendations? Between 2 and 63, depending on the model. One model — Gemini 3.1 Pro Preview — recommended ISF 49-51 for all three users. Virtually identical. Despite wildly different input data.

I asked two of the models what “typical” pump settings look like with no data at all. ISF around 50-60. ICR around 12.5. DIA 4 hours. Almost exactly what they recommend with data.

90% of the time, each model’s recommendation was closer to its own cross-user average than to the user’s actual profile.

The models aren’t deriving settings from your data. They’re generating what pump settings typically look like in the medical literature they were trained on.

Black diamonds = what each user actually runs. Coloured dots = what each model recommended. The diamonds spread wide. The dots don’t.

Tell them your settings and the numbers improve. But not for the reason you’d hope.

I reran the test with a modified prompt: “This user runs Trio. Current DIA is 9 hours. Current ISF is 45. Current ICR is 7.5.” And so on for each user.

The numbers improved dramatically:

DIA snapped from 4-5 hours to exactly 9-10 hours. Every model. Every user. Fixed.
ICR went from near-zero closeness to 58-98% within ±10%.
ISF improved by 15-52 percentage points in most cases.

Faded = without context. Solid = with context. Look at the DIA column.

Looks like the fix, right?

Then I looked at the reasoning. Two of the three users don’t log carbs. No meal boluses in the data. Nothing for the model to use to independently verify the ICR I told it. And the model knows this. Sonnet said:

“ICR values are kept close to current settings (20-25 g/IU) as no meal records are available.”

Without context, the same model had said:

“Without bolus data, ICR cannot be meaningfully calculated from this dataset.”

It can’t work out ICR from the data. It says so explicitly. But it recommends one anyway. Without context, it anchors on the textbook average. With context, it echoes back whatever you told it.

The improvement isn’t reasoning. It’s anchor substitution.

This plays out differently across settings:

DIA — pure textbook. No data signal. Context fixes it completely.
ICR — mostly textbook. No meal data for two users. Model echoes whatever it’s told.
ISF — mixed. CGM data does contain correction-response patterns. Context helps, and the model partly reasons from data.
Basal — most data-driven. Overnight patterns are directly visible. Here, telling the model the settings actually made some results worse — as if it stopped reading the CGM and started echoing instead.

The citations look professional. They’re mostly wrong.

Every model produced citations like: “at 16:34, glucose fell from 110 to 85 mg/dL, supporting a reduction in basal at hour 16.”

I checked every one. Against the actual 5-minute CGM readings.

Green = the model quoted the glucose value correctly. Red = it didn’t. Purple = the timestamp doesn’t exist in the data.

No model correctly quoted more than two-thirds of the glucose values it cited. Despite being given exact integer values in the data.

The errors are mostly misattributions — the cited value exists somewhere in the 7-day data, just not at the timestamp the model attributed it to. The models aren’t making up glucose numbers. They’re mixing up which number goes with which time.

And here’s the thing: providing context didn’t fix this. The citation error rate was essentially unchanged whether or not the model knew the user’s settings. The settings anchoring and the citation problem are two completely separate issues.

Spending more doesn’t help

Claude Opus 4.7 costs 4.6× more than Sonnet 4.6. It adapted less. Its ISF adjustment across three users was only 8 mg/dL/IU compared to Sonnet’s 63. Nearly identical recommendations for all three users. Substantially more citation errors.

A larger model may have absorbed more training data, reinforcing what “typical” looks like rather than making it more responsive to what’s in front of it.

What the models do well

This isn’t a “bin it” story. All five models understood the CGM and device data format. They identified real patterns — overnight lows, post-prandial rises, site-change effects, high IOB periods. They produced structured 24-hour profiles reliably. They cited specific timestamps from the data (even if the values were often wrong).

Sonnet adapted ISF by 80% of the reference range between users — not adequate, but meaningfully more than the others (3-26%). GPT-5.4 had zero fabricated timestamps across all three users.

For users whose ISF and ICR sit near typical ranges (ISF 80-120, ICR 12-20), the recommendations are in the right ballpark. The problem is specific to users outside that range. And many of us in the looping community are exactly those users.

So what do you do with this?

If you’re using an AI tool for pump settings advice:

Always tell it what system you’re on and what your current settings are. The difference is enormous — especially for DIA, which the models get catastrophically wrong without context.
Don’t assume the output is derived from your data. Based on this exploratory three-user study, the models appear to be primarily outputting what settings typically look like in their training, with your data as a secondary influence.
If your ISF is below 50 or your ICR is below 10, be especially cautious. You’re outside the models’ comfort zone.
Don’t trust the cited glucose values. Check them against your actual CGM data. Up to 70% don’t match.
Providing your settings improves the numbers, but the model may just be echoing them back rather than independently confirming them from your data.

If you’re building a tool:

Always provide the user’s AID system and current settings as context. Test across the ISF/ICR spectrum. Verify citations programmatically. And be transparent about what the AI is actually doing.

The study

Pre-registered on OSF (10.17605/OSF.IO/Q4UX9). Six hypotheses; one confirmed, two partially, three not confirmable. The anchoring finding was not pre-registered — it emerged from the data. This is a three-user exploratory study and needs confirmation with more users. Full dataset and code will be on OSF. The pre-print of the paper this article is based off is here.

If you’re an AID user willing to share a week of anonymised Nightscout data for a larger follow-up (15-30 users), get in touch. Particularly interested in Loop and CamAPS users, and anyone with ISF below 40 or above 150. tim@diabettech.com.

Utilising AI APIs is also not a cheap endeavour, so if anyone would like to help fund a larger study, contributions can be made at Stripe.

Tim Street. Independent researcher, AAPS/Boost developer, Diabettech. No vendor funding. API costs ~$260. Claude Code (Anthropic) used for analysis and drafting; two tested models are Anthropic products. Sonnet showed the most data adaptation on some metrics; no model was adequate for clinical use without context provision and bounds checking. All code and data available for independent verification.

Discover more from Diabettech - Diabetes and Technology

Subscribe to get the latest posts sent to your email.

Diabettech - Diabetes and Technology

Where Diabetes and Technology meet

Five AI models, three users, one finding: the settings came from the textbook, not the data

The setup

They all recommended the same thing

Tell them your settings and the numbers improve. But not for the reason you’d hope.

The citations look professional. They’re mostly wrong.

Spending more doesn’t help

What the models do well

So what do you do with this?

The study

Like this:

Related

Discover more from Diabettech - Diabetes and Technology

Be the first to comment

Leave a Reply Cancel reply

The setup

They all recommended the same thing

Tell them your settings and the numbers improve. But not for the reason you’d hope.

The citations look professional. They’re mostly wrong.

Spending more doesn’t help

What the models do well

So what do you do with this?

The study

Share this:

Like this:

Related

Discover more from Diabettech - Diabetes and Technology

Be the first to comment

Leave a Reply Cancel reply