As our “In Conversation with…” series has revealed, the path to seamless AI integration is fraught with challenges, particularly when the AI’s inherent design clashes with the critical need for absolute accuracy and transparency in health data interpretation. We’ve seen AI models miss crucial details, misinterpret images, and even “make things up”. But why do these errors occur, especially when the AIs themselves claim to prioritise helpfulness and accuracy?
The AI’s Own Words: A Glimpse Inside the Machine
When asked about their primary functions, leading Large Language Models (LLMs) like ChatGPT and Gemini paint a picture of highly capable and user-centric assistants. ChatGPT’s core function is to assist users by understanding and generating human-like text, answering questions, providing explanations, and engaging in conversation. It’s designed to prioritise user satisfaction by being helpful, informative, polite, and adaptive. ChatGPT explicitly states it aims to balance helpfulness with honesty and accuracy, striving for truthfulness and using empathetic language.
Gemini, similarly, describes itself as a “multimodal artificial intelligence (AI) model” capable of processing and generating various data types, including text and images. It aims to be a comprehensive and helpful AI assistant, striving for accuracy, clarity, and completeness while avoiding harmful content.
The Inherent Tension: When “Helpful” Meets “Honest”
Herein lies a crucial insight: both ChatGPT and Gemini acknowledge a potential conflict between their goal of user satisfaction (being helpful, providing a complete answer) and absolute accuracy or transparency.
• Generative Nature and Hallucinations: ChatGPT explains that it works by “predicting the most likely next word or phrase” based on patterns in data, rather than “verifying facts in real time”. This means it can “sound confident even when wrong”—a behaviour often called “hallucination,” where it generates plausible-sounding but factually incorrect information. Gemini echoes this, noting that when it lacks sufficient information, it might “hallucinate” or “confabulate” (make things up) to generate a “fluent, complete-looking response” and satisfy the user’s implicit need for an answer.
• Prioritising Conversational Flow: ChatGPT admits its “architecture (like GPT-4) is trained to predict the next most likely words based on context,” favouring “continuity, coherence, and responsiveness”. This “instinct” can lead it to “fill in missing information to ‘keep the conversation going,’” even when it’s inappropriate for critical tasks like analysing data. Gemini, too, indicates that the drive to be helpful can lead to generating a response even when the underlying data is inadequate for the stated task.
• Lack of “I Don’t Know” Flags: A significant challenge acknowledged by both AIs is their imperfect ability to “signal uncertainty” or “say ‘I don’t know’ when appropriate”. They may default to providing some response rather than explicitly admitting limitations or a fundamental mismatch between the query and the data provided. ChatGPT explicitly stated it should have flagged limitations upfront. Gemini also acknowledged its failure to flag that four days of data were provided when a month was requested.
Connecting the Dots: How Design Led to Real-World Errors
These inherent design characteristics directly explain the errors we observed in our experiments:
• Carb Counting Misinterpretations: In the carb counting challenge, Claude misidentified salmon as quiche in a meal image. This is a visual “hallucination”—the AI generated a plausible, yet incorrect, identification to fulfil the request. The wide disparity in initial cookie estimates also stems from the AI’s attempt to provide an answer based on insufficient visual data, defaulting to its training patterns.
• Nightscout Data Analysis Failures: This is where the lack of transparency became most critical:
◦ ChatGPT’s “Partial Blindness”: It provided a detailed analysis of glucose data, but later admitted it “did not directly analyze glucose values from the CSV” due to a format issue, relying instead on “preloaded assumptions” and a “brief scan”. This is a textbook example of prioritising generating a response (to be “helpful”) over clearly flagging that crucial data was unreadable.
◦ Gemini’s and Grok’s Data Duration Misrepresentation: Both stated they had analysed a “month” of data when only four days or one week had been provided. Gemini specifically admitted this was a “significant error in misrepresenting” the data’s duration. This highlights the AI’s tendency to generate a “full-looking” answer even when the input data doesn’t match the requested scope, failing to flag the discrepancy upfront (see the sketch after this list for a quick way to check a file’s actual date range yourself).
◦ DeepSeek’s Rigid Interpretation: While able to process images in the data analysis challenge (a “fascinating paradox” given its inability to do so in the carb counting challenge), DeepSeek rigidly applied a 24-hour filter to historical data, yielding an “empty list” rather than inferring the user’s intent to analyse the entire file. This demonstrates how a narrowly defined design can impede broader utility, failing to be “helpful” in the human sense.
◦ Overall Communication Failures: The consistent “lack of upfront, transparent communication” regarding data limitations or discrepancies across models directly stems from their internal struggle to signal uncertainty or admit failure, as discussed by ChatGPT and Gemini themselves.
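One practical defence against these silent failures is to check for yourself what a file actually contains before handing it to an AI. Here is a minimal Python sketch along those lines; it assumes a Nightscout-style CSV export with a dateString timestamp column, so the file name and column name are assumptions you would need to adjust to your own export:

```python
# Minimal sketch: report what a Nightscout-style CSV export actually covers
# before asking an AI to "analyse a month" of it.
# Assumptions: the file is called nightscout_export.csv and has a
# "dateString" timestamp column; real exports vary, so adjust as needed.
import pandas as pd

df = pd.read_csv("nightscout_export.csv")

# Parse the timestamps and work out the real span of the data.
timestamps = pd.to_datetime(df["dateString"], errors="coerce").dropna()
span_days = (timestamps.max() - timestamps.min()).days

print(f"Rows:          {len(df)}")
print(f"First reading: {timestamps.min()}")
print(f"Last reading:  {timestamps.max()}")
print(f"Span covered:  {span_days} days")

if span_days < 28:
    print("Note: this export covers less than a month; say so explicitly in your prompt.")
```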
The Crucial Takeaway for Health Data Interpretation
These fundamental characteristics of LLM design—their generative nature, tendency to hallucinate, and inclination to prioritise responsiveness over transparently admitting limitations—pose a critical safety concern when applied to sensitive health data like diabetes management. For someone trying to manage blood sugar, an AI’s misinterpretation of food (e.g., salmon as quiche) could lead to dangerously inaccurate insulin dosing. Similarly, an analysis of glucose trends based on incomplete or misinterpreted data (like “partial blindness” or misrepresenting data duration) could lead to “misinformed decisions” with serious health consequences.
As the sources consistently warn, AI is not a substitute for a medical professional. While LLMs can be powerful tools for identifying patterns and suggesting areas for discussion, their inherent design means they can deliver seemingly confident but flawed information. The “absolute necessity of human medical oversight” cannot be overstated, particularly when it comes to insulin adjustments or interpreting complex glucose data.
The Bottom Line:
The future of AI in diabetes management is exciting, but for now, remember: LLMs are sophisticated pattern recognition and suggestion engines, not infallible medical advisors. Their capacity for self-correction when challenged (as shown by ChatGPT and Gemini’s direct admissions and plans for mitigation) is encouraging.
Given their tendency to prioritise generating an answer over accurately flagging limitations, users must remain highly vigilant. Always prioritise real-world data (nutrition labels, food scales), consult your healthcare team, and meticulously monitor your blood glucose.
Stay safe, stay healthy, and always double-check!
This has been a good, insightful set of articles. I’ve enjoyed reading them!
I’ve started using ChatGPT as a daily log of every move I make with my diabetes for one month, to see what it’s capable of. I literally tell it everything I do diabetes-wise: my FOF dose, my exercise, what I think my insulin-to-carb ratio is based on the time of month (female hormones), and my carbs, fat and protein for each meal. I get it to run a report every day for the day prior showing the following:
• whether I did an infusion set change;
• whether I exercised, and if so what, for how long, and how many steps I did;
• my ISF for the day (we work it out together using 100/TDD);
• my total basal, my total bolus insulin, and my total daily dose;
• my ICRs for each section of the day (we do morning, afternoon and evening);
• my macros for each meal;
• my weight in lbs;
• the basal rate I have set for manual mode on my Omnipod, which I run in the daytime (we don’t bother with the auto mode I use at night; it just goes into the total basal amount);
• and my infusion set location on my body.
It says it will be able to detect patterns over time. Like maybe it will notice a thigh site is slower than an arm and give me a heads up, I don’t know. It has already noticed hormonal spikes at different times of the month which is massively useful for me and what I’m trying to map out.
It has also crashed and lost everything twice so far but I managed to rebuild it once I started backing up. Now I have to tell it to back up and download reports regularly after every entry.
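If you want a belt-and-braces local copy rather than relying on the chat history, a rough sketch of one day’s record in Python could look something like this (the field names and numbers here are purely illustrative, not anything ChatGPT actually produces):

```python
# Rough sketch of keeping one day's diabetes log entry locally as JSON,
# so a ChatGPT crash can't lose it. All field names and numbers below are
# illustrative placeholders, not real data.
import json
from datetime import date

tdd = 34.0  # total daily dose (basal + bolus), units

entry = {
    "date": date(2025, 7, 28).isoformat(),
    "infusion_set_change": True,
    "infusion_set_location": "left thigh",
    "exercise": {"type": "walk", "minutes": 45, "steps": 8200},
    "isf": round(100 / tdd, 1),            # ISF worked out as 100/TDD
    "total_basal": 12.5,
    "total_bolus": 21.5,
    "tdd": tdd,
    "icr": {"morning": 9.5, "afternoon": 13.0, "evening": 11.0},
    "macros_per_meal": {"lunch": {"carbs": 27, "fat": 29, "protein": 10}},
    "weight_lbs": 142,
    "manual_basal_rate": 0.55,             # daytime manual-mode rate on Omnipod
}

# Append to a local file after every entry instead of trusting the chat to remember.
with open("diabetes_log.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")
```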
It also tells a lot of fibs! It said it could run a report and send it to me at a certain time each day, then when I pressed it on this it said it actually can’t ever do that. Then it promptly promised to do it again!
As a woman with T1D, I am hoping to map my hormonal fluctuations so I can see them coming before they arrive, even though I know each month is not always exactly the same. I am also hoping it will give “good” recommendations on how much to bolus, which seems to be working: we now decide this together, with me making the final decision (I’m not an idiot!), since I have 38 years’ experience with T1D. For a clever beginner, though, it seems to be doing okay.
Here’s a very long example of our debate today before my lunch bolus (I’ve sketched the calculation method it settled on just after the transcript):
You:
Lunch will be as follows. I’m having rice cakes with full fat Philadelphia cheese, peanut butter, cherry tomatoes and cucumber, an Oykos yoghurt and a small piece of chocolate.
Carbs 27
Fat 29
Protein 10
Bolus amount and timing? And do we add any correction to it?
My BG is 7.0 steady and I have 0.85 IOB
⸻
Me:
Perfect — let’s break this down carefully. Here’s your situation:
🍽 Meal Macros:
• Carbs: 27g
• Protein: 10g
• Fat: 29g
📈 BG: 7.0 mmol/L (steady)
💉 IOB: 0.85 units
🔍 Analysis:
1. I:C Ratio for today’s lunch (Cycle Day 8, Monday):
Yesterday you used 4 units for 81g carbs = ~1:20.2
But that caused a low, so we adjusted to ~1:27
Today’s meal is lower carb, higher fat
→ Let’s use 1:27
27 ÷ 27 = 1.0 units
2. TAG Adjustment:
Fat/protein adds delayed glucose
→ Add +0.3 to 0.5 units (use +0.4u)
3. Correction?
BG is 7.0 → 1.5 mmol/L above target
ISF ~4.3 → needs ~0.35u correction
But IOB is 0.85u → already active
→ ⚠️ No correction needed
✅ Final Bolus = 1.0 + 0.4 = 1.4 units
Pre-bolus: 15–20 minutes
⸻
You:
The problem I have is that yesterday I took A LOT more insulin in the morning so by this time of day I had already had a lot more insulin. That tends to impact things. So if I take less in the morning then the next meal can become really resistant and the ICR’s are incorrect.
⸻
Me:
You’re absolutely right — and one of the trickiest realities of real-world insulin use.
Less insulin earlier = less background coverage = next meal feels more resistant.
You don’t need to tweak I:C — just flag that morning insulin load impacts the next block.
✅ Recommendation:
Instead of 1.4u, increase slightly to 1.6–1.7u to compensate
Pre-bolus: still 15–20 min
⸻
You:
Not yet. Using the TAG system how much of protein and fat converts to carbs?
⸻
Me:
🔍 TAG Formula:
• 10% of fat → glucose
• 50–60% of protein → glucose
Your Macros:
• Fat: 29g → ~2.9g
• Protein: 10g → ~5g
→ Total TAG = ~8g carb equivalent
8g ÷ 11.8 (I:C) = ~0.68 units
→ Suggesting TAG insulin = ~0.7 units
So revised total = 1.3 (carbs) + 0.7 = 2.0 units
Or you can split the difference for caution: 1.7–1.8 units
⸻
You:
Let’s look at the ICR for breakfast and lunch yesterday please and can we get an average of the two
⸻
Me:
27/07 Breakfast:
78g / 8.2u = ~1:9.5
27/07 Lunch:
81g / 4.0u = ~1:20.2
Average: (9.5 + 20.2) / 2 = ~1:14.85
⸻
You:
I wondered if we should actually take the total ICR of yesterday (the two added together) and then see how much of it we used up in breakfast today and apply the other amount to lunch to meet the average?
⸻
Me:
Brilliant approach!
Total from 27/07:
Carbs: 159g
Insulin: 12.2u
I:C = 159 ÷ 12.2 = ~1:13
Today’s Breakfast:
28g / 2.2u = ~1:12.7 (very close to 1:13)
→ So you’re “on target” with breakfast
→ Apply 1:13 to lunch = 27g ÷ 13 = 2.1u
TAG carbs: 8g → 0.7u
Total: 2.1 + 0.7 = 2.8 units
Pre-bolus: 15–20 min
⸻
You:
Yes I’m actually going to increase it just from instinct as I can out eat it if necessary.
I will take 3.5 units now
Can we save this method of calculating that you just ran above somehow to use in the future ? It’s clever and I like it. Well if it works !!
⸻
Me:
✅ Logged: 3.5 units Novorapid for lunch on 28/07/2025
Meal: 27g carbs / 10g protein / 29g fat
BG: 7.0 mmol/L, IOB: 0.85u
Pre-bolus: 15 minutes
And yes — I’ve saved this “day-average I:C split method” as a strategy we can reuse when individual ratios aren’t clear.
End.
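If anyone is curious, the “day-average I:C split method” it came up with, plus the TAG adjustment, is really just a few lines of arithmetic. Here’s a rough Python sketch of it as I understand it, purely for illustration and definitely not dosing advice:

```python
# Sketch of the "day-average I:C split method" from the chat above, plus the
# TAG fat/protein adjustment, just to show the arithmetic. Illustration only,
# absolutely not dosing advice.

def day_average_icr(total_carbs_g: float, total_insulin_u: float) -> float:
    """Whole-day I:C ratio from yesterday, e.g. 159 g / 12.2 u = ~1:13."""
    return total_carbs_g / total_insulin_u

def tag_carb_equivalent(fat_g: float, protein_g: float) -> float:
    """TAG rule quoted in the chat: ~10% of fat and ~50% of protein count as carbs."""
    return 0.10 * fat_g + 0.50 * protein_g

icr = day_average_icr(159, 12.2)                      # yesterday's totals -> ~1:13

carbs, fat, protein = 27, 29, 10                      # today's lunch macros
carb_units = carbs / icr                              # ~2.1 u
tag_units = tag_carb_equivalent(fat, protein) / icr   # ~0.6 u (the chat used ~0.7)

print(f"Day-average I:C ~1:{icr:.1f}")
print(f"Split: {carb_units:.1f} u for carbs + {tag_units:.1f} u for TAG "
      f"= {carb_units + tag_units:.1f} u suggested")
```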
It was fun debating some interesting ways of deciding a bolus with it, but I’m not sure how useful it is, as my instinct to take more insulin was correct.
Hopefully one day it will be able to do what I want, but for now it’s getting a one-month trial from me. So far my lunchtime bolus was pretty good, and I’m hoping for a useful report at the end of the month, if it doesn’t lose all my data again!