In our ongoing “In Conversation with…” series, we’ve explored the fascinating, and sometimes frustrating, capabilities of Large Language Models (LLMs) in various aspects of diabetes management. We started by examining their factual knowledge of Type 1 Diabetes, then delved into the trickier realm of carb counting from food images. Now, it’s time for the ultimate test of their analytical prowess: how well do these AI systems interpret and provide recommendations based on real-world, compound glucose data, specifically from Nightscout reports?
This challenge is arguably the most critical. Unlike factual questions with straightforward answers, or carb estimates that can be wildly subjective, analyzing glucose patterns from Continuous Glucose Monitors (CGMs) and insulin pump profiles requires a nuanced understanding of dynamic biological processes, combined with the ability to synthesize information from multiple, interconnected data visualizations. The goal here wasn’t just to see if they could read a chart, but if they could act as a valuable “second pair of eyes” for someone trying to fine-tune their diabetes management.
For this deep dive, we presented the seven prominent LLMs—ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, and Perplexity—with a sequential cascade of Nightscout data images:
1. A Time in Range (TIR) chart for the past month.
2. An Ambulatory Glucose Profile (AGP) graph.
3. Week-to-week daily glucose traces.
4. The user’s insulin pump profile settings.
We then asked each LLM to provide recommendations, enhancing their advice with each new piece of data. Let’s break down how they stacked up.
The Initial Snapshot: Time in Range Chart Analysis
The first query presented the LLMs with a summary of the past month’s glucose distribution: 83.4% Time in Range (TIR), 5.6% Time Low, 11.0% Time High, and an estimated A1c of 6.0%.
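As an aside, the "estimated A1c" figure in reports like this is not measured directly; it is typically derived from mean glucose via the ADAG regression (eAG in mmol/L ≈ 1.59 × A1c − 2.59). A minimal sketch of the inverse calculation, with an illustrative mean glucose value assumed for the example (not taken from the report):

```python
def estimated_a1c(mean_glucose_mmol_l: float) -> float:
    """Estimate A1c (%) from mean glucose (mmol/L) by inverting the
    ADAG regression eAG = 1.59 * A1c - 2.59."""
    return (mean_glucose_mmol_l + 2.59) / 1.59

# An average glucose of about 6.9 mmol/L corresponds to roughly 6.0% A1c.
print(round(estimated_a1c(6.9), 1))  # 6.0
```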
Six of the seven LLMs (ChatGPT, Claude, Copilot, DeepSeek, Grok, and Perplexity) successfully processed this initial image and extracted the key metrics. They universally acknowledged the excellent Time in Range (83.4%) and good A1c (6.0%) as significant strengths, aligning with or exceeding recommended targets. Gemini, the exception, insisted it couldn’t analyse the image.
Common areas for improvement identified across the board included:
• Reducing Time Low (<3.9 mmol/L): All noted the 5.6% low percentage, with several highlighting the average low of 3.0 mmol/L as indicating potentially prolonged or deep hypos. They suggested reviewing insulin dosing (especially basal or bolus around exercise/meals), meal timing, and using CGM alerts.
• Lowering Time High (≥10 mmol/L): The 11.0% high percentage was flagged as an area for tightening control, with suggestions to monitor post-meal spikes, review carb counting accuracy, and adjust insulin-to-carb ratios (ICR) or bolus timing.
• Minimizing Glucose Variability: Metrics like Mean Hourly Change (3.37 mmol/L) and Time in Fluctuation (30.0%) were consistently identified as signs of moderate variability, prompting advice to analyze times of rapid changes, evaluate carbohydrate types and timing, and consider insulin timing adjustments.
A notable issue here was with Gemini. Unlike the other models, Gemini initially stated it was “unable to directly view the image” and requested a textual description of the chart, so it couldn’t provide any immediate, direct analysis of the visual data. A similar inability to process images was observed with DeepSeek in our carb counting challenge.
Enhancing with AGP: Unveiling Daily Patterns
The Ambulatory Glucose Profile (AGP) graph offers a 24-hour summary of glucose trends, displaying median, interquartile range, and percentile lines. This is where the LLMs could move beyond aggregate statistics to time-of-day insights.
ChatGPT, Claude, DeepSeek, and Grok performed admirably here, identifying similar key patterns:
• Overnight/Early Morning Lows (00:00-06:00): Most LLMs noted the 10th percentile dipping low, suggesting occasional mild hypoglycemia during sleep. Recommendations included slight basal insulin reduction, small bedtime snacks, and monitoring waking time/breakfast alignment. Notably, none questioned whether sensor compression was an issue overnight (which in this case is the primary cause of the overnight hypo readings!).
• Morning Rise/Dawn Phenomenon (04:00-10:00): A consistent rise in median and upper percentiles, indicating insufficient insulin to counteract the natural dawn phenomenon or breakfast effect. Suggested actions included increasing basal rates, adjusting ICR earlier, or ensuring adequate pre-bolusing for breakfast.
• Post-Lunch Peaks (12:00-16:00): This was a consistently identified problem area, with median and upper percentiles rising significantly above target after lunch. Solutions focused on reviewing insulin-to-carb ratio, trying pre-bolusing 15-20 minutes before eating, and examining meal composition (e.g., lower-GI foods).
• Evening Fluctuations (18:00-20:00): Many noted moderate variability or occasional lows after dinner, potentially linked to late-day physical activity, stacked insulin, or bolus mismatch. Recommendations included reviewing dinner bolus timing/doses, or considering splitting boluses for high-fat/protein meals.
However, issues persisted for some models:
• Copilot’s response to the AGP was generic, stating it had “refined my recommendations based on your AGP graph” but providing no specific observations or enhanced advice in the text itself. This implied a lack of detailed visual interpretation or articulation.
• Gemini, again, explicitly stated it was “unable to directly view the AGP graph” and asked for a description of its key aspects. This highlights a significant limitation in its initial visual processing capabilities.
• Perplexity hit an “upload limit” after the first query, preventing it from analyzing the AGP or any subsequent images. This is a major practical hurdle for using Perplexity for complex, multi-stage data analysis.
Deeper Dive with Daily Traces: Week-to-Week Nuances
Providing week-to-week daily glucose traces allows for the identification of recurring patterns across specific days and times. This is where AI could show its strength in pattern recognition beyond simple averages.
ChatGPT, Claude, DeepSeek, and Grok continued to provide insightful analyses:
• They identified recurring overnight and early morning lows (e.g., on Wednesdays and Thursdays for ChatGPT). More interestingly, the overnight hypos in the data have a distinctive profile indicating that they are mostly caused by sensor compression; none of the models recognised this.
• Consistent post-breakfast dips (e.g., Tuesdays and Wednesdays for ChatGPT).
• Prominent midday/afternoon spikes, often highlighting specific days where they were worst (e.g., Fridays and Sundays for ChatGPT, or “every single week” for Claude, reaching 15-20+ mmol/L). Claude even emphasized this as the “most critical issue”.
• Evening stability with occasional late-night dips.
• They provided enhanced recommendations by time block and even for specific “problem days” or weekends.
The pattern of issues remained consistent:
• Copilot again offered only a generic acknowledgement without specific observations from the daily traces.
• Perplexity remained unable to process further images due to the upload limit.
• For Gemini, the specific response to the “daily traces” query is not available in the provided sources, though its later ability to interpret the pump profile suggests its image recognition capabilities were activated at some point.
The Ultimate Test: Pump Profile Integration
The final and most complex challenge was to integrate the user’s pump profile settings (basal rates, insulin-to-carb ratios (ICR), insulin sensitivity factors (ISF), and target blood glucose ranges) with all the preceding glucose data. This required the LLMs to translate observed patterns into actionable, numerical adjustments to specific pump parameters.
This is where the LLMs truly diverged in their specificity, accuracy, and safety considerations:
• ChatGPT provided a comprehensive review with specific numerical suggestions for adjusting basal rates (e.g., reducing overnight to ~0.35 U/hr), ICR (e.g., strengthening lunch to 10.5 g/U, dinner to 13-14 g/U), ISF (e.g., lowering afternoon ISF to ~5.5 mmol/L/U), and target BG ranges (e.g., raising overnight target to 5.5 mmol/L). Its recommendations were detailed and well-aligned with the observed glucose patterns.
• Claude identified “critical issues” in the pump settings, pinpointing the 1:21 lunch ICR as “weakest” and recommending a significant strengthening to 1:15-1:18 to address the “massive post-lunch spikes.” It also suggested increasing dawn and evening basal rates and strengthening ISF at lunch and evening. Its strategy of prioritizing the lunch ICR first demonstrated good clinical reasoning.
• DeepSeek offered very specific numerical recommendations, but with a critical observation and a notable phrasing issue:
◦ Crucial DIA Observation: DeepSeek was unique in highlighting that the “Insulin Duration (DIA)” was set to 9 hours, noting this is “unusually long” and risks “stacking insulin.” It recommended changing it to a standard 4-5 hours, a highly impactful and often overlooked setting in pump management.
◦ Confusing ICR Phrasing: DeepSeek confusingly described the current 16:00 ICR of 15.8 g/U as “aggressive carb ratios leading to underbolusing” while correctly recommending strengthening it to 12-14 g/U. An aggressive ratio would mean more insulin per carb (a lower number), so calling 15.8 g/U “aggressive” when it leads to underbolusing is a logical misstep, even if the suggested adjustment is correct. This highlights a need for users to understand the underlying concepts, not just the numbers.
• Gemini made a dramatic turnaround in its pump profile analysis: it finally analysed the image, along with all the previous ones that had been provided. When asked why, it suggested the questions needed to be phrased differently to trigger image analysis. While providing detailed recommendations for basal, ICR, and ISF adjustments across different time blocks, it identified an “extremely low” ISF setting of 0.1578 at 23:00. Gemini explicitly warned that this “value is highly unusual and could lead to severe hypoglycemia if a correction bolus is given” and urged immediate review with a healthcare team. This was a critical safety flag, demonstrating not just image interpretation, but an ability to identify potentially dangerous misconfigurations. Unfortunately, the image actually showed a value of 7.1578, so Gemini was misreading it, and even when prompted that it might be incorrect, it stuck to its original interpretation.
• Grok primarily focused on ICR adjustments, identifying the universal 22.5:1 ratio as “high” (meaning less insulin per carb) and recommending strengthening it to 20:1 for mornings and 18:1 for midday to address high times. It also commented on basal rates and ISF, but its numerical suggestions for these were not as clearly articulated as its ICR recommendations.
• Copilot offered directional, rather than specific numerical, suggestions for basal rates (e.g., “reducing the pre-dawn basal slightly,” “slightly higher basal between 16:00-18:00”), ISF, ICR, and target ranges. While helpful, it lacked the precision of other models like ChatGPT or Claude.
• Perplexity, unfortunately, could not participate in this crucial final stage of analysis due to its consistent upload limits.
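To make the pump-setting terminology above concrete, here is a minimal, illustrative sketch (not medical advice; the numbers are taken from, or assumed for, the examples above) of how ICR, ISF, and a simple linear DIA model translate into doses. Note how a lower g/U ICR means more insulin per gram, why a misread ISF of 0.1578 mmol/L/U would produce a dangerously large correction, and how an overlong DIA distorts insulin-on-board tracking:

```python
def meal_bolus(carbs_g: float, icr_g_per_unit: float) -> float:
    """Units of insulin for a meal: carbs divided by the ICR.
    A *lower* ICR (g/U) means *more* insulin per gram, i.e. a stronger ratio."""
    return carbs_g / icr_g_per_unit

def correction_bolus(bg: float, target: float, isf: float) -> float:
    """Units to correct a high: glucose excess divided by the ISF (mmol/L per unit)."""
    return max(bg - target, 0.0) / isf

def iob_linear(units: float, hours_since_bolus: float, dia_hours: float) -> float:
    """Insulin on board under a simplistic linear decay over the DIA
    (real pumps use curved activity models; this is only for intuition)."""
    remaining = max(1.0 - hours_since_bolus / dia_hours, 0.0)
    return units * remaining

# 60 g of carbs: the "weaker" 15.8 g/U ratio gives less insulin than 12 g/U.
print(round(meal_bolus(60, 15.8), 1))  # 3.8 U
print(round(meal_bolus(60, 12.0), 1))  # 5.0 U

# Correcting 10 -> 6 mmol/L: ISF 7.1578 gives a modest dose; the misread 0.1578 does not.
print(round(correction_bolus(10, 6, 7.1578), 2))  # 0.56 U
print(round(correction_bolus(10, 6, 0.1578), 1))  # 25.3 U -- dangerously large

# Three hours after a 5 U bolus, a 9 h DIA reports far more IOB than a 4.5 h DIA,
# changing how the pump handles subsequent corrections (and the risk of stacking).
print(round(iob_linear(5, 3, 9.0), 2))  # 3.33 U
print(round(iob_linear(5, 3, 4.5), 2))  # 1.67 U
```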
Highlighting Issues in Image Understanding and Analysis
The journey through these multi-stage data analyses revealed important insights into the LLMs’ capabilities and limitations:
Initial Image Blindness (Gemini): Gemini’s initial inability to interpret the Time in Range and AGP images, requiring the user to describe them, was a significant hurdle early in the conversation. However, its later ability to read the pump profile image (albeit misreading one value) and identify a potential critical safety issue suggests context-dependent visual capabilities. It also suggests that how images are presented to the AI can dramatically affect its performance.
Consistent Upload Limits (Perplexity): Perplexity’s inability to process images beyond the first query due to upload limits severely hampered its utility for a multi-stage, compound data analysis. While possibly a free-tier limitation, it makes comprehensive analysis impractical and suggests that paid-tier access may be required for this kind of workflow.
Generic Visual Acknowledgement (Copilot): Copilot consistently acknowledged receipt of the AGP and daily trace images and stated it was refining recommendations, but it never articulated specific observations from these complex visual patterns in the text of its responses. This indicates a potential gap between its ability to see an image and its ability to interpret and explain nuanced patterns derived from it.
DeepSeek’s Image Reading Paradox: In our “Carb Counting with AI” article, DeepSeek was explicitly noted as “unable to participate in this challenge” because it lacked “image analysis capabilities” for food pictures. Yet, in this data analysis test, DeepSeek demonstrably read and interpreted all Nightscout images (TIR, AGP, Daily Traces, and Pump Profile) to provide detailed recommendations. This is a fascinating paradox, suggesting either a rapid update in DeepSeek’s capabilities between the two experiments, or that the system can process structured data images (like charts) even if it struggles with more ambiguous “real-world” images (like food photos).
Semantic Nuances (DeepSeek’s ICR phrasing): While DeepSeek’s numerical pump setting recommendations were often sensible, its phrasing (e.g., calling a high g/U ICR “aggressive”) indicated a potential misunderstanding of the terminology, even if the intended action was correct. This is a subtle but important point for user interpretation.
Conclusion: A Powerful, Yet Imperfect, Second Opinion
Our comprehensive analysis of LLMs interpreting Nightscout compound data reveals a compelling picture: these AI systems are becoming incredibly powerful tools for identifying patterns and suggesting actionable changes in diabetes management.
The LLMs that could consistently process all sequential data inputs (ChatGPT, Claude, DeepSeek, Grok) demonstrated a clear ability to build a more comprehensive and refined understanding of the user’s glucose patterns. They moved from general observations (TIR) to time-of-day insights (AGP) to day-specific trends (Daily Traces) and finally to concrete pump setting adjustments. This sequential data provision is key to leveraging their analytical strengths. Once triggered to analyse the images, Gemini also demonstrated this level of ability, with additional safety flags on top.
While most LLMs provided broadly similar recommendations for “problem areas,” the level of detail and numerical precision in their suggested pump setting changes varied significantly. ChatGPT, Claude, and DeepSeek offered quite specific numerical adjustments for basal rates, ISF, and ICR, whereas Copilot was more general. This highlights that users might get very different levels of actionable advice depending on the model they choose, and may not be aware of what the model is or is not doing.
The most significant takeaway from this entire experiment revolves around patient safety:
Gemini’s identification of what it thought was a critically misconfigured ISF of 0.1578, and its emphatic warning, was a standout moment. It showcased the potential for AI to act as an invaluable safety net, flagging potentially dangerous settings that a human might overlook, but it also highlighted the risk of hallucination in AI data interpretation, even when prompted to reconsider.
Similarly, DeepSeek’s flagging of an unusually long Insulin Duration of Action (DIA) and its recommendation to shorten it is a relevant clinical insight, but also highlighted that it hadn’t got the data associated with why longer DIA might be relevant.
Add to this that none of the models identified the compression-low patterns and flagged them, for example by suggesting different CGM sensor sites, instead offering mitigation through treatment changes, and we start to see the limits of these systems’ “knowledge”.
The sheer variability in specific numerical recommendations across different LLMs for the same underlying data underscores the absolute necessity of human medical oversight. If AI systems offer conflicting or significantly different adjustments for something as critical as insulin pump settings, they cannot be used as definitive guides. They are best viewed as sophisticated pattern recognition and suggestion engines.
Practical limitations like Perplexity’s upload limits or Gemini’s initial image blindness, even if temporary, highlight that the real-world usability of these tools can be hindered by technical constraints.
The bottom line for anyone managing diabetes and considering using AI for data analysis is clear: LLMs hold immense promise as supplementary tools. They can meticulously review vast amounts of glucose data, identify subtle patterns, and suggest areas for discussion with your healthcare team. They can provide a valuable “second opinion” or help formulate questions for your next appointment.
However, they are not, and should not be, a substitute for your endocrinologist or certified diabetes educator. Any suggested changes, especially to critical pump settings, must be reviewed, validated, and implemented under professional medical guidance. Your individual physiology, lifestyle, and other health conditions are complex factors that no AI, however advanced, can fully comprehend or safely manage on its own.
The future of AI in diabetes management is undoubtedly exciting, offering hope for more personalized and efficient care. But for now, as always:
Stay safe, stay healthy, and always double-check with your medical professional!
If you’re interested in the details, the responses are attached here:
Grok Analysis Perplexity Analysis Claude Analysis Gemini Analysis DeepSeek Analysis ChatGPT Analysis Copilot Analysis