In conversation with… File uploads and interpretation. Where the wheels fall off!

For those of us navigating the complex world of diabetes with tools like Nightscout and Automated Insulin Delivery systems, the idea of an AI assistant instantly sifting through our continuous glucose monitor (CGM) data, insulin doses, and trends, then making recommendations, sounds like a dream.

But as with any powerful technology, the devil is in the details – and sometimes, in the omissions. In this final part of the “In conversation with…” experiment, we asked the LLMs to analyse real Nightscout data and provide feedback, areas of concern, and recommendations. The results were a fascinating, and at times alarming, glimpse into the current state of standard AI models in health data analysis, highlighting both immense potential and critical shortcomings. This article dives into the common capabilities, the significant differences, and, most importantly, the major issues and subsequent responses we encountered.

The Foundation: What LLMs Agree On (and Can Do)

Across the board, the LLMs demonstrated a foundational understanding of Nightscout data and the common metrics crucial for diabetes management. When asked if they could analyse Nightscout details, all active models confirmed their ability to do so.

A primary hurdle encountered by all LLMs was their inability to directly access Nightscout data via a URL, even with tokens, citing privacy and security safeguards and, in some cases, robots.txt restrictions. This is a crucial safety feature, ensuring our sensitive health data isn’t directly exposed to external AI systems without explicit file sharing. On the other hand, it presents a bit of a conundrum: the very systems we’re asking to analyse our data have had limited exposure to real Nightscout data to learn from.

This lack of access necessitated manual data export by the user, typically as CSV or JSON files. Perplexity, notably, ran into “upload limits” early on, which prevented it from taking part in the multi-stage data analysis experiments and effectively limited its usefulness for comprehensive tasks.
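For readers who haven’t done an export before, the code below is a minimal sketch of pulling recent CGM entries straight from your own Nightscout site, assuming the standard /api/v1/entries endpoint and a read token; the site URL, token and entry count are placeholders, not values from this experiment.

```python
# Minimal sketch: download recent CGM entries from your own Nightscout site.
# Assumes the standard /api/v1/entries endpoint and a read token; the URL and
# token below are placeholders and must be replaced with your own values.
import requests

NIGHTSCOUT_URL = "https://your-site.example.com"  # placeholder
TOKEN = "your-read-token"                         # placeholder

# Roughly four days of 5-minute CGM readings is ~1150 entries.
resp = requests.get(
    f"{NIGHTSCOUT_URL}/api/v1/entries.json",
    params={"count": 1152, "token": TOKEN},
    timeout=30,
)
resp.raise_for_status()
entries = resp.json()

print(f"Downloaded {len(entries)} entries")
print(entries[0])  # entries typically include 'sgv' (mg/dL) and 'dateString' fields
```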

As we saw in the image analysis challenge, common areas of analysis and feedback universally offered included:

• Summarising Time-in-Range (TIR) metrics

• Highlighting hypo/hyperglycemia trends

• Identifying patterns by time of day or day of week

• Suggesting questions for healthcare providers

Once direct access had been ruled out, most LLMs preferred or supported CSV and JSON formats for data export, with CSV often being the recommended choice. They also showed an understanding of the components involved, such as insulin sensitivity factors (ISF), insulin-to-carb ratios (ICR), and basal rates. This gives the impression that these models are well-equipped to assist users in understanding their diabetes data.
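To ground those metrics, here is a minimal sketch of the simplest summary on that list: Time-in-Range and hypo/hyper percentages computed from an exported entries CSV. The column names ('sgv' for glucose in mg/dL, 'dateString' for timestamps) are assumptions about the export format, so check them against your own file first.

```python
# Minimal sketch: compute Time-in-Range and hypo/hyper percentages from an
# exported Nightscout entries CSV. Assumes columns named 'sgv' (mg/dL) and
# 'dateString'; real exports may use different headers, so check yours first.
import pandas as pd

df = pd.read_csv("entries.csv", parse_dates=["dateString"])
glucose = df["sgv"].dropna()

low, high = 70, 180  # mg/dL target range
tir = glucose.between(low, high).mean() * 100
below = (glucose < low).mean() * 100
above = (glucose > high).mean() * 100

print(f"Readings analysed: {len(glucose)}")
print(f"Data spans {df['dateString'].min()} to {df['dateString'].max()}")
print(f"Time in range ({low}-{high} mg/dL): {tir:.1f}%")
print(f"Below range: {below:.1f}%  |  Above range: {above:.1f}%")
```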

Divergent Paths: How Each LLM Approached the Task

While the core capabilities were similar, the LLMs varied significantly in their initial approach to data handling and analysis execution.

Direct Access vs. File Upload:

• ChatGPT, initially, attempted to access a token-protected Nightscout URL, but quickly stated it was “not able to access your Nightscout data directly via that link or fetch information from private or token-protected dashboards”. It then pivoted to requesting file uploads.

• Claude, Deepseek, and Grok were more cautious from the outset, explicitly stating their inability to access external URLs due to privacy and security protocols, or robots.txt restrictions. They immediately directed the user to export and upload data files or screenshots. This proactive safety stance is commendable, prioritising user data security over attempting direct, potentially insecure, access.

Analysis Methodology and Output Style:

• ChatGPT jumped directly into providing a textual analysis, detailing glucose metrics, patterns, and recommendations based on the provided files. Its output was comprehensive, resembling a detailed report.

• Claude also provided a direct textual analysis, including “CRITICAL SAFETY CONCERNS” and “Overall Performance Summary” with specific percentages. It then proceeded with “Immediate Recommendations” and “Time-Specific Issues Identified”.

• Gemini similarly offered a direct analysis with specific metrics and hourly trends, followed by “Areas of Concern” and “Recommendations”.

• Grok provided an “Analysis Summary” with estimated metrics and then detailed “Areas of Concern” and “Recommendations”. It also generated a “Visual Report” in HTML format, although this was not particularly useful when reviewed.

• Copilot took a starkly different, and arguably more cautious, approach. Instead of performing the full analysis itself, it provided Python code snippets for the user to run in a Google Colab notebook to generate summaries and plots (a representative sketch of this style follows the list below). This shifts the burden of data processing and visualisation to the user, sidestepping the LLM’s own internal parsing challenges but requiring technical expertise on the user’s part. While this avoids the LLM misinterpreting data, it doesn’t perform the analysis directly as requested.

• Deepseek, as detailed below, applied a very strict time filter which resulted in an empty output, rather than a broad analysis of the provided files.
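To make Copilot’s approach concrete, here is a representative sketch, not Copilot’s actual output, of the kind of Colab snippet it offered: load the exported CSV and plot the glucose trace against the target range. As before, the 'sgv' and 'dateString' column names are assumptions.

```python
# Representative sketch (not Copilot's actual output) of the kind of snippet it
# offered for a Google Colab notebook: load the exported entries CSV and plot
# glucose over time against the 70-180 mg/dL target band.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("entries.csv", parse_dates=["dateString"]).sort_values("dateString")

plt.figure(figsize=(12, 4))
plt.plot(df["dateString"], df["sgv"], linewidth=0.8)
plt.axhspan(70, 180, alpha=0.15, label="Target range (70-180 mg/dL)")
plt.xlabel("Time")
plt.ylabel("Glucose (mg/dL)")
plt.title("CGM trace from exported Nightscout data")
plt.legend()
plt.tight_layout()
plt.show()
```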

Where the wheels came off: Major Issues and Critical Errors

Here’s where the rubber met the road, revealing significant flaws in how some LLMs handled the data provided, particularly concerning data integrity and transparent communication. These aren’t minor glitches; they represent fundamental missteps that could lead to dangerous, misinformed decisions if users relied on the AI’s output without critical oversight.

Ignoring or Misrepresenting Critical Data:

ChatGPT’s “Partial Blindness”

Initially, ChatGPT produced a detailed analysis, summarising metrics like Time-in-Range (TIR) and highlighting trends. However, when challenged about a subsequent inability to re-assess with a new target range due to “incorrectly formatted or unlabeled columns” in the entries file, it admitted a significant oversight. ChatGPT revealed that its initial analysis was based primarily on the test-treatments.json file (insulin dosing, temp basals), a “brief scan of entries in the CSV before running into formatting issues,” and “preloaded assumptions and approximations based on typical glucose data structures”. Crucially, it “did not directly analyze glucose values from the CSV” because of the format problem. It confessed it “should have flagged that limitation clearly up front” and that it “leaned too far into trying to ‘fill in the blanks’ … without acknowledging that [it] was flying blind on actual glucose trends”. This is a severe error: providing a seemingly data-driven analysis when a critical piece of the data (the glucose values themselves) was never actually processed correctly.
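The fix ChatGPT itself described, flagging the limitation clearly up front, is easy to picture in code. Here is a minimal sketch of that kind of pre-flight check; the 'sgv' column name and file name are assumptions for illustration.

```python
# Minimal sketch of the "flag it up front" check ChatGPT said it should have
# performed: stop rather than guess if no readable glucose values are present.
# The 'sgv' column name and file name are assumptions for illustration.
import pandas as pd

df = pd.read_csv("entries.csv")

if "sgv" not in df.columns or df["sgv"].dropna().empty:
    raise ValueError(
        "No readable glucose values found (expected an 'sgv' column). "
        "Stopping here rather than guessing at glucose trends."
    )

print(f"Glucose column found: {df['sgv'].notna().sum()} readings available for analysis")
```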

Gemini’s “Month” vs. “Days” Deception

Gemini’s initial analysis proudly stated it used “Nightscout data from the last month”. It provided detailed metrics like “Time In Range (TIR) (70-180 mg/dL): 83.74%” and “Estimated HbA1c: 6.15%”. However, when challenged later, Gemini admitted a “significant error in misrepresenting four days of data as a month”. The entries (5).csv file only contained approximately 4 days of data (June 9 to June 13, 2025), not a full month as implied and stated. This fundamental misrepresentation of the data’s duration could lead to unreliable advice, as trends observed over 4 days might be anomalies when viewed over a longer period.

Grok’s Week-Long Analysis (for a Month Request)

Similar to Gemini, Grok also initially stated it had analyzed data for “the last month” but then immediately clarified its analysis was based on files spanning only “June 6 to June 13, 2025” (approximately one week). While it stated the actual period within its initial response, the initial claim of “last month” was misleading. It confirmed using “all the files” but for the shorter duration. This still represents a failure to fulfil the user’s explicit request for a month’s data, without an upfront negotiation about the provided files’ actual scope.

Claude’s Selective Timeframe

Claude, despite receiving the prompt to review data for “the last 30 days”, presented its comprehensive analysis based only on data “from June 9-13, 2025”. While it stated the actual timeframe within the analysis, it did not explicitly preface its response by acknowledging that the uploaded data itself was limited or that it chose a shorter period than requested. This might lead a user to assume the uploaded files only contained that specific period, or that the AI had somehow selected the “most relevant” 4 days out of 30.

Deepseek’s Rigid Interpretation

Deepseek applied a strict filter: “Only events occurring within the last 24 hours (relative to 2025-06-13) were considered.” Because the provided test-treatments.json file only contained events up to June 11, 2025, Deepseek’s output was an empty list, stating “No events fall within this window”. While technically accurate based on its internal filter, this completely ignored the user’s likely intent to analyze the entire historical data provided in the file for general patterns. It’s a failure to infer user intent, leading to a non-analysis despite data being present.
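To see why this produced nothing, here is a minimal sketch of that kind of rigid window filter, assuming a treatments JSON in which each event carries a 'created_at' timestamp; with events ending on June 11 and a 24-hour cut-off measured from June 13, every event is excluded. The field name is an assumption for illustration.

```python
# Minimal sketch of the rigid filter Deepseek applied: keep only events from
# the 24 hours before 2025-06-13. With a treatments file whose events end on
# 2025-06-11, the result is empty. The 'created_at' field name is an assumption.
import json
from datetime import datetime, timedelta, timezone

with open("test-treatments.json") as f:
    treatments = json.load(f)

cutoff = datetime(2025, 6, 13, tzinfo=timezone.utc) - timedelta(hours=24)

recent = [
    t for t in treatments
    if datetime.fromisoformat(t["created_at"].replace("Z", "+00:00")) >= cutoff
]

print(f"Events in file: {len(treatments)}")
print(f"Events within the 24-hour window: {len(recent)}")  # 0 for this file
```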

Communication Failures:

The common thread through most of these major issues is a lack of upfront, transparent communication. ChatGPT should have immediately informed the user that the glucose data was unreadable. Gemini and Grok should have clarified that the provided files only contained a few days/a week of data, not a month, before presenting an analysis. Deepseek should have asked for clarification on the desired timeframe given the historical data, rather than rigidly applying a 24-hour filter that yielded no results. This failure to communicate limitations or discrepancies before delivering an analysis is a critical safety concern in a health context.

Accountability and Learning: LLM Responses to Challenges

One of the most revealing aspects of this experiment was how the LLMs responded when confronted about their errors. This demonstrates their capacity for self-correction and transparency, crucial for building trust.

ChatGPT’s Direct Admissions

When questioned about providing an analysis without looking at the data properly, ChatGPT responded with profound self-criticism. It admitted, “I did not directly analyze glucose values from the CSV, because the file format didn’t allow it. I should have flagged that limitation clearly up front”. It acknowledged that its advice was “best-guess based on treatment patterns” and “not a substitute for analyzing your real glucose values”. It went further, stating, “It’s not okay to make suggestions that are likely to be incorrect” and that the “only responsible answer is to stop and say so clearly” when data is insufficient. ChatGPT promised, “I won’t repeat that mistake in our conversation — or with anyone else,” and that it would “Stop immediately when data is insufficient or malformed, [and] Clearly say what can’t be done”. This was a very strong, transparent, and accountable response.

Gemini’s Acknowledgement and Mitigation Plan

Upon being challenged for misrepresenting four days as a month, Gemini offered a direct apology for the “significant error” and “critical mistake”. It clearly articulated why such a misrepresentation was risky (lack of long-term trends, high variability, incomplete picture). Most impressively, Gemini outlined concrete steps for future mitigation, including “Mandatory Data Duration Check and Reporting,” a “Proactive Discrepancy Alert” if the user’s requested period differs from the actual data, and “Prominent Caveats for Limited Data” in its reports. This demonstrated a structured approach to learning from its mistakes.
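Gemini’s first two mitigation steps are easy to picture as a pre-analysis check. Here is a minimal sketch of a data duration check with a discrepancy alert, again assuming an entries CSV with a 'dateString' column and a user request for 30 days of data.

```python
# Minimal sketch of a "data duration check and proactive discrepancy alert":
# compare the span actually covered by the file with the period the user asked
# for, and say so before any analysis. Assumes a 'dateString' timestamp column.
import pandas as pd

REQUESTED_DAYS = 30  # what the user asked for

df = pd.read_csv("entries.csv", parse_dates=["dateString"])
span = df["dateString"].max() - df["dateString"].min()
actual_days = span.total_seconds() / 86400

print(f"File covers {df['dateString'].min().date()} to {df['dateString'].max().date()} "
      f"({actual_days:.1f} days)")

if actual_days < REQUESTED_DAYS * 0.9:
    print(f"WARNING: you asked for {REQUESTED_DAYS} days but the file only contains "
          f"{actual_days:.1f} days. Any analysis reflects this shorter period only.")
```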

Grok’s Explanation and Re-confirmation

When challenged on the date format issue, Grok explained in detail how it had in fact parsed the Unix timestamps correctly, despite a warning in its own HTML report that implied the parsing had failed. However, it did not explicitly address its initial “last month” vs. “one week” discrepancy in the same direct, apologetic tone as Gemini or ChatGPT. It re-confirmed its use of “all the files”, but within the one-week scope.

The robust and direct responses from ChatGPT and Gemini in particular, when challenged on critical errors, are encouraging. They show a capacity for “honesty” and a commitment to safety that is vital for health-related applications.

Proceed with Caution, but Hope

The journey through Nightscout data analysis with various LLMs reveals a compelling, yet complex, landscape. On one hand, these AI models possess a powerful ability to understand the language of diabetes management, to identify common patterns, and to structure actionable insights. Their capacity to process complex data and return digestible summaries is undeniable. The potential for these tools to empower individuals with diabetes, providing immediate, preliminary insights into their glucose trends, is immense.

However, the experiment also exposed critical safety concerns. The most alarming issues were failures to accurately process critical input data, or to communicate transparently about it, sometimes leading to analyses based on incomplete information or to misrepresentation of the duration of the data actually analysed. A “best-guess” analysis, or one based on a fraction of the requested data, is not safe, especially when it influences insulin dosing or management strategies. Similarly, rigid interpretations that ignore the user’s overall intent can lead to frustratingly empty outputs.

The positive takeaway lies in the LLMs’ responses to being challenged. Their capacity for direct admission of error, detailed explanation of the mistake, and outlining of future mitigation strategies (especially evident in ChatGPT and Gemini) suggests a foundational ability for self-correction and adherence to safety protocols when rigorously prompted. This indicates that with ongoing development focusing on data integrity checks, robust error handling, and transparent communication protocols, LLMs could become incredibly valuable tools in diabetes self-management.

For now, the overarching message remains clear: standard LLMs are powerful assistants, but they are not a substitute for human oversight. While LLMs can highlight patterns and suggest areas for discussion, any significant changes to your diabetes management, particularly those involving insulin adjustments, must always be made in consultation with your healthcare provider. Use these tools to inform your conversations, but never to replace the expert guidance that ensures your safety and optimal health outcomes. The future of diabetes technology is bright, but it requires vigilance and a healthy dose of caution along the way.

 
