Following increased use of third party add-on transmitters that convert the Libre system into a CGM, and the use of these with DIY APS systems, a case study was undertaken to look at accuracy and the effects of calibration on accuracy.
Libre was used with MiaoMiao and xDrip to capture two datasets, one using xDrip auto-calibration and one using twice daily calibration. The data was uploaded to NightScout for storage, and analysis was undertaken using the Diabetes Technology Society Surveillance Error Grid, Modified Bland-Altman plot and standard MARD, Bias and Standard Deviation analyses.
The key result was that when a sensor produced data using the Abbott algorithm that was widely variable from blood tests, it tended to have more variance in the data produced by xDrip than when a sensor behaved well. In addition, as the sensor aged, variance in the final two days on twice daily calibration was greater. Finally, less frequent sensor calibration appeared to result in better accuracy and a negative bias, however confounding factors early in the first calibration cycle may have affected these results.
While frequency of calibration appears to play a part in the accuracy of the data produced using third party add-ons to the Libre, sensor age and sensor native behaviour appear to play more important roles. Users need to be aware of this when they start to use a third party add-on system.
As the use of third party add-on transmitters that convert the Freestyle Libre into a form of CGM has increased, and as this data has been incorporated into DIY Hybrid Closed Loops (APS), stories of people being hospitalised due to the use of these systems have circulated. In question is the quality of the data being produced and fed into the APS and whether it can be relied on. This is likely to depend on a combination of the sensors themselves and the calibration approach taken.
Within the Facebook support groups for these systems, a calibration structure has been recommended to try to maintain better data, covering how often, when and how to calibrate.
This case study uses n=1 data to review how much the output from the Libre with a MiaoMiao transmitter and xDrip software varies compared to blood glucose data and also to Dexcom G6 data under two different calibration models, and whether there are any patterns that can be identified to provide better advice to users of these systems.
In order to look at how the Libre with MiaoMiao and xDrip (Libre) fared, the following configuration was used:
- Libre with MiaoMiao
- xDrip on a standalone phone
- Dexcom G6 using Native settings, started with the non-calibration, coded function. The Dexcom sensor was changed at the end of its manufacturer’s recommended life, i.e. 10 days.
- Upload to a single NightScout instance to capture all sensor and manual blood glucose inputs, from two separate collection devices
- All manual blood tests were undertaken on a Contour Next One, which is widely regarded as being the most “accurate” of the available home blood testing meters
For the first seven days of use, the Libre was calibrated using the xDrip “auto-calibration” function, which is supposed to select the most appropriate time to calibrate a sensor, based on a number of factors, including time since last calibration and variation in the glucose entries.
For the second seven days of use, all the previous calibrations were deleted using xDrip’s function for a fake sensor stop. The Libre was then calibrated on an approximately twice daily basis, adhering to the guidance for best practice provided within the Spike Facebook group.
During both of these periods, multiple manual blood tests were taken that were used only for comparison of data, rather than for calibration. During week one there were 44 manual blood tests. During week two there were 48 manual blood tests.
The data was extracted from the Entries table of the NightScout database, and for each manual blood glucose reading taken, the immediately preceding sensor values were taken for each of the Libre and the G6. This method was used to ensure that any calibration entries for the Libre did not skew the outcomes of a post-manual blood glucose test where they were being used for calibration.
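The pairing step described above can be sketched roughly as follows. This is a minimal illustration rather than the script used for the study: the function names, the tuple layout and the sample readings are all hypothetical, and timestamps are simplified to minutes.

```python
from bisect import bisect_left


def preceding_sensor_value(sensor_readings, test_time):
    """Return the last sensor value recorded strictly before test_time.

    sensor_readings is a list of (timestamp, value) tuples sorted by
    timestamp. Returns None when no reading precedes the blood test.
    """
    times = [t for t, _ in sensor_readings]
    idx = bisect_left(times, test_time)  # first index with time >= test_time
    if idx == 0:
        return None
    return sensor_readings[idx - 1][1]


def pair_with_blood_tests(sensor_readings, blood_tests):
    """Build (blood_value, sensor_value) pairs for every manual blood test."""
    pairs = []
    for test_time, blood_value in blood_tests:
        sensor_value = preceding_sensor_value(sensor_readings, test_time)
        if sensor_value is not None:
            pairs.append((blood_value, sensor_value))
    return pairs


# Hypothetical data: time in minutes, glucose in mg/dL.
libre = [(0, 110), (5, 112), (10, 118), (15, 121)]
blood = [(12, 115), (15, 119)]
pairs = pair_with_blood_tests(libre, blood)
print(pairs)
```

Using a strictly earlier reading means a sensor value logged at the same moment as a blood test, which may already reflect a calibration from that test, is never used as the comparator.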
Analysis was then undertaken to determine the MARD and Bias from manual blood test for each of the Libre and the G6, and a standard deviation of MARD values was derived. Surveillance Error Grids and modified Bland-Altman plots were also generated using the Diabetes Technology Society tools.
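For readers who want to reproduce this type of analysis, a minimal sketch of the three measures follows. It assumes that bias is the mean signed relative difference from the blood test, that MARD is the mean of the absolute relative differences, and that the standard deviation is the sample standard deviation of those absolute relative differences. The function name and the sample pairs are hypothetical.

```python
from statistics import mean, stdev


def accuracy_metrics(pairs):
    """Compute MARD, bias and SD (all in percent) from (blood, sensor) pairs."""
    # Signed relative difference of each sensor value from its blood test.
    relative = [(sensor - blood) / blood * 100 for blood, sensor in pairs]
    absolute = [abs(value) for value in relative]
    return {
        "mard": mean(absolute),   # mean absolute relative difference
        "bias": mean(relative),   # positive means the sensor overstates
        "sd": stdev(absolute),    # assumption: sample SD of the ARD values
    }


# Hypothetical (blood, sensor) pairs in mg/dL.
example_pairs = [(100, 110), (200, 180), (150, 150), (80, 92)]
metrics = accuracy_metrics(example_pairs)
print(metrics)
```

A positive bias here corresponds to the sensor overstating glucose levels, and a negative bias to understating them.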
In addition, data from a previous test of the Libre was used to verify observations from this test, with similar analysis undertaken.
Days one to seven
The results for days one to seven are shown below. Table 1 shows the calculated values for MARD, Bias and Standard Deviation.
On days one and two of use of the Libre sensor, the scans with the Libre reader showed a significant difference from blood tests and the Dexcom G6. After this, the sensor “woke up” and aligned more closely with the blood tests.
Figure 2 is the Surveillance Error Grid (SEG) for days 1-7 and figure 3, the modified Bland-Altman plot for the same period.
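The plots themselves came from the Diabetes Technology Society tools, but the idea behind a Bland-Altman style view can be sketched in a few lines. The sketch below assumes that “modified” means each difference is plotted against the blood test value rather than against the mean of the two methods, and it uses the conventional 1.96 standard deviation limits of agreement. The DTS tool may differ in detail, and the function name and data are hypothetical.

```python
from statistics import mean, stdev


def bland_altman_points(pairs):
    """Return plot points, mean difference and 95% limits of agreement.

    pairs holds (blood, sensor) values. Each point is
    (blood, sensor - blood), i.e. difference against the reference.
    """
    points = [(blood, sensor - blood) for blood, sensor in pairs]
    differences = [difference for _, difference in points]
    mean_difference = mean(differences)
    spread = 1.96 * stdev(differences)
    limits = (mean_difference - spread, mean_difference + spread)
    return points, mean_difference, limits


# Hypothetical (blood, sensor) pairs in mg/dL.
sample = [(100, 110), (200, 180), (150, 150), (80, 92)]
points, mean_difference, limits = bland_altman_points(sample)
print(points, mean_difference, limits)
```

A wide gap between the two limits corresponds to the wide dispersion described in the text.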
Days one and two of using this particular sensor appear to show significantly greater variation in values during the period that the scanner was showing disparate data from the manual blood tests. This also shows as relatively wide dispersion in the SEG.
There is a single outlier showing on the SEG, which in the detail of the data showed a Libre value that was approximately double that of the blood test and Dexcom G6 comparators, with an 89% overstatement of the glucose level. It’s worth noting that during the first two days, the mean bias of the Libre was 12% on day one and 17% on day two.
The following days saw the Libre scans align more closely with manual tests and the measures of accuracy become closer to zero.
Days eight to fourteen
The results for days eight to fourteen are shown below. Table 2 shows the calculated values for MARD, Bias and Standard Deviation.
Figure 4 is the Surveillance Error Grid for days 8-14 and figure 5 the modified Bland-Altman plot for the same period.
During days eight to fourteen, the overall MARD from blood was lower than that of days one to seven. As experienced with Abbott’s own data, as the sensor got older, the values for MARD became larger, and this is clear on days 12 to 14. During this second period, the bias on the Libre was to understate rather than overstate glucose levels, unlike the first seven days where they were overstated.
Dexcom G6 Data
For comparison, the Dexcom G6 data is shown below.
As two sensors were used, tabulated data is shown for both, but due to the short period of use of the second sensor, the Surveillance Error Grid and modified Bland-Altman plot were only generated for the first sensor.
Overall, the G6 showed lower variation from blood, however, the data suggests that on a number of days it was not significantly different from the Libre. The standard deviation for the G6 does appear significantly lower than that of the Libre throughout the Libre testing.
December Libre test days one to six
In addition, data was generated from the first six days of a Libre that was previously trialled in December. This is included as the sensor was known to be particularly inaccurate when using the Abbott LibreLink application to scan readings.
A similar dataset for the December sensor was analysed over days one to six of its use, as the trial at that time was not intended for this study, and involved switching the algorithm in use on day six.
The data generated from this sensor tends to show a greater variance in the results, with all three indicators having a wide dispersion. This is reflected in both the SEG and the Bland-Altman plot.
Given the concerns relating to use of the Libre in DIY closed loop systems, the questions that arose looked at the calibration techniques people are using (as referenced in this article) and also whether the sensors themselves had a part to play in the outcomes.
Looking at the data in this n=1 sample, there appears to be a correlation between frequency of calibration and variation in the readings produced by xDrip. Week one of the Libre sensor shows considerably more variation overall than week two, however, it’s worth taking into account the first two days, where the variation is significantly worse.
Overall during the first seven days, the MARD for the sensor was 13.6%, however, if the first two days are ignored, this drops to 11.9%, which while still higher than that of the second seven days, is a significant reduction, and suggests that well timed calibrations (assuming xDrip is timing the calibrations well) shouldn’t play too great a part in the outcomes.
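The effect of leaving out particular days can be checked with a small sensitivity calculation like the one sketched below. The records are made up purely for illustration and are not the study data. The sketch pools all readings before averaging, which is an assumption about how the overall figure was derived.

```python
from collections import defaultdict
from statistics import mean


def ard(blood, sensor):
    """Absolute relative difference of a sensor value from blood, in percent."""
    return abs(sensor - blood) / blood * 100


def mard_by_day(records):
    """Return {day: MARD} for (day, blood, sensor) records."""
    by_day = defaultdict(list)
    for day, blood, sensor in records:
        by_day[day].append(ard(blood, sensor))
    return {day: mean(values) for day, values in sorted(by_day.items())}


def mard_excluding(records, excluded_days=()):
    """Pooled MARD over all records whose day is not excluded."""
    kept = [ard(blood, sensor) for day, blood, sensor in records
            if day not in excluded_days]
    return mean(kept)


# Hypothetical records: (sensor day, blood mg/dL, sensor mg/dL).
records = [(1, 100, 130), (2, 100, 120), (3, 100, 110),
           (4, 200, 210), (5, 150, 147)]
overall = mard_excluding(records)
settled = mard_excluding(records, excluded_days={1, 2})
print(mard_by_day(records), overall, settled)
```

Comparing the two pooled values shows how much a poor start to a sensor's life can inflate the headline MARD for a whole week.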
The second seven days shows a less frequent calibration strategy, with an average of 1.9 calibrations per day. This would appear to hold up the hypothesis that less frequent calibration results in overall better results, however, it’s worth highlighting the last three days of data. Whilst one had four calibrations, the other two didn’t. Whether the calibrations on day 12 affected days 13 and 14 is difficult to say, as accuracy using Abbott’s algorithm is known to get worse towards the end of the sensor, but both appear to have a larger MARD.
In comparison to the Dexcom, the MARD doesn’t seem to be that far off, and in some cases is better, whereas the G6 bias and standard deviation seem to be better. Of note is that the output from the second non-calibrated G6 sensor on days 1-4 appears to be worse.
The other thing of note in week two is that the bias of the system with less frequent calibrations has become negative, compared with a very positive bias in the first week. The standard deviation in the second week is also, generally, less.
When using a CGM system for looping, in general it’s safer to have a negative bias on the data than a positive one as it reduces the risk of overdosing of insulin, and the more widely spaced calibrations on xDrip seemed to tend in this direction, although, it’s possible that even when trying to maintain good calibration standards, there were one or two erroneous ones.
And this brings me on to the comparator data set from December. The frequency of calibration was not too far different from that seen in week two of the initial dataset, however the outcomes appear to be very different. All three indicators suggested a less accurate outcome, even though the calibration model is not too dissimilar, and the number of points compared (48) is similar to the initial dataset.
This raises the question as to what the confounding factors are in accuracy of xDrip/MiaoMiao/Libre.
Sensor native accuracy and age
The key thing in both these data sets is that when the underlying native sensor data is significantly out, using xDrip appears to result in greater divergence.
The first two days of the first dataset show a significant variance, which was during a period of poor performance of the Abbott algorithm, and I suspect, sensor. Likewise, the data from December was also created using a sensor that consistently read well below manual blood glucose readings throughout its life. The resultant accuracy of xDrip, while being better than nothing, was still not especially good.
The following two figures attempt to capture an idea of how variance occurs after a calibration. On each figure, the grey area shows calibrations and the orange line the variance of the Libre data from blood. A one value of the grey area is a calibration. A two value is a blood test with no calibration.
It’s possible to see in both figures that outside the calibration points there appears to be significant variance from blood testing, which users should be aware of. Although the MARD in the period with less calibration appears better in the results, there seems to be noticeable drift in the data between calibrations. When this is compared to the variation of the Dexcom, it is only when the sensor is behaving poorly that the Libre is significantly worse than the uncalibrated Dexcom. This is shown in Figure 15.
Only two cases have really been covered in this case study: automated calibration and approximately twice daily calibration. That leaves once daily calibration and less frequent calibration still to be investigated.
It appears that on a well behaved sensor, the less frequent calibration looks to work effectively, but it is harder to comment on a poorly behaved sensor, as those seem to produce poor results with the xDrip/MiaoMiao combination, regardless of the number of calibrations.
The primary purpose of this case study was to address questions about the accuracy, or otherwise, of the Libre, MiaoMiao and xDrip combination, and to understand how much of an impact calibration has. As an n=1 study it is not powered enough to draw widespread conclusions.
Having said that, there appears to be one clear conclusion.
Libre sensors that behave well on the Abbott software and reader generally perform well on the MiaoMiao/xDrip combination. Those that produce very low numbers compared to blood on the Abbott algorithm seem to perform less effectively, and this can include wildly varying outputs which are substantially higher than the actual blood value.
The other question related to the effects of calibration on accuracy. This is less clear cut, but it’s fair to say that the period using auto-calibration in xDrip, excluding the first two days of the sensor, generated a greater MARD than the period with calibrations taking place less than twice per day. There is always the possibility that the early calibrations on days one and two skewed this data though.
With this in mind, just off this dataset, calibration once or twice per day appears to be more effective, but with the caveat mentioned above and also bearing in mind that a poorly performing sensor appears to still produce relatively variable results even with a better calibration approach. In addition, as the sensor ages, the results also become less close to the comparative blood tests.
As a result, a further case study will be run looking at once daily calibration and more frequent calibration over a well behaving sensor.
Comment is often made about the variability of the manual glucose monitor used. For this particular study, a Bayer Contour Next meter was used as the default across all the glucose testing due to its status as the top rated meter in the Diabetes Technology Society program.
As referenced in the article “The Quantitative Relationship Between ISO 15197 Accuracy Criteria and Mean Absolute Relative Difference (MARD) in the Evaluation of Analytical Performance of Self-Monitoring of Blood Glucose (SMBG) Systems”, MARD for a blood glucose meter would need to fall within 3.25%-5.25% to guarantee that the meter met the ISO standard, which suggests that fingerprick blood test meters generally produce considerably more accurate results than CGM systems.
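To give a sense of what meeting the ISO standard involves, the sketch below checks a set of meter readings against the main system accuracy criterion of ISO 15197:2013: at least 95% of results within 15 mg/dL of the reference below 100 mg/dL, and within 15% at or above 100 mg/dL. It is an assumption that this is the edition relevant here, the additional consensus error grid requirement is left out, and the function names are hypothetical.

```python
def within_iso_limits(reference, meter):
    """Check one reading against the ISO 15197:2013 accuracy limits (mg/dL)."""
    if reference < 100:
        return abs(meter - reference) <= 15
    return abs(meter - reference) <= 0.15 * reference


def meets_iso_15197_2013(pairs, required_fraction=0.95):
    """True when enough (reference, meter) pairs fall inside the limits."""
    inside = sum(within_iso_limits(reference, meter) for reference, meter in pairs)
    return inside / len(pairs) >= required_fraction


# Hypothetical readings in mg/dL: the third pair is outside the limits.
readings = [(90, 100), (200, 225), (150, 180), (120, 118)]
print(meets_iso_15197_2013(readings))
```

With only three of the four hypothetical readings inside the limits, this small sample falls short of the 95% requirement.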
Really interesting case study Tim – thank you for your research on this and for everything you do.
One very minor thing to point out in the Spike Calibration guidance that I posted (and I know you have picked it up in your research, but as it’s not mentioned specifically at the point you’ve linked to it I thought it’s worth mentioning), is that part of the best practice is not to calibrate too often. I would consider once a day, or once every 48 hours as way too often for me as sooner or later it’s likely to result in inaccuracy. For me, I calibrate on day 1 as is mandatory (although if not optimal I delete as soon as I can and redo when conditions are optimal), day 2, day 4… and then I check after that every few days, or if I suspect there may be an issue, but I rarely have to calibrate again.
The only variance to this is that I now have in Spike the ‘non-fixed calibration slopes’ function enabled. This is only relevant to Libre Spike users (it’s on as a default for Dex/Spike users) and means that multi-point calibrations are taken into account rather than just a single point offset. (I need to update my post to reflect that function should be enabled). That then means if I did the first calibration or two at approximately the same range, I’ll seek a point when I’m higher or lower than that but still level and stable, then calibrate again to capture that point in my calibration regime.
I’d be really interested in the results of a week 1 & 2 case study on the basis of what I’ve outlined in the guide and above, using more strict level and stability criteria than Spike’s (and I assume xDrip+’s) own ‘calibrate now’ alarm does, and see how that compares.
Thanks again for this and all you do Tim. It’s quite an honour to have you link to my post. Looking forward to the London Looping day hopefully being rescheduled and being able to thank you in person (he says praying you don’t reschedule on a day where there’s a Watford home match).
Thanks for the feedback Rob. As I mentioned in the conclusions, a further data collection is needed to look at once a day (or less) calibration to determine what that shows.
I question what the outcome would be with less frequent calibration when the sensor is not performing well under the native algorithm, as I’d suspect that I’d see a fair amount of drift.
Regardless, it’s another item to check, and as mentioned in the article, the auto-calibration appeared to not produce optimal results, but that was caveated with questionable day 1 and 2 performance.
In spite of the above, it raises the question in terms of ease of use and safety as to how effective these solutions are if they’re so susceptible to variation in calibration.
Tim, how did you enable the OOP algorithm using MiaoMiao and xDrip+? I have a MiaoMiao2 with a Canadian Libre sensor, but I haven’t found clear instructions to make this work. Thanks!
I would agree with Rob’s strategy as best approach to know when and how much the MM/Spike combo is adrift. As much as accuracy can be a challenge the ability to glance at my watch, or augment the current BG trend with IOB/COB I think more than offsets the accuracy negatives compared to the basic Libre App/Reader.
I guess the question is how you use it. Given the accuracy challenges, would you be comfortable with an automated insulin delivery system using it?
No, as with the Libre readings, dosing decisions are too dependent on my assessment of how much I trust the readings. As you noted there seem to be good sensors that calibrate well and are no hassle and then you get the ones that are a real pain and always needing to be wiped/recalibrated.
This is really necessary data you show. Miaomiao and xDrip+ are still more or less “experimental” to users. I have been using MM+Libre now for a year, calibrate once daily. I take the blood glucose reading from xDrip+, another with Libre reader and a third with fingerprick. I must say I am relatively satisfied with the result, and feel that my control of T1DM has greatly increased because of these systems. The instructions for both MM and xDrip+ are still quite deficient and this kind of data you display is very necessary.
Very interesting findings. I have been using BluCon since 2017 with the Ambrosia LinkBluCon app on iPhone. I recorded readings over a period of 60 days in late 2017 (calibrating daily), then in 2018 (calibrating daily) and most recently in June/July 2019 (calibrating once every 2 days) and compared them with Libre Reader and finger prick readings. I noticed the average over a period of 60 days is the same in Libre Reader and the LinkBluCon app.