Lies, damned lies and statistics. The art of the CGM accuracy study.

How do you compare two different CGMs? It seems like a stupid question. It’s obvious. You look at the Mean Absolute Relative Difference (MARD) of the two and the lower one is better. Right?

In theory. And yet, you might come away disappointed.

Everyone who uses a CGM has, in some way, asked themselves “How are my readings so different from the reported accuracy?”

Meanwhile, at ATTD2022 this year, Boris Kovatchev stated that we shouldn’t be comparing AID systems that use different CGM systems. Now why might he state that?

What follows is a look at the world of CGM accuracy studies, how they might be run with potentially different outcomes in mind and what that means for the end user, then a comparison across some existing studies. A key takeaway is that MARD alone is not good enough to compare systems.

And if that proves too much to read, then the message is, “We need a consensus on CGM accuracy studies”. Without one, systems like Aidex can end up on the same health care payer system as Dexcom due to similar “accuracy” when a detailed review of the studies would suggest otherwise.

How is “Accuracy” assessed?

If you want to find out the accuracy of a system, as a manufacturer, you arrange a clinical study that involves a reasonable sample of people, enough to get the required number of data points to “power” your analysis. This often requires that participants wear multiple sensors in order to scale up the data.

Then, at regular points over the course of the sensor wear (say 3 to 5 times), you bring participants into a clinic, take venous blood samples every fifteen minutes over a period of around 8-12 hours and analyze them using the gold standard Yellow Springs Instruments (YSI) analyzer.

In some cases, you might also compare to fingersticks, to see how your system will be observed out in the wild and what users are likely to see.
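For reference, the headline number that comes out of all those paired readings is just an average of relative errors. A minimal sketch of the calculation, using made-up paired values rather than data from any study (the function name `mard` is mine):

```python
def mard(cgm_readings, reference_readings):
    """Mean Absolute Relative Difference, as a percentage.

    Each CGM reading is compared with the reference (e.g. YSI) value
    taken at the same time; both lists are in the same units.
    """
    if len(cgm_readings) != len(reference_readings):
        raise ValueError("readings must be paired one-to-one")
    if not cgm_readings:
        raise ValueError("need at least one pair")
    relative_errors = [
        abs(cgm - ref) / ref
        for cgm, ref in zip(cgm_readings, reference_readings)
    ]
    return 100 * sum(relative_errors) / len(relative_errors)


# Hypothetical paired values in mg/dL (not taken from any study).
reference = [100, 200, 50, 150]
cgm = [110, 180, 60, 150]
print(f"MARD: {mard(cgm, reference):.1f}%")
```

Note that every pair carries equal weight, so which pairs end up in the data set matters as much as how the sensor behaves.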

Seems simple, doesn’t it?

Comparing accuracy studies

Firstly, it’s not always easy to understand the basis of the accuracy study behind the summary figures for a given sensor. Accuracy studies are not always published openly. Abbott did so for the Libre, as did Dexcom for the G6 and Medtrum for one of their sensors, but finding the definitive studies for Aidex brings you up against a paywall, which makes comparing data challenging.

Secondly, clear information on many factors of the study is often missing.

In this article in Diabetes and Vascular Disease Research, the authors highlighted all the points that might affect the outcomes of a study. The table they published, shown below, is very helpful in this respect:

Source DOI: 10.1177/1479164118756240

Why does this matter?

If the methods used to capture the data and manage the participants are different, it can affect variability of glucose levels and thus reported values.

Similarly, selecting a population with the potential for lower variability may present a set of data that appears to have a lower MARD.

Add in the variability between the devices themselves, and you have a potential world of pain for a reader to sift through.

Lower glucose variability? How does that help?

At a very basic level, when glucose levels are rising or falling rapidly, interstitial fluid glucose concentrations tend to lag venous levels more significantly than during periods of more stable glucose levels. (Pleus, S, Schoemaker, M, Morgenstern, K. Rate-of-change dependence of the performance of two CGM systems during induced glucose swings. J Diabetes Sci Technol 2015; 9: 801–807). Given this scenario, reducing the fluctuations in glucose levels during your study would, potentially, demonstrate a greater “accuracy”.
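A rough way to see the size of this effect is to imagine a sensor with no measurement error at all that simply reports what the reference read a few minutes earlier. The sketch below does that for a straight-line rise in glucose. The 10 minute lag, the 120 mg/dL starting point and the rates of change are illustrative assumptions, not measured values, and real sensors also apply their own filtering.

```python
def lag_only_mard(rate_mg_dl_per_min, lag_min=10, start=120.0, minutes=60):
    """MARD (%) of a noise-free sensor that reports the reference value
    from `lag_min` minutes earlier while glucose changes linearly.

    All parameters are hypothetical; pairs are taken every 5 minutes.
    """
    errors = []
    for t in range(lag_min, minutes + 1, 5):
        reference = start + rate_mg_dl_per_min * t
        sensor = start + rate_mg_dl_per_min * (t - lag_min)
        errors.append(abs(sensor - reference) / reference)
    return 100 * sum(errors) / len(errors)


for rate in (0.0, 1.0, 2.0):
    print(f"{rate:.0f} mg/dL/min -> lag-only MARD {lag_only_mard(rate):.1f}%")
```

With stable glucose the lag costs nothing; the faster the change, the more of the reported MARD is lag rather than sensor error.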

Additionally, at higher and lower glucose concentrations, CGM sensors are generally less likely to report “true” results. (Kropff, J, Bruttomesso, D, Doll, W. Accuracy of two continuous glucose monitoring systems: a head-to-head comparison under clinical research centre and daily life conditions. Diabetes Obes Metab 2015; 17: 343–349 & Rodbard, D. Characterizing accuracy and precision of glucose sensors and meters. J Diabetes Sci Technol 2014; 8: 980–985). Variability of accuracy over sensor life is also a known issue. (https://www.liebertpub.com/doi/10.1089/dia.2019.0262?url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&rfr_dat=cr_pub++0pubmed)
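Because MARD is a relative measure, part of the low-end picture is plain arithmetic: the same absolute error is a much bigger percentage of a low reference value. The sketch below splits paired readings by reference range and reports MARD for each. The pairs are invented so that every reading is exactly 10 mg/dL too high, and the 70 and 180 mg/dL cut-offs are my own choice of bands.

```python
from collections import defaultdict


def glucose_range(reference_mg_dl):
    """Assign a reference value to a band (cut-offs are an assumption)."""
    if reference_mg_dl < 70:
        return "below 70"
    if reference_mg_dl <= 180:
        return "70-180"
    return "above 180"


def mard_by_range(pairs):
    """pairs: iterable of (cgm, reference) in mg/dL -> {band: MARD %}."""
    errors = defaultdict(list)
    for cgm, reference in pairs:
        errors[glucose_range(reference)].append(abs(cgm - reference) / reference)
    return {band: 100 * sum(v) / len(v) for band, v in errors.items()}


# Hypothetical pairs: every CGM reading is exactly 10 mg/dL too high.
pairs = [(60, 50), (70, 60), (110, 100), (160, 150), (210, 200), (260, 250)]
for band, value in mard_by_range(pairs).items():
    print(f"{band}: {value:.1f}%")
```

A study with few pairs in the lowest band will therefore show a flattering overall figure even if the sensor behaves identically.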

Given this set of data, how might we shape a study to give us a better MARD for our new device?

What would a low MARD study look like?

Let’s start with glucose variability. We could manage this in a number of ways:

  • Select a population with lower glucose variability, e.g. a higher percentage of participants with non-insulin treated type 2 diabetes.
  • Select a population that is less likely to have hypoglycemia, e.g. a high percentage of non-insulin dependent participants with type 2 diabetes
  • Limit the number of participants with type 1
  • Ensure that at least some of your participants are in an environment where activity and food can be managed to reduce variability

That seems like quite a cynical ploy, but as we’ll see later, it may be quite crucial in determining outcomes.
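To get a feel for how far the population mix alone can move the headline number, here is a small sketch. The overall MARD across all pairs is the average of each group's MARD weighted by how many pairs the group contributes. The per-group figures (12% for type 1, 8% for non-insulin treated type 2) are made up purely for illustration and do not come from any study.

```python
def pooled_mard(groups):
    """groups: list of (number_of_pairs, group_mard_pct).

    Returns the MARD over all pairs, which is the pair-weighted
    average of the group MARDs.
    """
    total_pairs = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_pairs


# Hypothetical per-group MARDs, not taken from any study.
TYPE1_MARD = 12.0
TYPE2_MARD = 8.0
TOTAL_PAIRS = 1000

for type1_pairs in (900, 500, 100):
    groups = [(type1_pairs, TYPE1_MARD), (TOTAL_PAIRS - type1_pairs, TYPE2_MARD)]
    print(f"{type1_pairs / TOTAL_PAIRS:.0%} type 1 pairs -> "
          f"overall MARD {pooled_mard(groups):.1f}%")
```

Same sensor, same per-group performance, and the reported figure still shifts by several points depending on who was recruited.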

But what about device dependent variables? Once again, there are ways and means of managing this:

  • Where sensors fail prior to the end of the sensor life, include the data anyway

Where does this get us? Well, it suggests that for any MARD value published by a CGM manufacturer, we should be checking their methods.

Reviewing accuracy studies

Let’s take a look at a few. The four I’ve been able to uncover are from Abbott for the original Libre, MicroTech for the Aidex, Dexcom for the G6 and Medtrum. For each of these we’ll look at the population, the method and the handling of data from “lost” sensors, as well as the number of data points. Links or PDFs are provided, where possible, for all four studies.

Dexcom G6

Dexcom’s G6 accuracy study is one of the most comprehensive I’ve seen. In terms of the points raised above:

Study link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6110124/

  • Participants: 262
  • Number of in-clinic sessions: 1 or 3
  • Paired YSI data points: 21,569
  • Method: Home and in-clinic use. Blood glucose manipulation when in clinic to obtain a range of data points
  • Population: 260 participants with type 1; 2 participants with type 2
  • Handling of data from failed sensors: unclear. Appears included
  • Hypo pairs: (below 70mg/dl) 2,596 (12%)
  • Reported MARD: 9.9% adults; 10.1% juveniles

Additionally, with the glucose manipulation in clinic, the accuracy around rate of change was observed.

Abbott Freestyle Libre

Study link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8875061/

  • Participants: 75 of which 72 were included
  • Number of in-clinic sessions: 3
  • Paired YSI data points: 12,172 (13,195 capillary points)
  • Method: Home and in-clinic use. No Blood glucose manipulation when in clinic
  • Population: 33 participants using insulin; 39 participants not using insulin
  • Handling of data from failed sensors: unclear. Appears included
  • Hypo pairs: (classified as sub-100mg/dl) 4,138 (16%)
  • Reported MARD: 11.4%

Although not as comprehensive as the Dexcom study, this one includes a significant number of data points. It also attempts to compare the Libre with the home standard, fingersticks, to provide a better reference for users.

Medtrum

Study link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5835467/

  • Participants: 63 of which 60 provided valid data
  • Number of in-clinic sessions: 1
  • Paired YSI data points: 1,678
  • Method: Home and in-clinic use. No Blood glucose manipulation when in clinic
  • Population: 10 participants with type 1; 53 participants with type 2
  • Handling of data from failed sensors: unclear. Appears included
  • Hypo pairs: 546 (32.5%)
  • Reported MARD: 9.1% ± 8.7%

A much smaller study, with only one clinical session for YSI analysis.

Aidex

Study link: paywalled.

  • Participants: 120, of which 115 were used
  • Number of in-clinic sessions: 1 per participant on one of three days (randomly selected)
  • Paired YSI data points: 14,586
  • Method: Home and in-clinic use. 50% of participants were hospitalised during the study
  • Population: 14 participants with type 1; 101 participants with type 2
  • Handling of data from failed sensors: unclear. Appears included
  • Hypo pairs: 92 (0.6%)
  • Reported MARD: 9.08% Venous; 10.1% Capillary

What’s perhaps of most interest in this study is that 4 sensors were applied per person, so although there are a lot of data points, they are heavily skewed by hospitalisation and population.

Comparing the studies

As we can see, across all four of these studies, there are some major differences. The Dexcom study comes across as the gold standard, with a population that’s most likely to exhibit a high level of glucose variability.

Some of the others on the list are perhaps a little less golden. When we drew up our list of characteristics of a low MARD study, who would have anticipated that these studies would fulfil quite so many of them?

Both the Aidex and Medtrum studies have a very high proportion of populations with a potential for lower glycaemic variability. One of them also has half the population hospitalised.

Another item of note is the number of days, and when those days are, for YSI comparison. The gold standard is that all participants are tested at varying points through the sensor lifecycle. The “mud” standard is once, somewhere in the middle.

We can also look at the number of results in a hypoglycemic range. Once again, one study really stands out as having a tiny number of these, thus limiting potential inaccuracies.
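The shares are easy to check from the figures listed for each study above. The sketch below simply divides the reported hypo pairs by the reported paired YSI data points. It assumes those are the right denominators, and leaves out the Libre study because its figure uses a different threshold (sub-100mg/dl).

```python
# (hypo pairs, paired YSI data points) as listed for each study above.
studies = {
    "Dexcom G6": (2596, 21569),
    "Medtrum": (546, 1678),
    "Aidex": (92, 14586),
}


def hypo_share_pct(hypo_pairs, total_pairs):
    """Percentage of paired readings that fall in the hypo range."""
    return 100 * hypo_pairs / total_pairs


for name, (hypo_pairs, total_pairs) in studies.items():
    print(f"{name}: {hypo_share_pct(hypo_pairs, total_pairs):.1f}% "
          f"({hypo_pairs} of {total_pairs} pairs)")
```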

What can we take from this? Ultimately, these studies are not really comparable. The populations, methods and study settings often differ dramatically. As a result, the data collected doesn’t allow us to properly compare like for like. Even the two which appear to be of a higher standard have significantly different user populations.

As a result, we can’t really be sure that what we’re being told reflects the reality of living with these systems.

Conclusions

The major differences in the studies that provide accuracy data will be a key driver in why those using these systems consider that they are inaccurate and “don’t work”.

It’s more likely that the accuracy study that they’re referring to didn’t reflect their use of the product.

And in this context, the Medtrum and Aidex studies stand out from the rest, with both focusing mainly on participants with lower glycemic variability and, in the Aidex case, hospitalised participants. As a result, they are unlikely to reflect what happens for a type 1 user.

Therein lies the issue. To be able to form a fair assessment of a product, and whether it should be paid for by a health system, it should be possible to compare across studies.

What’s clear from this is that an international consensus on CGM accuracy assessments is urgently needed, especially with the arrival of numerous cheap CGMs that claim high accuracy but don’t necessarily live up to their billing.

I’ll give the last word to the article in Diabetes and Vascular Disease Research.

When comparing devices, the only way to minimise or eliminate factors that contribute to non-concordance of each system is to conduct a head-to-head comparison when different ISF devices are worn simultaneously by the same subject and an appropriate, and identical, reference method is used.


Footnote

After publishing this article, I was made aware of the IFCC working group on CGM who are looking to address some of these issues. Their details can be found at https://www.ifcc.org/ifcc-scientific-division/sd-working-groups/wg-cgm/

11 Comments

  1. This is a fabulous summary – thank you ever so much 🙂

    Your analysis of the importance of glucose variability and therefore the effect of different proportions of T1D / T2D patients in the MARD studies is very clear.

    I had instinctively assumed that the methodology for reporting MARD would be standardised …

    I wonder what a properly controlled head to head comparison would show. And what a medical statistician would make of this!

    With best wishes,

    Ian

  2. I personally think they need to move away from using a venous blood reference. In the end, the glucose is consumed in the cells, which are not in the bloodstream. So in terms of physical consequences of low glucose, the ISF is where it’s having the effect. The lag in glucose between blood and ISF means you’re effectively getting misleading information about accuracy, which would make for a larger MARD. This could mask the true accuracy of a device.

    If they could devise a reference using ISF (not sure how that would even be collected), then they would remove the lag issue, and get a much more accurate picture of the accuracy of the actual device. Otherwise I believe they should evaluate the actual lag of a device in a subject and actually adjust for that.

    Leaving the effect of lag in gives a false picture of inaccuracy, targeting a reference that doesn’t even have the effect we’re interested in.

    Yes, adjusting it will mean the MARD looks better when compared to a blood reference. My question is: is this meaningful? We’re fixated on blood, but the ISF is where the glucose is consumed, and where lack of glucose causes clinical symptoms of hypo- or hyperglycaemia. I imagine there would be variability in ISF glucose in different parts of the body as well.

    I’d like to see numbers for only static glucose (so no lag effect), and different MARD values quoted in different BGL ranges. Hypo I’m more interested in than hyper. I don’t want excellent hyper accuracy to mask issues in the hypo range, because we are conditioned to accept a MARD of 9% (which is actually awful). 9% is big enough to hide a multitude of sins.

    • If you review the studies, most of them do adjust for the lag, so this shouldn’t really be considered an issue. Again, the Dexcom study went further and applied a mechanism to ensure that the venous blood approximated capillary tests.

      One of the key issues identified in the paper in Diabetes and Vascular Disease Research is the difficulty with extracting ISF to use a YSI on it.

      Although I haven’t quoted them, there are multiple studies that discuss MARD at varying glucose levels. Definitely worth a dig.

      • You state that the Libre study had “No Blood glucose manipulation when in clinic”.

        However, your link says participants “underwent supervised glycemic manipulation during in-clinic sessions”. Why is this not similar to Dexcom?

        In the FreeStyle Libre 3 study that got 7.6% MARD in Adults, we can see that this was achieved because “No glycemic challenges were performed during these sessions”

        https://diabetesjournals.org/diabetes/article/71/Supplement_1/76-LB/145805/76-LB-Performance-of-FreeStyle-Libre-3-System?searchresult=1

        Dexcom needs to be called out for subtracting the 5 minute interval between transmissions from their lag measurements. They realign the data so that they are comparing the lag from the time of the transmission and not including any of the lengthy interval between transmissions which in the real world make the lag much worse.

    • Your brain needs glucose, and it gets it from your blood – not your interstitial fluid. It’s extremely power hungry – using up to 25% of your total energy consumption. But it’s worse than that. Muscles can use a lot of energy when in use, but they have their own energy storage like a battery so they don’t necessarily need to draw on glucose.

      Blood glucose levels are a much more important indicator than interstitial fluid levels.

  3. One thing that needs to be included in the studies is the use of the same glucose meter. Too often glucose meters are considered the end-all of trustworthiness. Problem is, there is a great deal of inaccuracy in home meters. As an example, I cannot use the One Touch meters (the ones covered by my health plan), as they have been shown time and time again to read higher than actual on people with chronic anemia. That in itself is an issue.

    • Most studies don’t use home meter readings to provide an accuracy figure.

      Those that do, tend to do so as an adjunct to venous blood, and normally mandate a single meter is used for the entire study by all participants.

  4. Interesting, and I’m surprised how well the Medtrum appeared to perform. I had one a couple of years ago and it was not accurate. I ended up running it and a Libre simultaneously and comparing the results with the One-Touch finger prick strip for reference. This happened because the Medtrum was giving me low alerts which did not match how I felt or what showed on the blood tests. There were differences between the Libre and bloods, but fewer that caused concern than with the Medtrum.

    • That’s the benefit of a “well designed accuracy study”. My experience of Medtrum was equally poor when I used it in a side by side test. I ended up with a MARD Vs capillary of around 19%.
