Battle Royale – Freestyle Libre 3 and Dexcom G7 face off – The results


As previously discussed in this article, I have been wearing the Libre 3, Dexcom G6, and G7 to compare performance across all three.

These are the results, laid out as a Consensus Error Grid for each device (plus one for all three combined), followed by tables covering the bias, MARD vs fingersticks (MARDf), 20/20 performance vs fingersticks, and hypo detection.

If you don’t want to read the details, the short version is that the Libre 3 proved to be the best performer of the three, with the best results on nearly all of the metrics in this n=1 experiment.

First, a reminder of the approach.

Approach

The approach taken is outlined here, with one change: the 20/20 measure uses 100mg/dl as the cut-off point, rather than the 80mg/dl specified in the text. It’s worth bearing in mind that the Dexcom G7 in use here is the version sourced from Europe, with the updated algorithm that was used for the accuracy study that can be found here.

To gather the reference data, there were some 78 fingerpricks during the G6/G7 life and 98 for the Libre 3.

As always, this is n=1 and the results should be taken that way.

Consensus Error Grids

Below are the consensus error grids for the three devices, plus the combined one. It’s pretty clear from these that there were differences between the three sensors. The Dexcom G6 showed the greatest dispersion, while the G7 and Libre 3 were less dispersed than the G6, but appeared to show a slightly positive bias in the G7’s case and a slightly negative one in the Libre 3’s case.

Dexcom G6 Consensus Error Grid
Dexcom G7 Consensus Error Grid
Libre 3 Consensus Error Grid

 

When we look at all three sets of data combined on the same grid, the dispersion and location of the readings is much clearer.

Combined Consensus Error Grid

 

Other datapoints

While the above charts give a pretty clear indication of the relationship between the fingerprick tests and the CGMs, they’re not that easy to compare to headline figures. Remember, the published MARD of the G7 vs a YSI is 8.7% in adults, and the rather questionable MARD of the L3 in their study was 7.9%.

The following tables show the MARDf and Bias measured in this experiment, and the day by day MARDf, to give an indication of how things progressed over the life of the sensors.

Now you’ll note that in this case, there are two MARDf and Bias values given: a total value and a post-calibration value. This is because, during the early life of both the G6 and G7, I found significant variation between the CGM readings and the fingerprick tests. The fingerprick meters had both been tested with control solution, so calibration of the CGMs was required. Judged by the number of readings that were more than 20% away from the blood tests, this took place on day 4 with the G7 and day 3 with the G6. Both the G6 and G7 needed calibrating, as they were producing values that were a long way from the blood values.

Other datapoints of interest relate to hypo detection and accuracy. Here we have three additional data sets. The first is the 20/20 percentage, which shows the percentage of values in the dataset that were within 20mg/dl of the reference when the reference was less than 100mg/dl, and within 20% when it was greater than 100mg/dl. There are also “hypo count” metrics, which show the percentage of CGM readings that weren’t hypo when blood showed they should have been, and the percentage of CGM readings that were hypo when blood said otherwise.
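For anyone wanting to reproduce these metrics on their own paired (fingerstick, CGM) data, here is a minimal sketch of how they can be computed. The function names, the 70mg/dl hypo threshold, and the choice to express false hypos as a percentage of CGM hypo readings are my own assumptions, not taken from the tooling used here.

```python
# Sketch of the metrics described above, for paired (reference, cgm) readings in mg/dl.
# All names and the hypo threshold are illustrative assumptions.

HYPO_MGDL = 70  # assumed hypoglycaemia threshold; the article does not state the exact value


def mardf(pairs):
    """Mean Absolute Relative Difference vs fingersticks, as a percentage."""
    return 100 * sum(abs(cgm - ref) / ref for ref, cgm in pairs) / len(pairs)


def bias(pairs):
    """Mean signed difference (CGM minus reference), in mg/dl."""
    return sum(cgm - ref for ref, cgm in pairs) / len(pairs)


def pct_20_20(pairs, cutoff=100):
    """Percentage within 20 mg/dl (ref < cutoff) or within 20% (ref >= cutoff)."""
    ok = sum(
        (abs(cgm - ref) <= 20) if ref < cutoff else (abs(cgm - ref) <= 0.2 * ref)
        for ref, cgm in pairs
    )
    return 100 * ok / len(pairs)


def missed_hypos(pairs, threshold=HYPO_MGDL):
    """Percentage of reference (blood) hypos that the CGM did not show as hypo."""
    hypos = [(ref, cgm) for ref, cgm in pairs if ref < threshold]
    if not hypos:
        return 0.0
    return 100 * sum(cgm >= threshold for _, cgm in hypos) / len(hypos)


def false_hypos(pairs, threshold=HYPO_MGDL):
    """Percentage of CGM hypo readings where blood said otherwise."""
    flagged = [(ref, cgm) for ref, cgm in pairs if cgm < threshold]
    if not flagged:
        return 0.0
    return 100 * sum(ref >= threshold for ref, _ in flagged) / len(flagged)
```

With a handful of paired readings this reproduces the kind of table shown below; the real analysis obviously used far more pairs per sensor.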

20/20 Comparisons
Correct Hypo detection
Incorrect Hypo detection

What we can see from these tables is that the Libre 3 was more consistent and had considerably fewer datapoints that fell outside the 20/20 rule. Again, this is n=1, and the G7’s 12.8% is some way off the value seen in the accuracy study referenced earlier.

While the percentage of hypos not detected looks high for the G6 and G7 (at 27.3%), it was better than the outcome from the G6 in the “Six of the best” experiment. 

The final table that we have shows the number of times that the different systems incorrectly called “Hypo” when blood said otherwise. The G6 and L3 were tied on this one, while the G7 never made that mistake. To some extent, the wider dispersion of the G6 data and the slightly negative bias that the consensus error grid appears to show for the Libre 3 seem to support these results. 

Any other observations?

One of the complaints from users of the G7 has been that the glucose data is more jumpy than they expected. Part of the reason for this is that the G6 and G7 software no longer backsmooths the historic data points, but there were occasions where I saw noticeable differences between the G6 and G7. There were times when the G7 data did appear to be quite jumpy, which wasn’t what I expected to see. However, before drawing too many conclusions about this, I intend to run another G7 to see whether this is consistent or simply an effect of the backsmoothing change.

What is your takeaway from this n=1 experiment?

I was surprised at the performance of the Libre 3. It produced a performance that was consistent with older versions of the Libre, in that it tended to measure on the lower side, but it was a lot closer to blood values than I expected. The data in this n=1 experiment suggested that it was the closest to fingerpricks without calibration. I’m interested to see whether that carries on across multiple sensors, as I know others have seen a much wider dispersion of values when compared to fingersticks.

The G7 was much more disappointing than I expected it to be. For what was supposed to be a more accurate version of the algorithm, the performance over the first four days was awful, and indeed, without calibration, I wonder what the results might have been. When I test the next G7, it will be done without any calibration.

Ultimately, for me, the key takeaway from this experiment is that anyone using the G7 should check the performance over the first few days, and if necessary, calibrate the sensor. It was an outcome that I was not expecting.

27 Comments

  1. The reader is low cost, but middlemen regional suppliers charge very high prices for sensors. Medicare cuts payments back a lot, but the middlemen still charge nearly double. It would help diabetics if Medicare would allow simple use of a local pharmacy to get sensors, not regional suppliers!

  2. A very useful comparison!
    Also something extremely useful but not mentioned about the data.
    Most of us are able to make further remarkable improvements to those figures using a very simple method.

    The technique is fully described in the early sensor patents (actually owned by Dexcom)

    Preinsertion.

    Just look how both sets of data are improved if the first two days are eliminated.
    The patents clearly described how insertion trauma caused the sensor to be coated with macrophage cells which disrupted the readings.
    So ignore the first two days and what MARD do you then get?

    With Libre you do not even lose the operational days; simply insert, leave for two days, then start the sensor (and its clock). So no extra cost. You still get 14 days of data from the system.

    It is easy to extrapolate the day 12, 13, 14 data given in your figures to approximate a day 15, 16 (from insertion).

    An operational technique perhaps not limited to Libre.
    I will try to append the patent references below as a further comment.

    • The problem with pre-insertion, as you understand with the Dexcom G7, is that you lose two days of life on the device, which comes at a cost, either to the healthcare system or to yourself.

      With Libre 3, you have the same problem with starting two days late that you do when restarting a G6, or pre-inserting it. The calibration algorithm for the first couple of days after starting is designed to take some account of the insertion issues within those first couple of days, so if the sensor has already settled, you still get results that are out of line with the equivalent fingerpricks. It’s not as if it’s using the steady state calibration.

  3. Thank you for this.

    It’s such a shame that if you are a Type 1 diabetic and live in the UK, you are unlikely to be able to access the product.

    I wonder, did any of the L3 or G7 sensors fail?

    And did you test whether the high and low glucose alarms worked? I know there are currently user problems with Libre 2 alarms not going off after some updates to Abbott’s software app on Android.

    • I can confirm that the low alarms worked on both devices, enough to need to disable them as best I could.

      No issues with sensor longevity in this test. When a sensor fails, I normally record it, so I was pleased that the L3 and G7 both survived their full lives. It’s worth a note that the L3 was still very firmly stuck down when I finally had to take it off.

  4. I trialled the L3 for my local NHS trust and found that the L3 was more accurate than the G6 (no G7 available), but the app was appalling. I had frequent lost connections and the app just crashing; it was so bad that I’d only move if the app was massively upgraded.

    • It’s interesting. I was running the app on a Samsung S22 Ultra and had no noticeable issues with the connectivity.

      I think it’s very phone related.

      What I didn’t mention in the review was that I ran the Dexcom G6, G7 and Libre3 all on the same phone, with AndroidAPS alongside it talking Bluetooth to my pump every five minutes. Bluetooth was solid throughout, suggesting that phone and OS play a big part in the issues people see.

    • Regarding those who found trouble with the Libre phone app: there are those who feel there are several reasons to use the Libre reader, not the phone app, to read sensors. My own concern about the FreeStyle Libre is the Medicare rule: if you want Medicare coverage, you have to buy sensors from middlemen charging higher prices for the sensor, but adding little value.

        • I have used the Libre 14-day reader since it became available. I have assumed, without testing, that the reader will read all 14-day sensors.

          • It won’t. The Libre 1 sensor doesn’t work with Libre 2, and the Libre 3 doesn’t have any flash readings.

        • If by receiver you mean reader, a Libre 3 reader is available; it’s advertised on the German FreeStyle Libre site.

          • Great. I assume one must be available in other territories as well. I’ve not seen them advertised in the UK yet.

    • I think the Libre 3’s Bluetooth connection is much better than the Dexcom G7’s at the hardware level. I also use DiaBox to connect to the Libre 3, and the data capture rate is almost 100% (of 1,440 readings per day, DiaBox captured about 1,435 or so).

  5. Your result that the Dexcom G7 misses 27% of lows is consistent with Dexcom’s own data. Dexcom claim it fails to alert for 24.2% to 26.7% of lows below 3.1 when waiting 15 minutes after measurement – even when using Urgent Low Soon alerts.

    page 379 – https://www.liebertpub.com/doi/pdf/10.1089/dia.2022.0011

    I am surprised you still call the Libre 3’s MARD of 7.9% “questionable” when your own result was 7.2%.

    With the Libre 3 having such good results all the way to the end of its 14 day life, I wonder if they will go to 16 days – cutting costs by 3 units per year ~ 12.5%.

    • I think it’s okay to call it questionable when the method used is entirely different to that of its predecessor and the other leading CGM. I’m also very aware this is n=1, and many others haven’t seen similar agreement with fingerpricks.

      Given the cost of regulatory approval and the reduction in revenue associated with a longer-life sensor, I’d be very surprised if it generated a 12.5% saving to the consumer…

    • I saw a YouTube video, months ago, claiming you can get two added days of use from the Libre 14-day sensor if you use a phone app and not the intended Abbott reader. The video claims the 14-day limit is in the reader, not the sensor, so use the phone app; if true, it can save $$. Also, for those using Medicare insurance for sensors, it blocks use of a local pharmacy; those buying with Medicare coverage are required to buy from more expensive regional suppliers.

  6. It’s all very well showing lots of graphs and using technical acronyms and words, but I think the majority of people reading this faceoff will be quite mystified. I am pretty intelligent, with quite high technical qualifications, and was confused by quite a lot of it. I also use mmol/l, which meant converting the figures in my head. What I, like most people, want to know is which manufacturer’s sensors have the best consecutive, accurate readings, and as this was a one-off study it cannot show that information. It does of course give an insight into the sensors used in the study, including one version of the apps used. Also, from what I understood, the study was all completed with insulin pumps. I don’t understand how you calculated the errors, as I could see no BG readings to compare them to. I realise that to do a full study would be expensive and time consuming, so I thank you for the information given, as it does lead me to realise some of the differences between sensor accuracies.

    • Hi Mike, sorry you feel that way. The consensus error grids show the relationship between a blood glucose reading (the reference data) and a sensor reading. If they were all 100% the same, they’d fall on the diagonal line; however, many don’t, and what you’re looking for is how far away they fall and whether they tend to fall mostly higher or lower than the line. It doesn’t really matter what the units are, as the variation would be the same.

      Similarly, the MARD value is an industry standard for describing “accuracy” of a sensor, but is subject to the variations across different studies. What I attempt to do with these is compare multiple sensors tested in the same way, to see how they compare. Again, this is a normalised value, in percent, so doesn’t require the underlying glucose units.

      No study shows which sensors have the best consecutive, accurate readings, as it would require blood readings every five minutes, and even the professionally undertaken venous blood comparisons limit it to every 15 minutes.

      Whether you use an insulin pump, a pen or a needle is also irrelevant, as there can be high variation caused by all three (indeed, the study included both occlusions on the pump leading to highs and a medical procedure resulting in larger numbers of low levels than normal).

      Really all this is attempting to show is comparable accuracy, bias and likelihood of detecting lows, using the same approach across all three sensors, which I hope it does.

    • The main “concern” about this study is that it is n=1. I have closely followed Libre comparisons for the 6 years that I have used it, and have been admin of a number of 10,000-plus international and country Facebook groups. In these, the vast majority of comparisons are simply n=1, but the great difference with this one is the EVIDENCE. I fully appreciate the difficulty of presenting that in a way that many users can follow, but well done on this trial.

      The evidence is good, still n=1 but the result is Clear and worth so much more than so many reports that are without substantial evidence and little more than opinions.

      Abbott have trialed both 21 day and 28 day sensors. I have done 21 day. I found it very successful. I suspect that the main reason that Abbott do not extend the lifetime is the inevitable (but I believe small) increase in adhesion failures. Abbott will not wish to increase their adhesive failure rate. The FSL3 does go some way to address that by reducing the effect of user application.

      I do look forward to more trial results both independent and by Abbott with larger numbers: I feel sure they are being done but take time to complete: look how long the original FSL1 experiments took to report.

  7. Thank you for your reply. I apologise if you think I was being derogatory about your study; it wasn’t meant to be read that way. I expect people with a better understanding of the data you have collected will get more benefit from it. As I said, it does give me an insight into differences between the sensors in terms of accuracy. I have found, after reading it again, that more of the data is sinking in. Unfortunately, I need to read some text a few times to benefit from it, as I am slightly dyslexic.

  8. Hi Tim. Thank you for this. Were there any concerns about using a fingerstick meter for reference? My understanding is that the 20% error range isn’t technically from a blood meter, it’s from a true LAB VERIFIED sugar.
    While this study is 6 years old, I’m not sure much has changed. You can see that the author found some meters consistently reading 30% differently from each other from the same drop! And, in one instance, meters reading 40% apart from the same drop. But why would that be surprising? Both are potentially within 20% of the “true” sugar, so is either “wrong”?
    TLDR: Why is the fingerstick meter the reference when there’s no indication of that meter’s accuracy (setting aside any issues related to unclean hands, different drop sizes, etc) or inherent bias (high/low)? Apologies if this is laid out elsewhere. Here’s the link to the study: https://medium.com/@chrishannemann/measure-seventy-five-times-cut-once-further-blood-glucose-meter-testing-9e769a853710

  9. Hi Jeremy, this is all where it gets a bit complex.

    The MARD used in studies is taken from venous blood extracted during clinic sessions and run through a YSI analyser. It’s worth bearing in mind that it normally goes through more than one, and an average is taken, as two similarly calibrated devices don’t necessarily give the same number.

    I tend to refer to the Diabetes Technology Society Blood Glucose Meter surveillance study, which has taken a look at a large number of meters with many samples and determined roughly how they fare. The details are here: https://www.diabetestechnology.org/surveillance.shtml

    It’s probably the most thorough review of available tech from 2018. Only a few meters were found to be within tolerance, and I deliberately use the Contour Next system as the one they found came top.

    I also use test fluid to ensure the readings are within tolerance on the test strips I’m using.

    So in short, yes, I’m aware of that, and I use the Contour Next as a result of the DTS study, which has shown high correlation with laboratory sampling, so I’m very confident that the results are consistent on the same meter and useable as a reference for this type of study.

    • Thank you for the response. And I want to make clear, I appreciate the time and effort. This is super hard work outside of a lab, and you’re providing valuable data…but I still have concerns.

      1. While the Contour Next is within 15% 100% of the time in the study attached, it’s “only” within 5% 68% of the time. I don’t think it’s realistic to do better than that, but it’s important to note.

      2. Crucially, that can be 15% in either direction! I’m sure you know this, but many, many diabetics do not. So what this means is that if the lab value in the study is 100, two readings on the Contour Next could read 115 and 85…BOTH would be accurate!

      3. Here’s where I’m concerned about your results: In the above example, if the Contour read 85 (choosing that because its bias is below the true lab value) and the G7 shows 110 or 115, that would be a “failure” in your test…but it would be “accurate” under the parameters of the Blood Glucose Surveillance Article you linked to. This is my fundamental concern with using any fingerstick meter to test the accuracy of a CGM.

      4. The bias of the Contour next is -1.2%, while the G7 is apparently positive…but it must be noted that you are using a “negatively biased” meter to make that determination.

      5. To further complicate matters, while the Contour Next bias is -1.2% on average, no one knows what happens on any individual reading. And this is the ultimate issue. Meters have enormous error margins on any individual reading! Over time and larger sample sizes, those largely even out…but still only 68% of all readings are within 5% of the true lab-verified glucose level.

      My TLDR: Even the best meters have accuracy issues on any given reading. So can we truly measure the accuracy of a CGM using a fingerstick meter as the baseline? Or is that baseline so inherently inaccurate it’s not really telling us much? I don’t have an answer, but I have concerns.

      Thanks again!

      • Your concerns are valid, but there’s also another aspect to this, which isn’t “accuracy” per se, but comparison to fingerpricks from a dosing perspective.

        If we don’t use CGM, then we’re dosing from SMBG, so comparing a CGM that’s approved for dosing with a meter that’s also approved for dosing is, I think, a valid comparison.

        What’s perhaps most interesting about the Contour data is that while 68% of the values are within +/-5%, ~95% are within +/-10%. So to put it another way, 2/3rds of the values you get are likely to be within 5%, a little over 1/4 between 5% and 10%, and then 1/20 will be more than 10% out.

        So it’s not just that there’s a range within which the values could fall, but also that there’s a likelihood of them being in that range.

        The reason for doing many, many readings is obviously to try and manage the risk of individual readings having different variances. Given the statistics around variance and the number of fingerpricks undertaken, on any given day you could expect 5 to be within 5%, 2 to be between 5% and 10%, and the remainder to be further out.
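        As a rough illustration of that arithmetic (my own sketch; the figure of roughly eight fingerpricks per day is an assumption, based on ~78 pricks over the G6/G7 life):

```python
# Rough check of the per-day expectations, assuming ~8 fingerpricks per day
# and the quoted Contour Next distribution (~68% within 5%, ~95% within 10%).

readings_per_day = 8  # assumption: ~78 fingerpricks over a roughly 10-day sensor life

p_within_5 = 0.68
p_5_to_10 = 0.95 - 0.68   # ~27% land between 5% and 10% out
p_beyond_10 = 1 - 0.95    # ~5% land more than 10% out

expected = {
    "within 5%": readings_per_day * p_within_5,           # roughly 5 readings
    "between 5% and 10%": readings_per_day * p_5_to_10,   # roughly 2 readings
    "more than 10% out": readings_per_day * p_beyond_10,  # well under 1 reading
}
```

        Which is where the "5 within 5%, 2 between 5% and 10%" expectation comes from.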

        The other thing worth bearing in mind is that the instruments used to generate clinical MARD values also have a tolerance of around +/- 5%, so however we look at this, accuracy is a moving target.

        So what these studies are attempting to show is that if you’re moving off SMBG on to CGM, this is the difference between CGM and the leading fingerprick tester, and that if you are dosing off a device, or expecting hypo detection, this is what the differences look like.

        And if something is reading significantly higher or lower than a source with a well observed statistical distribution, then maybe you have a right to be concerned.

        • Right. This compares CGMs to fingerpricks from a Contour Next meter from a dosing perspective, not to fingerpricks in general, and it does make a difference. And I do understand you’re not claiming anything different.
          Here’s why it’s important: The G7 has a post-calibration bias of +8%. But look at the chart in the link I provided. The OneTouch Ultra2 (not as accurate as the Contour, but still on the market) has a +12% bias from the Contour Next on the same drop of blood. So on average (I think this is right), if the Contour reads 100, the G7 reads 108 and the Ultra2 reads 112. I understand that, based on MARD readings, the Contour is probably closest to “right” (across a large sample size), but if the goal of this study is comparing the G7 to a fingerstick meter, I do think it’s important to mention glucose meter inconsistencies and their own biases even when reading the same drop of blood. The G7 does read higher than the Contour, but (crucially) it may read lower than an Ultra2!
          My take on all of this: in a carefully controlled environment, finger sticks may be more accurate on a given reading. But they may not be, particularly if a person doesn’t wash their hands and make sure the strips aren’t too hot or cold, the amount of blood is sufficient, etc. But given the human element, I think CGMs are more than accurate enough. Then add in the fact that they’re providing readings every 5 minutes, and I think the benefits far outweigh any minor accuracy issues (particularly because those issues even out over thousands of readings, but may not for any one reading).
          I’ve been on a Dexcom since 2009 and used it for dosing nearly the entire time…so I may be a bit biased 😂 I do appreciate this data and understand it’s completely impossible to do a comparison of multiple CGMs and multiple meters. Thanks again.
