Following on from the results of the nine sensor samba, I went on to produce the Consensus Error Grids (CEGs). These provide a view of how the datapoints were distributed throughout the period of the experiment compared to the reference fingerpricks, and also provide a mechanism by which we can assess how safe these sensors are for insulin dosing decisions and for use with Automated Insulin Delivery (AID) systems.
Interpreting Consensus Error Grids
Before we dive into the details of the CEGs that I produced, let’s take a brief look at why we might want to examine them in the first place. The key details are:
- CEGs are split into multiple zones, A-E.
- Each zone represents a different level of clinical risk, with A being essentially zero and E being very high risk.
- Ideally, any CGM would have as many of its points as possible in zone A. Whilst zone B is perhaps not as risky as zones C-E, points there may still carry hypo risks.
- When using an AID system, decisions made on data in zone B, especially towards the boundary with zone C, are potentially higher risk, as they may result in adverse actions.
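Mechanically, assigning a reading to a zone is just point-in-polygon classification of the paired (reference, sensor) values. The sketch below shows the idea; the polygon vertices in it are crude placeholders rather than the published Consensus Error Grid boundaries, so treat it as an illustration of the mechanism only.

```python
# A minimal sketch of bucketing paired (reference, sensor) readings into
# error-grid zones. The polygons are crude placeholders, NOT the published
# Consensus Error Grid boundaries.
from matplotlib.path import Path

# Placeholder zone polygons in (reference mg/dL, sensor mg/dL) space,
# ordered from lowest to highest risk. Swap in the published CEG vertices.
ZONES = {
    "A": Path([(0, 0), (450, 540), (450, 375), (0, 0)]),  # roughly a +/-20% cone (placeholder)
    "B": Path([(0, 0), (450, 675), (450, 300), (0, 0)]),  # a wider cone (placeholder)
}

def classify(reference: float, sensor: float) -> str:
    """Return the first (lowest-risk) placeholder zone containing the point."""
    for zone, polygon in ZONES.items():
        if polygon.contains_point((reference, sensor)):
            return zone
    return "C-E"  # outside the placeholder polygons

# Example paired readings: (fingerprick, CGM) in mg/dL.
pairs = [(95, 101), (60, 112), (180, 174)]
print([classify(r, s) for r, s in pairs])  # e.g. ['A', 'C-E', 'A']
```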
With this in mind, I’ll delve into all the error grids, then take a look at the split of data points for each sensor and what I think this means. I’ll also look at the cost of each sensor and see if there is any correlation between that and the test outcomes.
The error grids
As we have nine sensors, there are nine error grids to review. We’ll start with the “big four” (Dexcoms and Libres), then move on to the newcomers.









Now, there’s a great deal of data on display in these grids. Looking at them pictorially, it’s fairly obvious which of the sensors had the highest concentration within zone A, and whilst the Dexcom G7 and Libres both scored very well here, the Syai Tag also appears to align with them. Perhaps unusually, in this test the Dexcom G6/One has a very wide dispersion, substantially wider than I’ve seen in other tests. The image below is representative of what has been more common for the G6/One, but it’s worth highlighting that the sensor used to produce that chart was calibrated.

Distribution of points in Consensus Error Grids
Whilst this is a visual way to view the performance of each of these sensors, we also care about the percentage of points in each zone. The table below provides this information.

What this data shows is that in reality, none of these sensors is really “poor”. Across all of the sensors used, there was one data point in Zone C, which somewhat ironically was also on the sensor that had the highest proportion of points in Zone A.
To give a visual representation of the above data, the stacked chart below shows just how close they all are.

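If you want to reproduce the zone split and the stacked chart from your own paired readings, something along these lines would do it. The DataFrame columns and rows here are illustrative assumptions, not the trial data.

```python
# A minimal sketch of a per-sensor zone split and stacked chart, assuming a
# DataFrame with one row per paired reading and columns "sensor" and "zone".
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "sensor": ["Dexcom G7", "Dexcom G7", "Libre 3+", "Libre 3+", "Syai Tag"],
    "zone":   ["A",         "B",         "A",        "A",        "A"],
})  # illustrative rows only

# Percentage of points falling in each zone, per sensor.
pct = (pd.crosstab(df["sensor"], df["zone"], normalize="index") * 100).round(1)
print(pct)

# Stacked bar chart of the same table.
pct.plot(kind="bar", stacked=True)
plt.ylabel("% of paired readings")
plt.tight_layout()
plt.show()
```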
To give this some context, if we look back at the original Freestyle Libre accuracy study from 2015 (which I covered here), it showed 99.7% of all readings in zones A and B compared to capillary blood. Pretty much all the sensors tried in 2025 are producing numbers that don’t seem to be any worse than the Libre we had in 2015.

What do we take from the error grid analysis?
The error grid analysis in this n=1 suggests that, even without calibrating any of the sensors, most of them demonstrate numbers that are nominally considered safe, and don’t seem to be any worse than sensors from ten years ago (although I’d question how much difference there really is between a Libre from 2015 and a Libre 3+).
The key takeaway is that, realistically, all the sensors showed CEG data that fell within either the clinically zero-risk or extremely low-risk zones, and all except the Libre sensors support calibration. What’s perhaps a bit more nuanced in the CEG data is the number of datapoints trending towards zone C. In the Yuwell, iCan and Linx, we see a number of points very close to the border with zone C. More importantly, they show the sensor reading high when the reference is either in the low range or close to it. The misaligned datapoint in zone C for the Dexcom G7/One+ was the other way around: it showed a low sensor reading against a higher reference glucose, which is less likely to lead to excessive insulin delivery.
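That direction-of-error check is easy to run yourself on paired readings. The sketch below flags readings where the sensor is high while the reference is low or close to low; the thresholds and values in it are illustrative assumptions, not the ones used in this write-up.

```python
# A minimal sketch of flagging paired readings where the sensor reads high
# while the reference is low or close to low. Thresholds and rows are
# illustrative assumptions, not the trial's values.
import pandas as pd

df = pd.DataFrame({
    "sensor":    ["Yuwell", "iCan", "Linx", "Dexcom G7/One+"],
    "reference": [68, 75, 72, 160],    # fingerprick, mg/dL (made up)
    "cgm":       [104, 118, 99, 92],   # sensor reading, mg/dL (made up)
})

LOW_CUTOFF = 80          # reference "low or close to it", mg/dL (assumed)
OVER_READ_FACTOR = 1.3   # sensor at least 30% above reference (assumed)

risky_high = df[(df["reference"] <= LOW_CUTOFF) &
                (df["cgm"] >= OVER_READ_FACTOR * df["reference"])]
print(risky_high)  # the points most likely to drive excess insulin delivery
```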
Ultimately, what the CEGs, MARD numbers, Bias data and hypoglycaemia analysis all show is that no single metric is really good enough to describe the performance of a CGM system. They are complex technology operating within a complex system, and while single datapoints make for good marketing, they don’t make for useful information in relation to system performance.
How do these values correlate with cost?
Let’s be blunt about this. Not very well. The table below shows the costs (based on prices in both the UK and Europe, as not all sensors are available in the UK). Is there a direct correlation between price and MARDf, or between price and the percentage of readings in zone A?

There is, but again it’s perhaps not quite what you’d expect. There’s a clear negative correlation between price and MARDf as tested; however, there’s much less of a correlation between price and the percentage of results in zone A. This is heavily influenced by the outcome for the Syai Tag, which, as the cheapest sensor on offer and the one with the lowest MARDf, does skew the results.
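The correlation check itself is trivial to run once you have a per-sensor summary table. The sketch below uses placeholder prices, MARDf values and zone A percentages purely for illustration; they are not the results from this experiment.

```python
# A minimal sketch of the price-vs-performance correlation check. All numbers
# below are placeholders for illustration, not the trial's results.
import pandas as pd

summary = pd.DataFrame({
    "sensor":     ["Dexcom One+", "Libre 3+", "Syai Tag", "Caresens Air", "iCan"],
    "price_gbp":  [40.0, 48.0, 20.0, 35.0, 30.0],
    "mard_f":     [9.5, 8.8, 8.0, 10.2, 11.5],
    "zone_a_pct": [90.1, 91.4, 92.0, 85.3, 82.7],
})

# Pearson correlation of price against each metric.
print(f"price vs MARDf:    {summary['price_gbp'].corr(summary['mard_f']):.2f}")
print(f"price vs zone A %: {summary['price_gbp'].corr(summary['zone_a_pct']):.2f}")
```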
Would you choose to use any of these newcomers with an AID system?
That’s potentially the million dollar question. But I do have an answer to it.
Based on the data I’ve seen in this n=1 experiment I’d feel comfortable adding both the Syai Tag and the Caresens Air to my open source AID system. The other newcomers? Not so much.
Why no cost per day for Dexcom G6/G7?
The Libre 3+ MARD includes young children (2 years old and up); do the other CGM MARDs include young children?
All the Dexcom and Abbott systems support kids from 2, including the One and One+, as far as I’m aware. I haven’t looked at the others, so you’ll be able to check as easily as I can.
And as for why there are no costs for the G6/G7 – I just forgot to include them. The only comment I’d make is that they are considerably higher than those for the One+.
An impressive set of data – just doing this type of assessment with 2 worn CGM devices is onerous (as I know from my own experience).
As you have constructed the plots, would you be able to do a quick linear regression analysis on each and post the equations + R-squared values? Some of the plots look like there are differences in calibration between the different systems. Thanks
I can provide an update.
What I’d note is that all the sensors are shown with their factory calibration intact throughout the trial. There was no user-based calibration at any point in time.
Thus we don’t know what the calibration basis was for any of the sensors.
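For anyone wanting to run that sort of regression on their own paired readings in the meantime, a rough sketch follows. It assumes a DataFrame with columns named sensor, reference and cgm, and the numbers in it are made up for illustration.

```python
# A minimal sketch of a per-sensor linear regression of CGM readings against
# reference fingerpricks, reporting the fit equation and R-squared. Column
# names and values are illustrative assumptions, not the trial data.
import pandas as pd
from scipy.stats import linregress

df = pd.DataFrame({
    "sensor":    ["Syai Tag"] * 4 + ["Dexcom G6/One"] * 4,
    "reference": [62, 95, 150, 240, 62, 95, 150, 240],   # fingerprick, mg/dL
    "cgm":       [65, 99, 147, 251, 80, 90, 170, 210],   # sensor, mg/dL (made up)
})

for name, grp in df.groupby("sensor"):
    fit = linregress(grp["reference"], grp["cgm"])
    print(f"{name}: cgm = {fit.slope:.2f} * ref + {fit.intercept:.1f}, "
          f"R^2 = {fit.rvalue ** 2:.3f}")
```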