Europe PMC


Abstract
Background

Evaluating AI-based segmentation models relies primarily on quantitative metrics, but it remains unclear whether this approach leads to practical, clinically applicable tools.

Purpose

To create a systematic framework for evaluating the performance of segmentation models using clinically relevant criteria.

Materials and Methods

We developed the AUGMENT framework (Assessing Utility of seGMENtation Tools), based on a structured classification of the main categories of error in segmentation tasks. To evaluate the framework, we assembled a team of 20 clinicians covering a broad range of radiological expertise and analysed the challenging task of segmenting metastatic ovarian cancer using AI. We used three evaluation methods: (i) the Dice Similarity Coefficient (DSC); (ii) a visual Turing test, assessing 429 segmented disease sites on 80 CT scans from the Cancer Imaging Atlas; and (iii) the AUGMENT framework, in which three radiologists and the AI model created segmentations of 784 separate disease sites on 27 CT scans from a multi-institution dataset.
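For reference, the DSC used in evaluation method (i) measures the overlap between a predicted and a ground-truth binary mask. A minimal NumPy sketch is shown below; the paper's exact computation (e.g. how scores are aggregated across disease sites and scans) is not specified in this abstract, so the function name and the handling of empty masks are illustrative assumptions.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary segmentation masks.

    DSC = 2 * |A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1
    (perfect overlap).
    """
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:
        # Both masks empty: conventionally treated as perfect agreement
        # (an assumption; conventions vary across studies).
        return 1.0
    return 2.0 * intersection / total
```

For example, two masks that each label two voxels but share only one would score 2·1/(2+2) = 0.5.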

Results

The AI model had modest technical performance (DSC = 72±19 for pelvic and ovarian disease and 64±24 for omental disease) and failed the visual Turing test. However, the AUGMENT framework revealed that (i) the AI model produced segmentations of the same quality as those of radiologists (p = .46), and (ii) it enabled radiologists to produce human+AI collaborative segmentations of significantly higher quality (p < .001) in significantly less time (p < .001).

Conclusion

Quantitative performance metrics of segmentation algorithms can mask their clinical utility. The AUGMENT framework enables the systematic identification of clinically usable AI models and highlights the importance of assessing the interaction between AI tools and radiologists.

Summary statement

Our framework, called AUGMENT, provides an objective assessment of the clinical utility of segmentation algorithms based on well-established error categories.

Key results

Combining quantitative metrics with qualitative performance feedback from the domain experts whose work an algorithm affects is a more accurate, transparent and trustworthy way of appraising it than quantitative metrics alone. The AUGMENT framework captures clinical utility in terms of segmentation quality and human+AI complementarity, even for algorithms with modest technical segmentation performance. AUGMENT may be useful during model development and validation, including in segmentation challenges, for those seeking clinical translation, and for auditing model performance after integration into clinical practice.
