Characterization of case-based classification repeatability and of the variability of operating points can complement measures of classification performance in artificial intelligence/computer-aided diagnosis (AI/CADx). Building upon our previous work in this area using human-engineered radiomic features extracted from dynamic contrast-enhanced magnetic resonance (DCE-MR) images, we investigated the application of these methods to features extracted from pretrained convolutional neural networks via deep transfer learning. The second post-contrast DCE-MR images of 601 unique breast lesions (194 benign, 407 malignant) were cropped and resized for input into a VGG-19 network pretrained on ImageNet. Features were extracted from the five max-pool layers and average-pooled over the spatial dimensions, yielding 1,472 features per lesion. The assignment of cases to training and test sets was varied using a 1000-iteration 0.632 bootstrap. Using a random forest classifier, we investigated (i) overall performance in distinguishing malignant from benign cases, measured by the area under the receiver operating characteristic curve (AUC) with the 0.632+ bootstrap correction; (ii) case-based classification repeatability, measured by repeatability profiles, which track the 95% confidence interval (CI) of classifier output across its range; and (iii) attainment of ‘preferred’ (95%) or ‘optimal’ target sensitivity and specificity. The AUC (median [95% CI]) was 0.862 [0.806, 0.899]. The repeatability profile and the attained sensitivity and specificity were similar to previous results obtained with human-engineered radiomic features for both the ‘preferred’ and ‘optimal’ targets. These results demonstrate that these methods can complement AI/CADx model assessment when deep transfer learning features are used.
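The feature-pooling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: simulated arrays stand in for real VGG-19 activations, and the channel counts are the standard output widths of the five VGG-19 max-pool layers (64, 128, 256, 512, 512), whose sum accounts for the 1,472 features per lesion.

```python
import numpy as np

# Standard channel counts of the five VGG-19 max-pool layers (assumption
# based on the VGG-19 architecture; not taken from the abstract itself).
POOL_CHANNELS = [64, 128, 256, 512, 512]

def pool_features(feature_maps):
    """Average-pool each (H, W, C) feature map over its spatial dimensions
    to a length-C vector, then concatenate across layers."""
    return np.concatenate([fm.mean(axis=(0, 1)) for fm in feature_maps])

rng = np.random.default_rng(0)
# Simulated activations for one lesion image; spatial size halves per block.
maps = [rng.random((112 // 2**i, 112 // 2**i, c))
        for i, c in enumerate(POOL_CHANNELS)]
features = pool_features(maps)
print(features.shape)  # (1472,) -- one 1,472-dimensional vector per lesion
```

Average pooling collapses each spatial map to a single value per channel, so the feature vector length depends only on the channel counts, not on the input image size.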
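The resampling scheme can likewise be sketched under stated assumptions: synthetic features stand in for the pooled VGG-19 features, the iteration count is reduced from 1000 for brevity, and the 0.632+ AUC correction is omitted. Each iteration draws a bootstrap training sample (the 0.632 bootstrap: N cases sampled with replacement, out-of-bag cases forming the test set), fits a random forest, and records per-case classifier outputs; the spread of those outputs across iterations is the raw material for a repeatability profile.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_cases, n_features = 200, 20
X = rng.normal(size=(n_cases, n_features))
# Synthetic labels correlated with one feature (stand-in for benign/malignant).
y = (X[:, 0] + 0.5 * rng.normal(size=n_cases) > 0).astype(int)

n_iter = 25  # the study used 1000 iterations
outputs = np.full((n_iter, n_cases), np.nan)  # output when a case is out-of-bag
aucs = []
for i in range(n_iter):
    train = rng.integers(0, n_cases, size=n_cases)   # bootstrap training sample
    test = np.setdiff1d(np.arange(n_cases), train)   # out-of-bag test cases
    clf = RandomForestClassifier(n_estimators=50, random_state=i)
    clf.fit(X[train], y[train])
    p = clf.predict_proba(X[test])[:, 1]
    outputs[i, test] = p
    aucs.append(roc_auc_score(y[test], p))

# Per-case 95% interval of classifier output across iterations; the
# repeatability profile summarizes these widths across the output range.
lo, hi = np.nanpercentile(outputs, [2.5, 97.5], axis=0)
print(f"median test-set AUC: {np.median(aucs):.3f}")
print(f"mean per-case 95% interval width: {np.nanmean(hi - lo):.3f}")
```

Because each case lands out-of-bag in roughly 36.8% of iterations, every case accumulates multiple test-set outputs, which is what makes the per-case confidence intervals estimable.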