Supplementing training with data from a shifted distribution for machine learning classifiers: adding more cases may not always help

KH Cha, A Gossmann, N Petrick… - Medical Imaging 2020 …, 2020 - spiedigitallibrary.org

KH Cha, A Gossmann, N Petrick, B Sahiner

Medical Imaging 2020: Image Perception, Observer Performance, and …, 2020•spiedigitallibrary.org

In this study, we show that when a training data set is supplemented by drawing samples from a distribution that is different from that of the target population, the differences in the distributions of the original and supplemental training populations should be considered to maximize the performance of the classifier in the target population. Depending on these distributions, drawing a large number of cases from the supplemental distribution may result in lower performance compared to limiting the number of added cases. This is relevant for medical images when synthetic data is used for training a machine learning algorithm, which may result in a mixed distribution for the training set. We simulated a two-class classification problem and determined the performance of a linear classifier and a neural network classifier on test cases when trained with cases from only the target distribution, and when cases from a shifted, supplemental distribution are added to a limited number of cases from the target distribution. We show that adding data from a supplemental distribution for machine learning classifier training may improve the performance on the target test distribution. However, given the same number of training cases from a mixed distribution, the performance may not reach the performance of only training on data from the target distribution. In addition, the increase in performance will peak or plateau, depending on the shift in the distribution and the number of cases from the supplemental distribution.
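
The kind of simulation described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' setup. The abstract does not give the distributions, the form of the shift, or the sample sizes, so everything concrete here is an assumption: 2-D unit-variance Gaussian classes, a class separation of 1.5, a "shift" modeled as a rotation of the class-mean difference by `angle`, and the hypothetical function names. Only the linear classifier is covered (a Fisher linear discriminant); the neural network is omitted.

```python
import numpy as np


def sample_two_class(n_per_class, angle, rng):
    """Draw two 2-D unit-variance Gaussian classes (assumed distributions).

    The class-mean difference has length 1.5 and points along `angle`;
    angle=0 plays the role of the target distribution, a nonzero angle
    the shifted, supplemental one.
    """
    direction = np.array([np.cos(angle), np.sin(angle)])
    x0 = rng.normal(size=(n_per_class, 2))
    x1 = rng.normal(size=(n_per_class, 2)) + 1.5 * direction
    x = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return x, y


def fit_linear_discriminant(x, y):
    """Fisher linear discriminant: w = S_pooled^{-1} (mean1 - mean0)."""
    x0, x1 = x[y == 0], x[y == 1]
    pooled = (np.cov(x0, rowvar=False) * (len(x0) - 1)
              + np.cov(x1, rowvar=False) * (len(x1) - 1))
    pooled /= len(x0) + len(x1) - 2
    return np.linalg.solve(pooled, x1.mean(axis=0) - x0.mean(axis=0))


def auc(scores, y):
    """Area under the ROC curve from the Mann-Whitney rank statistic.

    Assumes continuous scores, so ties are not handled.
    """
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n1 = int(y.sum())
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)


def run_experiment(n_target=5, n_supp_list=(0, 10, 100, 1000),
                   angle=np.pi / 4, n_trials=200, seed=0):
    """Mean target-test AUC for each number of supplemental cases per class."""
    rng = np.random.default_rng(seed)
    x_test, y_test = sample_two_class(5000, 0.0, rng)  # target test set
    results = {}
    for n_supp in n_supp_list:
        aucs = []
        for _ in range(n_trials):
            # A limited number of cases from the target distribution ...
            x_tr, y_tr = sample_two_class(n_target, 0.0, rng)
            if n_supp > 0:
                # ... supplemented with cases from the shifted distribution.
                x_s, y_s = sample_two_class(n_supp, angle, rng)
                x_tr = np.vstack([x_tr, x_s])
                y_tr = np.concatenate([y_tr, y_s])
            w = fit_linear_discriminant(x_tr, y_tr)
            aucs.append(auc(x_test @ w, y_test))
        results[n_supp] = float(np.mean(aucs))
    return results


if __name__ == "__main__":
    for n_supp, mean_auc in run_experiment().items():
        print(f"supplemental cases per class: {n_supp:5d}  mean AUC: {mean_auc:.3f}")
```

Sweeping `angle` and `n_supp_list` is one way to explore how the benefit of added cases depends on the size of the shift and the number of supplemental cases; the specific curves will depend on the assumed distributions.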
