Supplementing training with data from a shifted distribution for machine learning classifiers: adding more cases may not always help

KH Cha, A Gossmann, N Petrick… - Medical Imaging 2020 …, 2020 - spiedigitallibrary.org
Medical Imaging 2020: Image Perception, Observer Performance, and …, 2020 - spiedigitallibrary.org
In this study, we show that when a training data set is supplemented by drawing samples from a distribution that differs from that of the target population, the differences between the distributions of the original and supplemental training populations should be considered in order to maximize the performance of the classifier in the target population. Depending on these distributions, drawing a large number of cases from the supplemental distribution may result in lower performance than limiting the number of added cases. This is relevant for medical images when synthetic data is used for training a machine learning algorithm, which may result in a mixed distribution for the training set. We simulated a two-class classification problem and determined the performance of a linear classifier and a neural network classifier on test cases when trained with cases from only the target distribution, and when cases from a shifted, supplemental distribution are added to a limited number of cases from the target distribution. We show that adding data from a supplemental distribution for machine learning classifier training may improve performance on the target test distribution. However, given the same number of training cases, a classifier trained on a mixed distribution may not reach the performance of one trained only on data from the target distribution. In addition, the increase in performance will peak or plateau, depending on the shift in the distribution and the number of cases from the supplemental distribution.
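The kind of simulation described in the abstract can be sketched in a few lines. The following is only an illustrative sketch under assumptions not stated in the abstract: both classes are taken to be unit-variance 2-D Gaussians, the supplemental distribution is the target distribution with both class means offset by a hypothetical `shift` along one axis, a nearest-class-mean rule stands in for the linear classifier, and accuracy stands in for whatever performance metric the authors used. All function names and parameter values are hypothetical.

```python
import random

def draw(n, mean, rng):
    """Draw n 2-D points from a unit-variance Gaussian centred at `mean`."""
    return [(rng.gauss(mean[0], 1.0), rng.gauss(mean[1], 1.0)) for _ in range(n)]

def draw_classes(n_per_class, shift, rng, separation=1.5):
    """Return (negatives, positives); `shift` offsets both class means along x.

    shift=0.0 is the (assumed) target distribution; shift != 0 is the
    (assumed) supplemental distribution.
    """
    neg = draw(n_per_class, (shift, 0.0), rng)
    pos = draw(n_per_class, (shift + separation, 0.0), rng)
    return neg, pos

def centroid(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fit_linear(neg, pos):
    """Nearest-class-mean rule: a linear boundary through the centroid midpoint."""
    m0, m1 = centroid(neg), centroid(pos)
    w = (m1[0] - m0[0], m1[1] - m0[1])
    mid = ((m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b

def accuracy(model, neg, pos):
    """Fraction of test cases on the correct side of the linear boundary."""
    w, b = model
    def score(p):
        return w[0] * p[0] + w[1] * p[1] + b
    correct = sum(score(p) <= 0 for p in neg) + sum(score(p) > 0 for p in pos)
    return correct / (len(neg) + len(pos))

def run_experiment(n_target, n_supplemental, shift, seed=0, n_test=5000):
    """Train on target cases plus optional shifted cases; test on target only."""
    rng = random.Random(seed)
    neg, pos = draw_classes(n_target, 0.0, rng)
    if n_supplemental > 0:
        s_neg, s_pos = draw_classes(n_supplemental, shift, rng)
        neg, pos = neg + s_neg, pos + s_pos
    test_neg, test_pos = draw_classes(n_test, 0.0, rng)
    return accuracy(fit_linear(neg, pos), test_neg, test_pos)

if __name__ == "__main__":
    # Sweep the number of supplemental cases added to 10 target cases per class.
    for n_sup in (0, 10, 100, 1000):
        print(n_sup, round(run_experiment(10, n_sup, shift=1.0), 3))
```

Averaging `run_experiment` over many seeds for each value of `n_supplemental` and `shift` would give a curve of target-distribution performance against the number of added cases, which is the quantity the abstract says may peak or plateau.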
SPIE Digital Library