Topical Review

Synergizing medical imaging and radiotherapy with deep learning


Published 22 June 2020 © 2020 The Author(s). Published by IOP Publishing Ltd
Citation: Hongming Shan et al 2020 Mach. Learn.: Sci. Technol. 1 021001, DOI 10.1088/2632-2153/ab869f


Abstract

This article reviews deep learning methods for medical imaging (focusing on image reconstruction, segmentation, registration, and radiomics) and radiotherapy (ranging from planning and verification to prediction), as well as the connections between them. Future topics are then discussed, involving semantic analysis through natural language processing and graph neural networks. It is believed that deep learning in particular, and artificial intelligence and machine learning in general, have revolutionary potential to advance and synergize medical imaging and radiotherapy for unprecedented smart precision healthcare.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

McCarthy et al [1] organized the Dartmouth workshop in 1956 to initiate artificial intelligence (AI) as a research field, with the lofty goal of simulating, enhancing, or even surpassing human intelligence. Given the tremendous potential and challenges, the excitement and frustration have been equally remarkable. Their interplay has led to alternating AI springs and winters, through which the AI field has developed step by step and been elevated to today's level, and we believe that this field will have an even brighter future.

Currently, AI is in a new spring, especially its sub-field machine learning (ML), which enjoys rapid development and constant innovation featuring deep neural networks, also known as deep learning. On August 30, 2019, the White House issued a memorandum on the Fiscal Year 2021 Administration Research and Development Budget Priorities [2], underlining that 'departments and agencies should prioritize basic and applied research investments that are consistent with the 2019 Executive Order on Maintaining American Leadership in Artificial Intelligence and the eight strategies detailed in the 2019 update of the National Artificial Intelligence Research and Development Strategic Plan.' According to the World Economic Forum [3], it was estimated in 2018 that 'by 2025, the amount of work done by machines will jump from 29% to more than 50% – but that this rapid shift will be accompanied by new labour-market demands that may result in more, rather than fewer, jobs.' 'Preparing the workforce for these changes will depend on a data-driven approach.'

In the overwhelmingly optimistic atmosphere surrounding deep learning, hype is also unavoidable, exaggerating the power of AI and over-promising the return on some AI-related investments. Furthermore, some peers wonder whether the potential of AI or deep learning techniques has been largely saturated, and worry that another AI winter is coming. In 2018, a team at the MIT Technology Review analyzed 16,625 arXiv papers in the AI section published since 1993, and found that the momentum is slowing down [4].

While the future is always hard to predict, such as how long the AI spring will last or how soon the next AI winter will come, as a group of medical physicists and AI engineers we believe that the AI spring will last into the near future. In addition to potential new methodological and technical progress, we are confident that major efforts are needed to translate current (and future) deep learning methods into medical physics and clinical practice. This need is a powerful engine that will drive AI research and development productively for at least ten years.

Two well-known AI-related areas in the field of medical physics are medical imaging and radiation therapy (radiotherapy). The penetration of AI and deep learning into these two areas has been significant. Generally speaking, more deep learning methods and results have been published on medical imaging than on radiotherapy. On the other hand, in a good sense radiotherapy is subject to more uncertainties than medical imaging, since classic medical imaging technology can already deliver decent image quality, while image-guided radiotherapy still suffers from insufficient margin definition, mismatches between relevant images, and difficulties in predicting outcomes and optimizing therapies. Figure 1 illustrates the big picture of deep imaging towards radiotherapy, adapted from our perspective on deep imaging [5]. It is widely known that machine learning, especially neural networks, can be used for medical image analysis, going from images to features. What first interested us is tomographic image formation. Since we use deep learning techniques, we call this area data-driven, learning-based, deep tomographic reconstruction, or simply 'deep recon'. The process from data to images and the one from images to features can be integrated into a so-called 'end-to-end' workflow in a unified machine learning framework, that is, going from raw data directly to diagnosis. Since the raw data are the starting point, we call this process 'rawdiomics' instead of 'radiomics'. This perspective on tomographic reconstruction also served as the basis for the first special journal issue dedicated to the theme of 'Machine Learning for Image Reconstruction' [6]. These features can be used to guide radiotherapy in particular and medical interventions in general.


Figure 1. Overall big picture of deep imaging towards radiotherapy.


In the following, we first give a brief introduction to deep learning architectures in section 2. We then review medical imaging in section 3, covering image reconstruction, segmentation, registration, and radiomics, and radiotherapy in section 4, covering planning, verification, and prediction. Note that we review highly cited papers and recent key papers to reflect the state of the art in both fields. Finally, sections 5 and 6 discuss future topics and conclude this article, respectively.

2. Deep learning

Deep learning (DL), as a mainstream of ML, uses trainable computational models that contain multiple processing components with adjustable parameters to learn a representation of data [7]. DL methods are mainly based on artificial neural networks (ANNs), which were inspired by information processing in biological neural systems, but with various abstractions of and approximations to their biological counterparts. Over the past decade, researchers have proposed various deep learning architectures for different tasks. Figure 2 illustrates the DL architectures widely used in the medical imaging and radiotherapy areas. For a broader and more detailed introduction to DL, please refer to the DL books [8–10].


Figure 2. Popular deep learning architectures for medical imaging and radiotherapy.


2.1. Multilayer perceptron

The most basic neural network is the multilayer perceptron (MLP), which loosely mimics the human brain [11]. An MLP is based on a collection of connected nodes called artificial neurons, which simulate biological neurons (figure 2(a)). Each link between two nodes can transmit a signal from one to the other, playing a role similar to that of a synapse. A non-linear activation function, such as the sigmoid or tanh, is used to represent the rate of action-potential firing or the state of neural excitation.

The universal approximation theorem states that an MLP with a single hidden layer containing sufficiently many neurons can approximate any continuous function on compact subsets of n-dimensional space arbitrarily well [12]. Therefore, MLPs have been used for tasks that involve a non-linear mapping, including computer vision, speech recognition, machine translation, etc. However, the fully-connected configuration dramatically increases the number of model parameters as the problem size grows, bringing difficulties in training and testing.
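As a concrete illustration, a minimal PyTorch sketch of an MLP is given below; the layer widths, activation choice, and input dimension are illustrative assumptions rather than values from any work reviewed here.

```python
import torch
import torch.nn as nn

# Minimal MLP sketch: two hidden layers with Tanh activations.
# All layer sizes are illustrative placeholders.
mlp = nn.Sequential(
    nn.Linear(64, 128), nn.Tanh(),
    nn.Linear(128, 128), nn.Tanh(),
    nn.Linear(128, 1),
)

x = torch.randn(8, 64)   # batch of 8 input vectors with 64 features each
y = mlp(x)               # forward pass, output shape (8, 1)
```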

2.2. Convolutional neural network

A convolutional neural network (CNN) is one of the most successful neural networks [13–15]. The CNN architecture reflects four key ideas: local connections, shared weights, pooling, and the integration of multiple layers [7]. A CNN consists of an input layer and an output layer, as well as multiple hidden layers [16] (figure 2(b)). The hidden layers typically consist of a series of convolutional layers, each convolving a local input with filters to produce the corresponding local output. The activation function is commonly a rectified linear unit (ReLU) [17], subsequently followed by additional operations such as pooling layers (max-pooling or average-pooling). Several fully-connected layers can then be used to learn a global representation from the feature maps. Strictly speaking, the convolutional layers perform sliding dot products or cross-correlations and then process the outcomes non-linearly, while mathematical convolution itself is only a linear operation.

CNNs excel at capturing local features from data such as images or text, and can significantly reduce the model complexity compared with MLPs. CNNs are commonly used for image classification, i.e. predicting the label of a given input image. Note that although the translational invariance introduced by pooling layers is important for object recognition, Aydore et al [18] pointed out that it is not applicable to brain activation images, in which meaningful structures are location-specific in the brain.
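To make the conv-ReLU-pool pattern described above concrete, here is a minimal PyTorch sketch of a small image classifier; the channel counts, input size, and number of classes are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

# Minimal CNN classifier sketch: two conv -> ReLU -> pool blocks followed
# by a fully-connected layer. All sizes are illustrative placeholders.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),   # assumes 64x64 single-channel inputs, 10 classes
)

x = torch.randn(4, 1, 64, 64)      # batch of 4 single-channel images
logits = cnn(x)                    # output shape (4, 10)
```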

2.3. Fully convolutional network

Just as its name implies, a fully convolutional network (FCN) is composed of convolutional layers without any fully-connected layers (figure 2(c)). Long et al [19] first showed that the fully-connected layers of a CNN can be converted into convolutional layers for semantic segmentation. A popular FCN structure is the convolutional encoder-decoder network, in which the encoder uses several convolutional layers to encode an input image while the decoder uses several deconvolutional layers to decode the representation. The structure of a convolutional encoder-decoder network is usually symmetric. Based on convolutional encoder-decoder networks, variants can be obtained using different skip connections between the encoder and the decoder. For example, the residual encoder-decoder convolutional neural network (RED-CNN) uses skip connections that add encoder feature maps to later layers in the decoder [20], while U-net uses skip connections that copy encoder feature maps as additional input to later layers in the decoder [21]. Given the same depth and width, U-net is more flexible but has slightly more parameters than RED-CNN.
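The following minimal PyTorch sketch contrasts the two skip-connection styles in a toy encoder-decoder; the single-stage depth and channel counts are illustrative assumptions and do not reproduce the cited architectures.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Toy convolutional encoder-decoder with both skip styles:
    a U-net-style concatenation skip and a RED-CNN-style additive skip.
    All depths and channel counts are illustrative placeholders."""
    def __init__(self):
        super().__init__()
        self.enc  = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.up   = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # decoder sees 16 upsampled channels + 16 copied encoder channels
        self.dec  = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, x):
        e = self.enc(x)                  # encoder feature map
        d = self.up(self.down(e))        # bottleneck then upsampling
        d = torch.cat([d, e], dim=1)     # concatenation (U-net-style) skip
        return self.dec(d) + x           # additive (RED-CNN-style) skip

out = TinyEncoderDecoder()(torch.randn(2, 1, 64, 64))   # output matches input size
```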

An FCN aims at learning a representation and making decisions based on local input, and has been widely used in image-to-image translation tasks such as segmentation, registration, and image processing. Because it uses only convolutional layers, an FCN does not depend on the input size, which means that one can train an FCN on image patches and apply the trained model to full-size images.

2.4. Generative adversarial network

Recently, the generative adversarial network (GAN) [22] has attracted major attention as a powerful way to model complicated data distributions. A GAN consists of a pair of sub-networks: a generator and a discriminator (figure 2(d)). The generator takes random noise as its input to generate a sample. The discriminator receives both generated and real samples and does its best to distinguish between the two kinds of samples. This is a game between the two sub-networks: the generator learns to produce more and more realistic samples, and the discriminator needs to become smarter and smarter at distinguishing fake data from real data. The two players are trained alternately, and the goal is that the competition drives the generated samples to be indistinguishable from real data.
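A minimal PyTorch sketch of one adversarial training step on toy data is given below; the network sizes, learning rates, and synthetic 'real' samples are assumptions for illustration, not settings from any cited work.

```python
import torch
import torch.nn as nn

# Toy GAN training step. All sizes and hyper-parameters are illustrative.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, 2) + 3.0          # stand-in for real data samples
z = torch.randn(64, 8)                   # noise input to the generator

# Discriminator step: real samples labelled 1, generated samples labelled 0.
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator label fakes as real.
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```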

In the original GAN, the input to the generator is a noise vector sampled from a predefined distribution. GANs have since been extended in a number of ways and applied to image processing and other tasks, playing an important role in regularizing network outcomes. To a large extent, the adversarial loss of a GAN resembles the perceptual loss based on a pre-trained VGG network [23], but the adversarial loss is learnt adaptively from the given data. GANs have evolved rapidly in recent years; see a recent survey [24]. Notably, an unpaired variant of GAN, called cycleGAN [25], has attracted strong interest because it does not require paired images, which is highly desirable in the many applications where paired images are expensive to obtain.

2.5. Recurrent neural network

A recurrent neural network (RNN) is a type of network in which the output from the previous step is fed into the current step (figure 2(e)). Unlike traditional neural networks, whose data flow is purely feed-forward, an RNN has a 'memory' remembering what has been seen. For example, when we predict the next word in a sentence, the previous words are required; hence, the RNN needs to remember them. The most important feature of an RNN is its hidden state, which records information about the sequence. An RNN typically uses the same parameters for each input, as it performs the same task at every step, which helps reduce the model complexity.

RNNs are widely used for sequential data such as text or video. However, RNNs suffer from vanishing and exploding gradients. The famous long short-term memory (LSTM) network [26] was proposed to address these shortcomings using multiple gates, including input, output, and forget gates. Later on, the gated recurrent unit (GRU) [27] was introduced, which is similar to the LSTM but has fewer parameters.
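As a small illustration, the following PyTorch sketch runs an LSTM over a toy batch of sequences and predicts a scalar from the last hidden state; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal LSTM sketch for sequence prediction. All sizes are placeholders.
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)

x = torch.randn(4, 20, 16)    # batch of 4 sequences, 20 steps, 16 features per step
out, (h, c) = lstm(x)         # out: hidden state at every step; (h, c): final states
y = head(out[:, -1])          # predict from the last hidden state, shape (4, 1)
```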

2.6. Deep reinforcement learning

Deep reinforcement learning (DRL) combines DL (function approximation) with reinforcement learning (target optimization), which drives agents to learn the best actions to achieve their goals in a virtual environment by mapping state-action pairs to expected rewards (figure 2(f)). The combination of a deep neural network and a reinforcement learning algorithm has led to breakthroughs such as Google DeepMind's AlphaGo [28], which beat world champions of the game of Go. In other words, DRL is a goal-oriented algorithm that learns to achieve a complex goal by maximizing the associated objective function over many rewarding/penalizing steps. It has been widely used in video games, robotics, finance, and healthcare [29].
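To illustrate the idea of mapping state-action pairs to expected rewards, here is a minimal deep Q-learning update step in PyTorch on a single synthetic transition; the state/action dimensions, discount factor, and environment are hypothetical and unrelated to the systems cited above.

```python
import torch
import torch.nn as nn

# Toy deep Q-learning update. State size 4 and 3 actions are placeholders.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99                              # discount factor (assumed)

state = torch.randn(1, 4)                 # synthetic transition (s, a, r, s')
action = torch.tensor([1])
reward = torch.tensor([0.5])
next_state = torch.randn(1, 4)

# Temporal-difference target: reward plus discounted best next-state value.
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max(dim=1).values

q_sa = q_net(state).gather(1, action.view(-1, 1)).squeeze(1)   # Q(s, a)
loss = nn.functional.mse_loss(q_sa, target)
opt.zero_grad(); loss.backward(); opt.step()
```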

3. Medical imaging

3.1. Image reconstruction

While many of us enthusiastically embrace the new wave of medical imaging research with deep learning, there are also doubts and concerns from other colleagues. This conflict of opinions is natural and healthy. In retrospect, at the beginning of the development of analytic reconstruction, a major critique was that, given a finite number of projections, the tomographic reconstruction is not uniquely determined (ghosts) [30, 31]. Later, this was successfully addressed through regularization. When iterative reconstruction algorithms were first developed, it was observed that a reconstructed image was strongly influenced by the penalty term; in other words, it appeared that what you reconstructed was what you wanted to see, for example in the under-determined case described in [32]. Nevertheless, by optimizing the data acquisition protocol, the reconstruction parameters, and the stopping criteria, iterative algorithms were made mature enough to be incorporated into commercial scanners [33]. As far as compressed sensing is concerned, it is widely known that there is a chance that a sparse solution is not the truth [34]. For example, a tumor-like structure could be introduced, or pathological vessel narrowing might be smoothed out or digitally treated if total variation is overly minimized [35]. Currently, deep learning presents practical issues such as the black-box problem, meaning a lack of interpretability of the successes of deep learning methods. Indeed, there are no Maxwell equations for deep learning yet, and a neural network is trained as a black box by adjusting its parameters on big data. The interpretability of neural networks remains a hot topic. Given rapid progress in both theoretical and practical aspects, we believe that machine learning algorithms will become the mainstream for medical imaging [36]. In August 2019, Tufts University organized the Conference on Modern Challenges in Imaging to celebrate the 40th anniversary of Allan Cormack's Nobel prize. Leading researchers shared their work, insights, and concerns about applying deep learning to medical image reconstruction; for more information, please refer to [37].

Traditionally, tomographic image reconstruction algorithms are categorized into analytic reconstruction and iterative reconstruction [36]. Thanks to DL, now a new category of image reconstruction methods is emerging, which are ML- or DL-based [5, 6, 36, 38]. It is our point of view that in principle deep reconstruction methods ought to outperform analytic reconstruction, iterative reconstruction, and compressed sensing algorithms for tomographic image reconstruction because of the following three arguments [5, 6, 36]. First, the results from analytic and iterative algorithms can be used as the baseline, from which DL can improve image quality further in various ways; for example, the baseline and raw data can be integrated into a deep network to produce better results. Second, valuable ingredients of analytic and iterative algorithms can be used in a neural network to enhance the network instead of competing with it. Third, prior knowledge commonly used for iterative reconstruction and compressed sensing can be greatly enhanced or even replaced by neural networks equipped with much more extensive priors learnt from big data.

Table 1 summarizes some recently proposed deep-learning-based image reconstruction algorithms, where we characterize the methods in terms of modality, input format, input type, network structure, skip connection, level of supervision, and whether a GAN is used for optimization.

Table 1. Recently proposed deep-learning-based image reconstruction methods.

Ref | Modality | Input Format (Img. / Raw) | Input Type | Network Structure | Skip Connection (Sum / Conc.) | Supervision Level (Sup. / Unsup.) | GAN
[44]CT $\surd$  2DFCN   $\surd$   
[20]CT $\surd$  2DFCN(E-D) $\surd$   $\surd$   
[45]CT $\surd$  2DFCN   $\surd$   $\surd$
[42]CT $\surd$  2.5DFCN(E-D)  $\surd$ $\surd$   $\surd$
[46]CT $\surd$  2DFCN   $\surd$   
[47]CT $\surd$  2DFCN(E-D)  $\surd$ $\surd$   $\surd$
[48]CT $\surd$  2DFCN $\surd$ $\surd$ $\surd$   
[49]CT $\surd$  3DFCN $\surd$   $\surd$   $\surd$
[50]CT $\surd$  3DFCN   $\surd$   $\surd$
[51]CT  $\surd$ 2DFCN $\surd$   $\surd$   
[52]CT  $\surd$ 2DCNN  $\surd$ $\surd$   $\surd$
[53]CT  $\surd$ 2DFCN   $\surd$   
[54]CT  $\surd$ 2DCNN   $\surd$   
[55]CT  $\surd$ 2DCNN/FCN(E-D) $\surd$   $\surd$   $\surd$
[56]CT/MRI $\surd$  2DFCN(E-D)  $\surd$   $\surd$  
[57]CT $\surd$  2DFCN $\surd$ $\surd$   $\surd$ $\surd$
[58]CT $\surd$  2DFCN(E-D) $\surd$ $\surd$ $\surd$   
[59]MRI  $\surd$ 2DFCN $\surd$   $\surd$   
[60]MRI  $\surd$ 2DCNN   $\surd$   
[61]MRI  $\surd$ 2DFCN $\surd$   $\surd$   
[62]MRI $\surd$  2DFCN(E-D) $\surd$   $\surd$   $\surd$
[63]MRI $\surd$  2DFCN(E-D)  $\surd$ $\surd$   
[64]MRI $\surd$  2DFCN $\surd$   $\surd$   $\surd$
[65]MRI $\surd$  2DFCN(E-D) $\surd$ $\surd$ $\surd$   $\surd$
[66]PET $\surd$  2DFCN   $\surd$  
[67]PET $\surd$  2DMLP   $\surd$  
[68]PET $\surd$  3DFCN(E-D)  $\surd$   $\surd$
[69]PET  $\surd$ 3DFCN(E-D)  $\surd$ $\surd$  
[70]SPECT  $\surd$ 2DCNN/FCN(E-D)   $\surd$  

Modality: Here we only discuss those imaging modalities that have been widely used in radiotherapy; i.e. computed tomography (CT), magnetic resonance imaging (MRI), and nuclear imaging techniques such as positron emission tomography (PET) and single-photon emission computed tomography (SPECT). Among these modalities, CT and MRI are the two most extensively studied for image reconstruction due to the public availability of low-dose CT and fast MRI datasets. Originally designed for a low-dose CT reconstruction challenge, the Mayo low-dose CT dataset contains perfectly registered low-dose and normal-dose CT scans, and has become a benchmark dataset in deep CT image reconstruction [39]. Deep fast MRI aims at accelerating MR imaging with AI, and has been gaining increasing attention [40, 41]. Nuclear imaging techniques such as PET and SPECT are also instrumental for modern radiotherapy, particularly to detect and characterize tumors before and after treatments such as radiotherapy and immunotherapy.

Input Format: DL-based image reconstruction can be done in two ways: directly mapping from raw data to a tomographic image, or image post-processing from a reconstructed image to an improved version. The raw data are sinogram data for CT/PET/SPECT and k-space data for MRI. Directly reconstructing an image from raw data requires the deep learning model to learn the physical process involved in the reconstruction (current models such as the Radon transform or Fourier transform are only approximations) and to utilize information on the image content hidden in the training data. Alternatively, image post-processing is relatively simple yet effective once a low-quality image has already been reconstructed using a classic reconstruction algorithm such as filtered back-projection or the inverse Fourier transform. We note that raw data are usually not available to researchers due to restrictions by vendors, and we anticipate that a deep learning model based on raw data should outperform one based on reconstructed images.

Input Type: Currently, the input to deep learning models is mainly 2D due to limited computing resources. However, 3D image post-processing techniques have been shown to give superior results [42, 43]. For raw data, it is also desirable to perform 3D reconstruction through deep learning, given sufficient GPU memory.

Network Structure: The network structures are roughly divided into MLP, CNN, FCN, and FCN(E-D). Here FCN(E-D) denotes the convolutional encoder-decoder network, which is one of the popular FCN architectures. For image post-processing, an FCN or FCN(E-D) is typically used to learn an image-to-image mapping. For direct mapping from raw data to a tomographic image, most methods use an FCN or FCN(E-D) for image-domain processing, although some methods use fully-connected layers to learn the domain transform from raw data to images.

Skip Connection: Skip connections are widely used to enhance network flexibility and improve model performance. Remarkably, in the so-called residual dense block, skip connections are wired in a sophisticated fashion to increase the model capability.

Supervision Level: Most deep image reconstruction methods are based on supervised learning, which differs from traditional reconstruction methods that are unsupervised. Deep image post-processing methods usually require paired images to learn a mapping from low-quality to high-quality images. Such paired images are often unavailable in practice. In this case, numerical noise insertion can be used for low-dose CT; for example, the Mayo low-dose CT dataset was simulated by producing quarter-dose counterparts from the normal-dose sinograms. Some recent methods attempt to address the reconstruction process in an unsupervised learning mode, avoiding or weakening the need for paired data, which represents a current hot topic.

GAN: For image reconstruction, GANs are widely used to enhance image quality. The benefit of a GAN is that it introduces a data-driven regularizer to ensure that the learnt distribution approaches the ground truth. That is, tomographic images are generated or reconstructed as faithfully as a numerical observer or discriminator can describe.

In order to train a deep recon model, an appropriate objective function should be chosen. Widely used losses for image reconstruction include the mean-squared error, mean absolute error, structural similarity [71], perceptual loss [72], adversarial loss [22], and so on. The combination of these losses depends on the specific task and even the dataset. For example, the mean-squared error can reduce image noise, but it tends to over-smooth images. The adversarial loss enhances the image quality with a discriminator, but it requires more training effort. For more discussion, please refer to [73].
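As a sketch of how such losses are commonly combined, the following PyTorch function mixes MSE and MAE terms with an optional adversarial term; the function name, weights, and the optional discriminator argument are illustrative assumptions, not values from the reviewed papers.

```python
import torch
import torch.nn.functional as F

def recon_loss(pred, target, discriminator=None, w_mse=1.0, w_mae=0.1, w_adv=1e-3):
    """Illustrative composite reconstruction loss: weighted MSE + MAE,
    optionally plus a non-saturating adversarial term from a discriminator.
    All weights are placeholders to be tuned per task and dataset."""
    loss = w_mse * F.mse_loss(pred, target) + w_mae * F.l1_loss(pred, target)
    if discriminator is not None:
        logits = discriminator(pred)      # discriminator judges the reconstruction
        loss = loss + w_adv * F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
    return loss
```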

To validate the model, quantitative image quality metrics such as the peak signal-to-noise ratio, structural similarity, and root-mean-square error are typically employed to compute the difference between the generated and ground-truth images. However, a higher quantitative metric does not always mean a better diagnostic performance. Therefore, task-specific measures are clinically most important, such as the receiver operating characteristic (ROC) curve and area under the curve (AUC) obtained in a human reader or numerical observer study. Neural networks were recently developed to perform reader studies [74, 75].
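For reference, minimal PyTorch implementations of two such metrics, RMSE and PSNR, might look as follows; the data_range convention is an assumption that depends on how the images are scaled.

```python
import torch

def rmse(pred, target):
    # Root-mean-square error between a generated image and the ground truth.
    return torch.sqrt(torch.mean((pred - target) ** 2))

def psnr(pred, target, data_range=1.0):
    # Peak signal-to-noise ratio in dB, assuming intensities in [0, data_range].
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(data_range ** 2 / mse)
```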

3.2. Image segmentation

Medical image segmentation aims to delineate the boundary of an organ or lesion of interest in medical images. In radiotherapy, segmentation plays a crucial role, since it is needed to identify the target lesion and avoid healthy tissues during treatment. Currently, the treatment target and normal structures are commonly delineated by oncologists, often slice by slice, which is time-consuming and cumbersome. In addition, the boundaries delineated by different oncologists may vary significantly, and even the same oncologist may not be able to reproduce his/her own delineation very well [76–78]. Therefore, automated image segmentation has been a hot research area over the past decade.

Traditional automatic segmentation methods for radiotherapy are based on the analysis of image content and properties such as voxel intensities, gradients, and textures [79]. Based on the information involved, traditional automatic segmentation methods can be divided into the following categories: region-based methods [80–82], edge-detection-based methods [83–85], atlas-based methods [86–90], statistical-model-based methods [91–94], and machine learning methods [95–98]. Among these, machine learning methods produce promising results, as they learn the image prior in a data-driven manner. These methods used to be based on traditional machine learning algorithms such as the support vector machine (SVM) [99], random forest [100], and Gaussian process [101], which suffer from insufficient capability and unsatisfactory performance for clinical tasks.

Deep learning, as the mainstream of machine learning research, is now shaping this area rapidly. Different from the traditional machine learning algorithms, deep learning allows computational models that are composed of multiple layers to learn a representation of data and knowledge from big data [7]. As a result, deep learning methods can dramatically improve the state-of-the-art in the image segmentation field, showing a great potential to facilitate radiotherapy.

In this section, we survey the deep learning-based segmentation methods developed in recent years as summarized in table 2. For clarity, we will discuss those methods in terms of organ/tumor, modality, input, network, skip connection, supervision, and training data. We also make some notes in the last column.

Table 2. Recently proposed deep-learning-based segmentation methods.

Ref | Tumor/Organ | Modality | Input Type | Network Structure | Skip Connection (Sum / Conc.) | Supervision Level | # Training | Note
[102]Rectal TumorMRI2DCNN  Fully70 
[103]Renal TumorCT3DFCN(E-D)  $\surd$ Fully89Dice coefficient loss
[104]Portal veinCT2.5DFCN  Fully64Used with MRF
[105]Brain TumorMRI3DFCN(E-D) $\surd$ $\surd$ Fully285Ensemble, multi-label
[106]Brain TumorMRI3DCNN  Fully48Used with CRF
[107]Brain TumorMRI3DFCN  Fully30Local and global feature, cascaded
[108]Brain TumorMRI3DFCN $\surd$  Fully274 
[109]Brain TumorMRI2DCNN  Fully10Holistically nested
[110]Brain TumorMRI3DCNN  Fully265Separate validation
[111]Brain TumorMRI2.5DCNN  Fully15Multi-source
[112]HippocampusMRI3DFCN(E-D)  $\surd$ Fully637Multi-task
[113]HeadneckCT3DCNN  Fully33Interleaved network
[114]Brain TumorMRI2.5DCNN  Fully15Atlas probability
[115]Breast/fibroglandularMRI2DFCN(E-D)  $\surd$ Fully39Two consecutive U-nets
[116]Brain structuresMRI/US2DCNN  Fully40/45Hough voting, multi-modality
[117]ProstateMRI3DFCN(E-D) $\surd$ $\surd$ Fully50 
[118]Multi-organCT2.5DFCN(E-D) $\surd$  Fully230 
[119]ProstateCT2DCNN  Fully73Refined with Atlases
[120]Thoracic organsCT2DRNN/FCN(E-D)  $\surd$ Fully25CRF as RNN
[121]Thoracic organsCT3DFCN(E-D)  $\surd$ Fully25Collaborative architectures
[122]Multi-organCT3DFCN  Fully112 
[123]SpineCT2DFCN  Fully32Redundant class label
[124]Liver tumorCT2DFCN(E-D)  Fully130Hierarchical
[125]Head NeckCT2DCNN  Fully40 
[126]MeningiomasMRI3DFCN $\surd$  Fully249 
[127]Lung TumorPET3DFCN(E-D)  $\surd$ Fully50 
[128]Head NeckCT3DFCN(E-D) $\surd$ $\surd$ Fully261Squeeze-and-excitation
[129]Multi-organCT3DFCN(E-D)  $\surd$ Fully48Dose calculation
[130]Brain TumorMRI2DRNN/FCN(E-D)  $\surd$ Fully274CRF as RNN
[131]Multi-organ/brainMRI2DFCN $\surd$ $\surd$ Fully10/54Interactive
[132]BrainMRI3DFCN $\surd$  Fully5multi-modality/level information
[133]LiverCT2DFCN(E-D) $\surd$  Fully131Squeeze-and-excitation
[134]ProstateMRI3DFCN(E-D) $\surd$ $\surd$ Fully50Boundary-weighted domain adaptive
[135]Head NeckPET3DFCN(E-D)  $\surd$ Weakly47Using Bounding box

Organ/Tumor: Segmentation of various organs and tumors has been widely reported in the literature, such as segmentation of brain tumors, rectal tumors, liver tumors, etc. Among these reports, brain image segmentation seems to be the most popular research topic, partially due to the well-organized and publicly available datasets from reputable competitions such as the Brain Tumor Segmentation (BRATS) challenges (2012–2018) and the Ischemic Stroke Lesion Segmentation challenges (2015–2018).

Modality: In radiotherapy, imaging modalities are selected for diagnostics, treatment planning, and follow-up. Such medical images are critical to identify treatment targets and to spare normal tissues/organs from significant radiation damage. CT and MRI are two popular imaging modalities used in this context. CT can be used to estimate electronic density, which enables dosimetric calculation in radiotherapy. For soft tissue anatomies, MRI is more advantageous, depicting both anatomy and pathology of brain, prostate, and so on. A future research direction is to combine CT and MRI for radiotherapy [136, 137] or to synthesize one from another via deep learning [138, 139].

Input Type: The input to networks for segmentation can be classified into three types based on the dimensionality of images: 2D, 2.5D, and 3D. Using 2D images as input allows more training examples, which is helpful when data are limited, but 2D slices miss 3D spatial information relevant to organ/lesion segmentation. Using 3D input keeps the 3D spatial information and promises improved segmentation quality, but the training process needs a large memory and a longer computational time. Usually, the more patients are involved in training, the more costly the development effort. As a compromise between 2D and 3D inputs, a 2.5D input can be used, which feeds multiple 2D slices from a 3D volume into a neural network. It is computationally efficient and yet utilizes 3D spatial information to a good degree. To boost a training set, a practical way is to use overlapping patches to train a deep learning model; the testing set should not intersect with the training and validation sets.

Network Structure: The basic network architectures are roughly classified into the following three categories: CNN, FCN or FCN(E-D), and RNN. Previously, the segmentation problem was treated as a classification problem, where the input is an image patch, and the corresponding label is the class. Then, attempts were made to generate a segmentation map directly from the input using either fully connected, convolutional, or encoder-decoder networks. Furthermore, the recurrent neural network was also introduced to learn the spatial relationship among features extracted from CNN [140]. It should be noted that a capsule network [141], which is a new modularized architecture, was also applied to image segmentation [142].

Skip Connection: In addition to the typical (de)convolutional layers, skip connections play an important role in sharing information and increasing the performance of networks. Two common operations following a skip connection are summation and concatenation. A skip connection with a summation operation, also called a residual skip connection, enables the network to learn the residual information between input and output, and dramatically increases the achievable depth of the network without suffering from the vanishing gradient problem [14]. Different from the residual skip connection, a concatenation skip connection copies feature maps from earlier layers and reuses them as the input to later layers, which carries more information than the residual skip connection. The encoder-decoder network coupled with concatenation skip connections forms a powerful network called U-net [21].

Supervision Level: Most deep-learning-based segmentation methods are trained in a fully-supervised manner, where an ideal or very accurate segmentation map of the target tumors or organs is provided. However, manual delineation of a large dataset is time-consuming and subjective, with large inter- and intra-expert variabilities. To address this challenge, weakly-supervised or semi-supervised segmentation methods are worth exploring. Weakly-supervised methods may not have accurate labels, but they have alternative labels that can be annotated relatively easily. For example, precise segmentation maps can be approximated by bounding boxes [135] and extreme points [143]. Semi-supervised learning relies on labels for only a few samples, with the remaining samples unlabelled.

Training Data: A moderately sized training dataset is necessary to train a good segmentation model. As summarized in table 2, most studies take 20∼300 patient scans to train a segmentation model. When sufficient training data are lacking, transfer learning or domain adaptation can be used to transfer knowledge from related data. Alternatively, it is viable to reduce the model size to avoid over-fitting a small dataset.

The loss functions for optimizing a segmentation model can be roughly divided into two classes: cross-entropy-based and Dice-based measures. There are variants such as the weighted cross-entropy, focal loss [144], and Tversky loss [145]. Please refer to [146] for a taxonomy in the context of medical image segmentation. To validate a trained model, metrics such as intersection over union, pixel accuracy, Hausdorff distance, and Dice similarity are often used. A good reference is Taha et al [147], which describes 20 metrics widely used in the image segmentation community.
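As an illustration, a soft Dice loss for training and a Dice coefficient for validation could be sketched in PyTorch as follows; the tensor shapes and smoothing constant are illustrative assumptions.

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    """Illustrative soft Dice loss for binary segmentation.
    probs:  predicted foreground probabilities, shape (N, H, W)
    target: binary ground-truth masks, same shape"""
    dims = (1, 2)
    intersection = (probs * target).sum(dims)
    union = probs.sum(dims) + target.sum(dims)
    dice = (2.0 * intersection + eps) / (union + eps)
    return 1.0 - dice.mean()

def dice_coefficient(pred_mask, target_mask, eps=1e-6):
    # Dice similarity of a thresholded prediction, used as a validation metric.
    inter = (pred_mask * target_mask).sum()
    return (2.0 * inter + eps) / (pred_mask.sum() + target_mask.sum() + eps)
```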

3.3. Image registration

Image registration is the process of transforming different types of images into a common coordinate system, where information gained from two or more images is usually complementary [148, 149]. Image registration plays a key role in many tasks such as multi-modality imaging, adaptive treatment planning, image-guided radiotherapy, and prognostic assessment [150]. With the rapid development of deep learning techniques, deep-learning-based methods are being constantly developed, refreshing the landscape of image registration research.

Here we highlight some recently proposed deep learning-based image registration methods in table 3, including both supervised and unsupervised methods. Many iterative methods are not included due to their inferior performance and slow speed. We will discuss the selected deep-learning-based registration methods in terms of label, transformation, input/output, modality, body region, and network architecture.

Table 3. Recently proposed deep-learning-based registration methods.

Ref | Label | Transformation | Input/Output | Modality | Region | Network Structure
[154]SyntheticRigid2D/3DDRR/x-rayBoneCNN
[155]SyntheticRigid2D/3DCTThoraxSVRNet
[156]SyntheticRigid2D/3Dx-rayBoneCNN
[157]SyntheticRigid2D/2DMRIBrainCNN/FCN
[158]SyntheticRigid3D/3DMRIBrainAIRNet
[159]SyntheticRigid2D/2DMR/TRUSProstateGAN
[160]SyntheticDeformable3D/3DMRIBrainCNN
[161]SyntheticDeformable3D/3DCTChestRegNet
[162]SyntheticDeformable3D/3DCT/USLiverDVNet
[163]SyntheticDeformable2D/2DMRIBrain/CardiacFlowNet
[164]SyntheticDeformable3D/3DCTChestU-Net-Advanced
[152]RealDeformable2D/2DMRIAbdominalCNN
[151]RealDeformable3D/3DMRIBrainFCN
[165]RealDeformable3D/3DMRIBrainVoxelMorph CNN
[166]N/ADeformable2D/3DMRIBrainICNet
[167]N/ADeformable3D/3DMRIBrain3D UNet
[168]N/ADeformable3D/3DMRIBrainVoxelMorph CNN
[153]N/ADeformable3D/3DCTLiverCycleGAN

Label: The ground truth for image registration is usually difficult to obtain clinically. Thus, in most studies, synthetic ground-truth data were used to train and validate registration methods. Real ground truth may be available for special tasks, such as registration to an atlas [151] or respiratory motion correction [152]. Due to the difficulty of acquiring ground truth, unsupervised image registration methods that use an unpaired training strategy, such as cycleGAN [153], attract much attention.

Transformation: The transformation used for image registration can be either rigid or deformable. A rigid transformation is a geometric transformation in a Euclidean space that preserves the Euclidean distance between every pair of points; it includes rotations, translations, reflections, or their combinations. However, in practice a patient's anatomy may change non-rigidly, due to weight loss, tumor shrinkage, and/or physiological variation, which cannot be modeled with a rigid transformation. In contrast, a deformable transformation has a great degree of freedom to establish the correspondence of key points before and after deformation [169].

Input/Output: The input and output of the network are usually of the same dimensionality; for example, in the registration between two 2D images [157] or two 3D volumes [158]. However, special cases exist, such as the 2D/3D registration problem, which means, for example, finding the best match between one or more intra-operative x-ray projections of the patient and a preoperative 3D volume [154, 170].

Modality: CT and MRI remain the two most popular modalities being studied, as they are widely used in clinical routine. Registration happens either within the same modality [153, 160] or across multiple modalities such as CT and MRI. Other imaging modalities also have registration needs; for example, ultrasound (US) to CT image registration [162] and MRI to transrectal ultrasound (TRUS) image registration [171].

Organ: Similar to the image segmentation methods, most image registration methods focus on brain images due to the importance of the brain and the public availability of large datasets and atlases. For interventional guidance applications, the targeted organs include the prostate, the liver, and the lungs.

Network Structure: All of the popular network structures have been used for image registration, and the type of network depends on the transformation in use. For rigid transformation, CNNs are typically used to learn the transformation parameters. For deformable transformation, an FCN or FCN(E-D) can be used to model the underlying deformation field. Notably, GANs have been used to enhance registration performance in both supervised and unsupervised learning [153, 159].

When optimizing the rigid registration model, mean-squared error is commonly used. For the deformable registration model, an image similarity metric is typically preferred, including the intensity sum of squared distance (SSD), mean squared distance (MSD), correlation ratio (CR), (normalized) cross-correlation (CC/NCC), (normalized) mutual information (MI/NMI), etc [172]. Likewise, those image similarity metrics can be used to evaluate the trained registration models.
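As an example of such a similarity measure, a global normalized cross-correlation can be sketched in PyTorch as follows (windowed or patch-wise variants are also common in practice); the toy images are placeholders.

```python
import torch

def ncc(a, b, eps=1e-8):
    """Illustrative global normalized cross-correlation between two images;
    it can serve as an evaluation metric or, negated, as a training loss."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

moving, fixed = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
similarity = ncc(moving, fixed)    # approaches 1 when the images are well aligned
loss = 1.0 - similarity            # one way to turn the metric into a loss
```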

3.4. Radiomics

Radiomics refers to the extraction and analysis of comprehensive features in medical images, such as from low-dose CT [173–175]. The key idea behind radiomics is that images contain more information than what can be visualized by radiologists, and sophisticated algorithms can be designed to distill this hidden information.

Traditional handcrafted features can be divided into shape-based features and texture-based features. Shape-based features describe a lesion of interest heuristically; for example, the total volume, surface area, surface-to-volume ratio, and lesion compactness. Texture-based features include second-order statistics, run-length features, the co-occurrence matrix, the histogram of oriented gradients, the local binary pattern, and so on [176–180]. Different from the handcrafted features, CNNs promise comprehensive multi-scale features for deep radiomics. Currently, deep learning methods handle millions (even billions) of parameters [13, 14, 23], resulting in an ultra-high-dimensional representation. In a practical sense, deep learning discovers intricate features in large datasets to deliver an unprecedented power for representation, classification, and prediction.

Table 4 summarizes the recently proposed deep-learning-based radiomics results that are highly related to radiotherapy. For clarity, we cover the following aspects: cancer type, modality, input type, training strategy, involvement of handcrafted features, network structure, and single or multiple tasks.

Table 4. Recently proposed deep-learning-based radiomics.

Ref | Cancer | Modality | Input (2D / 3D) | Training (Transfer / Scratch) | Handcrafted features | Network (CNN / E-D) | Task (Single / Multi) | Notes
[184]BrainMRI $\surd$   $\surd$   $\surd$ $\surd$   $\surd$  multi-scale
[185]BrainMRI $\surd$   $\surd$   $\surd$ $\surd$   $\surd$   
[186]BrainMRI $\surd$   $\surd$    $\surd$   $\surd$   
[187]BrainMRI  $\surd$ $\surd$    $\surd$   $\surd$   
[188]LungCT  $\surd$   $\surd$   $\surd$   $\surd$   
[189]LungCT  $\surd$   $\surd$   $\surd$ $\surd$   $\surd$  
[190]LungCT   $\surd$    $\surd$   $\surd$   
[191]LungCT $\surd$   $\surd$    $\surd$   $\surd$   
[192]LungHistopathology $\surd$   $\surd$    $\surd$   $\surd$   
[193]LungCT  $\surd$   $\surd$   $\surd$ $\surd$ $\surd$   
[194]LungCT  $\surd$ $\surd$    $\surd$    $\surd$  
[195]LungCT  $\surd$     $\surd$    multi-scale
[196]BreastMammography   $\surd$   $\surd$ $\surd$   $\surd$   
[197]BreastPathology $\surd$    $\surd$   $\surd$   $\surd$   
[198]BreastMammography $\surd$    $\surd$   $\surd$   $\surd$ $\surd$ multi-scale, multi-view
[199]BreastMammography   $\surd$ $\surd$   $\surd$   $\surd$  multi-scale, multi-view
[200]RenalCT $\surd$   $\surd$    $\surd$   $\surd$  multi-view
[201]RetinalRetinal $\surd$    $\surd$   $\surd$   $\surd$   
[182]RetinalRetinal $\surd$    $\surd$   $\surd$   $\surd$   
[202]ProstateMRI $\surd$    $\surd$   $\surd$   $\surd$   
[203]CervicalHistology $\surd$    $\surd$ $\surd$ $\surd$   $\surd$   
[204]Head and neckHyperspectral $\surd$    $\surd$   $\surd$   $\surd$   
[205]BladderCT $\surd$    $\surd$   $\surd$   $\surd$   

Cancer Type: Radiotherapy is a cancer treatment approach that uses high-dose radiation to kill cancer cells in the brain, head and neck, breast, cervix, prostate, eyes, and liver.

Modality: Various modalities are used to diagnose different cancers. For example, brain cancers are diagnosed by MRI, lung cancers by CT, breast cancers mainly by mammography, tomosynthesis, CT, MRI, and ultrasound imaging, and retinal tumors by MRI and optical imaging. Also, algorithms can extract information from pathological images. The combination of different modalities provides complementary radiomic information.

Input Type: The input to deep learning algorithms includes 2D/3D patches or images/volumes of interest. However, 3D cubes usually require much more memory than 2D patches.

Training Strategy: The deep learning model can be trained based on a pre-trained model (transfer learning) or from scratch. Considering that data are often limited for training a large network, transfer learning is critically important for transferring knowledge from one domain, such as natural images in ImageNet [181], to another domain, in particular medical images. However, Raghu et al [182] recently showed that the transfer learning strategy may only offer a limited performance gain, while much smaller architectures can perform comparably to the standard models. Further investigations are clearly needed.

Involvement of Handcrafted Features: Although the features learnt from deep neural networks are rich, the combination of deep features and handcrafted features seems advantageous in some applications; for example, with those handcrafted features that are widely used by radiologists.

Network Structure: The CNN has become the dominant network architecture for radiomics, since radiomics can be viewed as a classification task in computer vision. CNNs such as ResNet [14] and DenseNet [15] can thus be adapted for radiomics. Some networks contain a decoder so that the extracted features can not only predict labels but also reconstruct the original signals. Afshar et al recently adapted the capsule network [141] for brain tumor radiomics [183].

Single or Multiple Tasks: Most algorithms solve a single task. On the other hand, multi-task algorithms were also developed to solve multiple tasks at the same time and capitalize on synergies among the tasks. By doing so, the prediction accuracy and robustness can be improved.

Note that some algorithms used multi-view or multi-scale analysis of a tumor. Multi-view data offer different perspectives of the tumor in a 3D space, while multi-scale features allow a structured understanding of the tumor. They could be also viewed as ways to augment the training data and facilitate ensemble learning.

Since radiomics tasks are typically classification tasks, classification losses such as the cross-entropy or focal loss can be used to train the model. Note that the focal loss addresses imbalanced classification [144]. To validate the trained model, accuracy, precision, recall, F1 score, ROC, and AUC are well-justified metrics to evaluate the performance.
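For illustration, a binary focal loss along the lines of [144] might be sketched in PyTorch as follows; the gamma and alpha values are common defaults, not settings taken from the reviewed radiomics studies.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Illustrative binary focal loss that down-weights easy examples.
    logits: raw model outputs; targets: 0/1 labels as floats."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```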

Now, let us revisit the concept mentioned at the beginning, 'rawdiomics', as shown in figure 1. With machine learning, radiomics has gained new momentum. So-called radiomics is comprehensive image analysis, a process from an image to a list of comprehensive features that are hand-crafted and/or network-extracted. What we propose is to generalize the concept of radiomics to a new concept of rawdiomics that goes from raw tomographic data to final features, with or without explicitly reconstructed images [206–208]. For example, we could reconstruct complementary images using different algorithms from the same raw dataset, and perform radiomics on all of these images systematically. In this way, each reconstructed image is a different feature channel, so the space of features is significantly enlarged. Alternatively, diagnostic features and decisions can be directly mined from raw data [209].

4. Radiotherapy

Application of deep learning techniques is an emerging trend in the field of radiotherapy. In this section, we review recent advances of machine learning, particularly deep learning, along the chain of radiotherapy treatment planning and delivery, including treatment plan optimization, plan quality assurance, treatment delivery monitoring, and outcome prediction. As it is impossible to cover all the progress, we point out representative major studies in each aspect. Interested readers can find more studies in the literature.

4.1. Treatment planning

The typical process of treatment planning starts from the acquisition of planning images, such as CT, MRI, and PET images. These images are registered together (see section 3.3 on image registration), and then the target volumes and organs at risk are delineated manually or automatically (see section 3.2 on image segmentation). The goal of treatment planning is to generate an optimal plan for an individual patient that meets the clinical requirements, i.e. achieving sufficient dose coverage of the tumor while keeping normal organ doses at an acceptable level or minimized.

Treatment planning techniques for modern radiotherapy, e.g. intensity modulated radiotherapy (IMRT) [81] or volumetric modulated arc therapy (VMAT) [210], are usually formulated as an optimization problem. The objective function for optimization has multiple terms and constraints corresponding to various clinical or practical considerations. A number of parameters exist to precisely define these terms and constraints, such as their priorities. A treatment planning system is capable of solving the optimization problem with a certain optimization algorithm. Nonetheless, deciding the exact parameter values is typically beyond the capability of the treatment planning system. Because the parameter values for the optimal solution are patient-specific and the optimal solution for a patient cannot be perfectly known in treatment planning practice, a planner must adjust these parameters in a trial-and-error fashion based on experience to explore the solution space. Moreover, since the physician ultimately decides whether to accept a plan, the planner has to consult the physician about plan quality, as well as directions to improve the plan if the quality is not acceptable. Not only is this process tedious and time-consuming, but the final plan quality is also affected by many factors, such as the experience of the planner, the available time, and the interactions between the planner and the physician. Hence, extensive efforts are being devoted to automating this process to produce high-quality plans meeting the physician's requirements in a timely fashion.

To provide guidance on the best achievable plan for each individual patient, knowledge-based planning is necessary [211–217]. Methods in this category use a machine learning or deep learning algorithm to establish a correlation between patient anatomy and the best achievable dose volume histogram (DVH), a statistical summary of a dose distribution commonly used to evaluate a plan. Along the same direction, and empowered by the image-to-image mapping capability of deep learning methods, recent studies demonstrated the feasibility of directly predicting the best achievable dose distribution, as opposed to predicting its statistical summary, the DVH. Kajikawa et al first employed an AlexNet CNN [13] to determine whether the plan for a patient with prostate cancer could meet all the dosimetric constraints in treatment planning [218]. Subsequently, a number of groups successfully used different CNNs to directly map contours of targets and organs to dose distributions for different tumor sites [219–223]. The result from an example study [223] is shown in figure 3. Lately, the feasibility has also been shown for special radiotherapy modalities such as helical therapy [224]. Advanced features modeling the dose distribution, such as isodose feature-preserving voxelization, were incorporated into the CNN-based prediction model to improve model accuracy and reliability [225, 226]. These models are expected to guide the planner towards the best plan, making the planning process more effective and efficient.


Figure 3. Dose prediction for treating head and neck cancer. The color bar is shown in the unit of Gy. The clinical ground truth dose is on the top row, followed by the dose predictions of the hierarchically densely connected U-net (HD U-net), Standard U-net, and DenseNet, respectively. Low dose cutoff for viewing was chosen to be 5% of the highest prescription dose (3.5 Gy). Note that all of the models predict more dose on the back of the neck than the ground-truth, which may be caused by insufficient training. Adapted with permission from [223]. Copyright 2019 IOP Publishing.


A human planner usually has the intuition to adjust the parameters in the optimization problem. In light of recent successes in DRL, a new framework called intelligent treatment planning was recently proposed to build a computer agent for task-specific decision-making in a human-like fashion [28, 227, 228]. This line of research aims at building a virtual treatment planner that uses DRL to mimic a human planner's behavior of parameter adjustment in treatment planning. Shen et al demonstrated the feasibility and potential of this approach in a proof-of-principle study of inverse treatment planning for high-dose-rate brachytherapy [229], a special form of radiotherapy. Extensions to treatment planning for external beam radiotherapy are currently in progress [230].

4.2. Quality assurance

To ease the quality assurance (QA) process, deep learning techniques have been utilized to perform virtual QA on plans by identifying those plans that would fail the QA process. Interian et al [231] and Tomori et al [232] employed CNN models to predict a metric that quantifies the success of QA, called the gamma passing rate [233]. Going one step further, Nyflot et al developed a model to identify errors in the LINAC's multi-leaf collimator based on the measured beam fluence map [234]. Efforts to use deep learning techniques for QA were also made for proton radiotherapy. A model was developed to predict the output factors of proton therapy treatment fields, which can be used for sanity checks of output factor measurements [235].

4.3. Treatment delivery

At the treatment delivery stage, it is critical to position the patient against the treatment beam following the planned geometry and to ensure accuracy of targeting the tumor with the beam under practical challenges, such as tumor motion due to respiration. Again, deep learning techniques are being developed to solve these problems.

On the positioning side, certain images of the patient have to be acquired and registered to the treatment planning images to determine the correct patient position. For instance, x-ray projections of the patient anatomy using kilo-voltage (kV) or mega-voltage (MV) beams are taken and compared with projections computed using the treatment planning CT volume. Zhao et al developed a CNN method to identify the prostate target on a kV x-ray projection image, permitting accurate positioning of the target with respect to the therapy beam [236], as shown in figure 4. When comparing MV x-ray projections acquired at the treatment stage with kV projections computed at the planning stage, different image characteristics, e.g. inconsistencies in intensity and contrast caused by different radiation energies, impede the image matching quality. To overcome this challenge, a conditional GAN approach was developed to convert MV images into kV images, which facilitates MV-kV image matching [237].


Figure 4. Results of identifying the prostate target from a 2D x-ray projection of the patient anatomy using a CNN. Prostate bounding boxes derived from the deep learning model are shown in yellow, and the corresponding ground truth is in blue. Adapted with permission from [236]. Copyright 2019 Elsevier.


During the treatment delivery, it is necessary to monitor the tumor and patient motion to ensure beam targeting accuracy and patient safety. For this purpose, Chen et al used a CNN to automatically select the patient surface area that can be monitored under a surface image camera to yield the optimal motion monitoring performance [238]. Park et al developed an intra- and inter-fraction fuzzy deep learning (IIFDL) method to predict lung tumor motion inter- and intra-fractionally [239]. Using an attention-aware CNN with a convolutional LSTM network, real-time motion tracking of a liver tumor was made possible via a robotic-arm-mounted ultrasound imaging system [240]. Lin et al also developed a LSTM-based model to predict the patient respiratory signal in real-time [241].

4.4. Biological effects

Deep learning can predict various biological or clinical quantities to facilitate treatment planning or treatment outcome assessment for different disease sites and treatment techniques. For example, DNNs were used to extract prognostic signatures of quantitative imaging features that stratify patients with non-small-cell lung cancer into low and high mortality risk groups [188]; the model was developed with seven independent datasets across five institutions to ensure robustness. Peeken et al used deep-learning-based free-water correction of diffusion tensor imaging scans to estimate the infiltrative gross tumor volume of patients with glioblastoma, which is the basis for defining the treatment target of this disease [242]. Lee et al proposed a survival recurrent network for patients with gastric cancer and established the association between the molecular subtype of the disease and the optimal adjuvant treatment [243]. Tseng et al employed a DRL approach to adjust the radiation dose during the course of radiotherapy for non-small-cell lung cancer, maximizing local tumor control while limiting the rate of radiation pneumonitis [244]. These studies demonstrate the power of deep learning in providing critical information to support decision-making on treatment strategies.

4.5. Outcome prediction

Wang et al developed a deep learning approach to predict the spatial and temporal evolution of lung tumors during the course of radiotherapy using longitudinal MRI scans [245]. Cui et al designed a network structure that exploits the temporal associations among longitudinal data to predict local control of lung cancer after radiotherapy [246]. In terms of treatment toxicity, a CNN was applied to discover the dosimetric patterns in treatment plans associated with hepatobiliary toxicity after liver stereotactic body radiotherapy [247]. A similar technique was also used to predict rectal toxicity for patients with cervical cancer receiving high-dose-rate brachytherapy [248] and xerostomia in patients undergoing head and neck radiotherapy [249]. Cui et al combined traditional machine learning methods and deep learning techniques to predict lung pneumonitis after radiotherapy [250]. In the era of personalized medicine, more and more patient-specific data beyond images are incorporated for response modeling; for instance, machine learning, and especially deep learning, can be applied in radiogenomics [251].

5. Perspective

Machine learning techniques will soon be utilized during all stages of radiotherapy, starting with diagnostic imaging and its use for target and organ-at-risk delineation (potentially using image registration), followed by automated planning and outcome assessment. The latter will provide valuable input for refinements in treatment planning towards personalized medicine. Combining deep-learning-based medical imaging and AI-driven radiotherapy is therefore a most promising direction. Precision radiotherapy depends critically on high-throughput, quantitative analysis of comprehensive features in medical images such as CT and MRI images and other relevant data. A key insight is that images and other data contain significantly more information than can be visually extracted by radiologists and oncologists, and this information can be effectively harvested by sophisticated algorithms to improve treatments and quality of life. AI-based image-guided radiotherapy will be superior to classic workflows for two main reasons. First, hidden features that a human reader cannot perceive can be utilized in radiomic analysis. Second, extensive prior knowledge and domain constraints can be incorporated into the imaging and radio-therapeutic processes in a data-driven and end-to-end fashion, which is powerful; for example, tomographic algorithms can be elevated to a new level by compensating for model mismatches of many kinds.

During the radiotherapeutic process, big data are generated on anatomical, functional, metabolic, pathological, cellular, and molecular features, especially in the form of tomographic images, genetic profiles, and medical reports [252]. These data can be structured into primitives and patterns, which can be interpreted as 'biological languages' in a general sense. On the other hand, radiological, pathological, and oncological reports/notes are written in natural languages. A major clinical challenge is that these 'biological languages' are extremely difficult to extract and present in a form that radiologists/oncologists can interpret, and the sensitivity and specificity of current medical reports, and of the associated decisions and plans, need major improvement to treat cancer patients much better than the current standard of care. It would be fruitful to synergize expertise in tomographic imaging and image analysis with natural language processing (NLP) to optimize treatment planning and improve the prognosis of cancer patients. This represents a generalization of language concepts, an opportunity to synergize knowledge graphs of different forms, and a convergence of inverse problems, informatics, and knowledge engineering, with a strong application background in radiotherapy.

Figure 5 presents our vision along this direction, including the key elements to (1) process medical images and extract radiomic graphs; (2) integrate clinical reports and construct semantic graphs; and (3) develop graph transformation techniques to convert radiomic graphs to semantic graphs.

Figure 5. Big picture of AI-driven radiotherapy with graph neural networks and natural language processing.

There are several interesting topics demanding active investigation. First of all, we can extract treatment-related image features and organize them as graphs so that spatial and temporal dependencies among the features are captured. A graph-based learning system can then be developed to quantify patient states and predict responses from relevant data, especially CT, nuclear and MRI images, treatment plans, and genomic profiles, for personalized radiotherapy. For example, we could employ a ResNet model pre-trained on ImageNet to extract imaging biomarkers from medical images, dose distributions, and genomic data. To enhance transferability and interpretability, we may fine-tune the pre-trained model on our real data [253] and visualize soft activation maps [193]. The visualization technique t-SNE [254] can also be used to inspect the learnt features in a low-dimensional manifold with respect to classification labels. Furthermore, knowledge graphs can be constructed for each patient using a graph neural network [255–257]. The constructed graphs will be embedded into a low-dimensional space with an adjusted weight for each edge, reflecting the strength of association between cancer growth/response patterns and specific data-driven features.
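A minimal sketch of this pipeline, assuming PyTorch/torchvision and scikit-learn are available, is given below: an ImageNet-pretrained ResNet-18 serves as a fixed feature extractor for intensity-normalized 2D slices, and t-SNE projects the extracted features into two dimensions for inspection. The slice preprocessing, batch, and labels are hypothetical.

```python
# ImageNet-pretrained ResNet-18 as a fixed feature extractor, followed by t-SNE.
import torch
import torchvision
from sklearn.manifold import TSNE

# Pretrained backbone; drop the classification head to expose 512-d features.
resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # older torchvision: pretrained=True
resnet.fc = torch.nn.Identity()
resnet.eval()

@torch.no_grad()
def extract_features(slices):
    """slices: (N, 1, 224, 224) intensity-normalized grayscale CT/MRI slices.
    ResNet expects 3 channels, so the single channel is repeated."""
    return resnet(slices.repeat(1, 3, 1, 1))            # (N, 512) feature vectors

# Hypothetical batch of normalized slices.
slices = torch.rand(32, 1, 224, 224)
features = extract_features(slices).numpy()

# Project the learnt features to 2D; color by label when plotting for inspection.
embedding = TSNE(n_components=2, perplexity=10).fit_transform(features)
```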

A medical semantic/knowledge graph built from a patient's clinical reports and electronic medical records is invaluable for reasoning and for guiding planning. Given these text inputs, domain knowledge can be extracted to construct a library of knowledge graphs using natural language processing techniques [258–262]. For example, to construct knowledge graphs from medical reports/notes, we could take advantage of online services such as Amazon Comprehend Medical [263] or the Watson Natural Language Understanding platform [264] rather than building them from scratch. Based on the cloud services provided by these and other systems, we can distill and query high-quality domain-specific rules/knowledge graphs from unstructured or semi-structured content, such as medical conditions, medication details (dosage, strength, and frequency), and other data from a variety of sources including doctors' notes, clinical trial reports, and patient health records.
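For instance, a hedged sketch of this workflow using boto3 and Amazon Comprehend Medical might look as follows; AWS credentials and region are assumed to be configured, the example note is invented, and the response fields follow the service documentation at the time of writing and may evolve.

```python
# Turning unstructured report text into graph triples with Amazon Comprehend
# Medical via boto3 (assumed configured credentials/region).
import boto3

client = boto3.client("comprehendmedical")

def report_to_triples(report_text):
    """Extract (entity, relation, attribute) triples, e.g.
    ('cisplatin', 'DOSAGE', '75 mg/m2'), as raw material for a knowledge graph."""
    response = client.detect_entities_v2(Text=report_text)
    triples = []
    for entity in response["Entities"]:
        for attribute in entity.get("Attributes", []):
            triples.append((entity["Text"], attribute["Type"], attribute["Text"]))
    return triples

# Hypothetical snippet from a clinical note.
note = "Patient received cisplatin 75 mg/m2 with concurrent radiotherapy to 60 Gy."
for triple in report_to_triples(note):
    print(triple)
```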

With the aforementioned efforts, we will have both radiomic graphs and knowledge/semantic graphs, derived from medical images and medical text data, respectively. These two types of graphs live in different domains: one is derived from images and data that are biologically/clinically informative, while the other is expressed in professional language directly interpretable by experts and understandable by an educated patient. Therefore, we need to bridge these two domains via a cross-domain graph transformation. To this end, we could use a graph-based encoder-decoder network built on graph convolution [256], graph pooling [265], and graph unpooling [266]. The encoder will extract the information from a radiomic graph, while the decoder will reconstruct a corresponding semantic graph. The bottleneck between the encoder and the decoder will serve as the bridge between the image and text domains. In cases where tumors change dynamically across serial reports, a graph-based RNN can be used to learn a dynamic graph mapping [267].
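A conceptual sketch of such a graph encoder-decoder, assuming PyTorch Geometric is installed, is shown below. The decoder is a deliberately simplified placeholder that emits node features for a fixed-size semantic graph, whereas a full system would restore graph resolution with graph unpooling as in Graph U-Nets; all dimensions are illustrative.

```python
# Encoder: graph convolution + top-k pooling over a radiomic graph.
# Decoder: placeholder that predicts node features of a fixed-size semantic graph.
import torch
from torch_geometric.nn import GCNConv, TopKPooling

class RadiomicToSemantic(torch.nn.Module):
    def __init__(self, in_dim=32, hidden=64, semantic_nodes=10, semantic_dim=16):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.pool1 = TopKPooling(hidden, ratio=0.5)
        self.conv2 = GCNConv(hidden, hidden)
        self.decoder = torch.nn.Linear(hidden, semantic_nodes * semantic_dim)
        self.semantic_shape = (semantic_nodes, semantic_dim)

    def forward(self, x, edge_index):
        # Encoder: convolve node features, then keep the most salient nodes.
        x = torch.relu(self.conv1(x, edge_index))
        x, edge_index, _, _, _, _ = self.pool1(x, edge_index)
        x = torch.relu(self.conv2(x, edge_index))
        bottleneck = x.mean(dim=0)                  # graph-level embedding (the "bridge")
        # Placeholder decoder: emit node features of the semantic graph.
        return self.decoder(bottleneck).view(self.semantic_shape)

# Hypothetical radiomic graph: 20 feature nodes with 32-dimensional attributes.
x = torch.randn(20, 32)
edge_index = torch.randint(0, 20, (2, 60))          # random edges for illustration
semantic_nodes = RadiomicToSemantic()(x, edge_index)
```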

6. Conclusion

AI/ML represents a scientific paradigm shift and has already had a great impact on medical imaging and radiotherapy. Instead of claiming that deep learning is reaching its limits, we believe that deep learning is being developed into more advanced forms and enhanced with synergistic techniques, especially NLP, entity resolution, and graph neural networks. In a larger perspective, deep learning demonstrates the successes of connectionism but is also subject to its limits. Through NLP and its general forms enabled by knowledge graphs, it is hoped that connectionism and computationalism will be unified to define the future of AI/ML and showcase new successes in the medical physics world.

Acknowledgment

This work was partially supported by NIH/NCI under award numbers R01CA233888, R01CA237267, R01CA227289, R37CA214639, and R01CA237269, and NIH/NIBIB under award number R01EB026646.

Data availability

Data sharing is not applicable to this article as no new data were created or analysed in this study.

Appendix A.

See table A1.

Table A1. Acronyms and their full names used in this paper.

Acronym: Full name
AI: Artificial intelligence
ML: Machine learning
DL: Deep learning
ANN: Artificial neural network
MLP: Multilayer perceptron
CNN: Convolutional neural network
ReLU: Rectified linear unit
FCN: Fully convolutional network
RED-CNN: Residual encoder-decoder convolutional neural network
GAN: Generative adversarial network
RNN: Recurrent neural network
LSTM: Long short-term memory
GRU: Gated recurrent unit
DRL: Deep reinforcement learning
CT: Computed tomography
MRI: Magnetic resonance imaging
PET: Positron emission tomography
SPECT: Single-photon emission computed tomography
ROC: Receiver operating characteristic
AUC: Area under the curve
US: Ultrasound
TRUS: Transrectal ultrasound
IMRT: Intensity-modulated radiation therapy
VMAT: Volumetric modulated arc therapy
DVH: Dose volume histogram
QA: Quality assurance
kV: Kilo-voltage
MV: Mega-voltage
IIFDL: Intra- and inter-fraction fuzzy deep learning
NLP: Natural language processing