Abstract
We characterize and remedy a failure mode that may arise from multi-scale dynamics with scale imbalances during training of deep neural networks, such as physics informed neural networks (PINNs). PINNs are popular machine-learning templates that allow for seamless integration of physical equation models with data. Their training amounts to solving an optimization problem over a weighted sum of data-fidelity and equation-fidelity objectives. Conflicts between objectives can arise from scale imbalances, heteroscedasticity in the data, stiffness of the physical equation, or from catastrophic interference during sequential training. We explain the training pathology arising from this and propose a simple yet effective inverse Dirichlet weighting strategy to alleviate the issue. We compare with Sobolev training of neural networks, providing the baseline of analytically ε-optimal training. We demonstrate the effectiveness of inverse Dirichlet weighting in various applications, including a multi-scale model of active turbulence, where we show orders of magnitude improvement in accuracy and convergence over conventional PINN training. For inverse modeling using sequential training, we find that inverse Dirichlet weighting protects a PINN against catastrophic forgetting.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Data-driven modeling has emerged as a powerful and complementary approach to first-principles modeling. It was made possible by advances in imaging and measurement technology, computing power, and innovations in machine learning, in particular neural networks. The success of neural networks in data-driven modeling can be attributed to their powerful function approximation properties for diverse functional forms, from image recognition [1] and natural language processing [2] to learning policies in complex environments [3].
Recently, there has been a surge in exploiting the versatile approximation properties of neural networks for surrogate modeling in scientific computing and for data-driven modeling of physical systems [4, 5]. These applications hinge on the approximation properties of neural networks for Sobolev-regular functions [6, 7]. The popular physics informed neural networks (PINNs) rely on knowing a differential equation model of the system in order to solve a soft-constrained optimization problem [5, 8]. Thanks to their mesh-free character, PINNs can be used for both forward and inverse modeling in domains including material science [9–11], fluid dynamics [8, 12–14] and turbulence [15, 16], biology [17], medicine [18, 19], earth science [20], mechanics [21] and uncertainty quantification [22], as well as for solving stochastic [23], high-dimensional [24] and fractional differential equations [25].
The wide application scope of PINNs rests on their parameterization as deep neural networks, which can be tuned by minimizing a weighted sum of data-fitting and equation-fitting objectives. However, finding network parameters that impartially optimize for several objectives is challenging when objectives impose conflicting requirements. In addition, PINNs suffer from spectral bias [26–29], potentially resulting in large discrepancies between convergence rates of different objectives during training [30]. For multi-task learning (MTL) problems, this issue has been addressed by various strategies for relative weighting of conflicting objectives, e.g. based on uncertainty [31], gradient normalization [32] or Pareto optimality [33]. The PINN community in parallel has developed heuristics for loss weighting with demonstrated gains in accuracy and convergence [34, 35]. However, there is neither consensus on when to use such strategies in PINNs, nor are there optimal benchmark solutions for objectives based on differential equations.
Here, we fill this gap by characterizing training pathologies of PINNs and providing criteria for their occurrence. We mathematically explain these pathologies by showing a connection between PINNs and Sobolev training, and we propose a strategy for loss weighting in PINNs based on the Dirichlet energy of the task-specific gradients. We show that this strategy reduces optimization bias and protects against catastrophic forgetting. We evaluate the proposed inverse Dirichlet weighting by comparing with Sobolev training in a case where provably optimal weights can be derived, and by empirical comparison with two conceptually different state-of-the-art PINN weighting approaches.
2. Training a physics informed neural network is a multi-objective optimization problem
An optimization problem involving multiple tasks or objectives can be written as

$$\hat{\theta} = \arg\min_{\theta^{\mathrm{sh}};\,\theta^{1},\ldots,\theta^{K}} \big\{ \mathcal{L}_{1}\big(\theta^{\mathrm{sh}}, \theta^{1}\big), \ldots, \mathcal{L}_{K}\big(\theta^{\mathrm{sh}}, \theta^{K}\big) \big\}. \qquad (1)$$
The total number of tasks is given by K, $\theta^{\mathrm{sh}}$ are shared parameters between the tasks, and $\theta^{1},\ldots,\theta^{K}$ are the task-specific parameters. For conflicting objectives, there is no single optimal solution, but Pareto-optimal solutions can be found using heuristics like evolutionary algorithms [36]. Alternatively, the multi-objective optimization problem can be converted to a single-objective optimization problem using scalarization, e.g. as a weighted sum with real-valued relative weights λk associated with each loss objective $\mathcal{L}_k$:

$$\mathcal{L}\big(\theta^{\mathrm{sh}},\theta^{1},\ldots,\theta^{K}\big) = \sum_{k=1}^{K} \lambda_k\, \mathcal{L}_{k}\big(\theta^{\mathrm{sh}}, \theta^{k}\big). \qquad (2)$$
Often, the λk are dynamically tuned along training epochs [33]. The weights λk intuitively quantify the extent to which we aim to minimize the objective $\mathcal{L}_k$: large λk favor small $\mathcal{L}_k$, and vice versa [37]. For K > 2, manual tuning of the λk and grid search become prohibitively expensive for the large parameterizations common in deep neural networks.
The loss function of a PINN is a scalarized multi-objective loss, where each objective corresponds to either a data-fidelity term or a constraint derived from the physics of the underlying problem [4, 5]. A typical PINN loss contains five objectives:

$$\mathcal{L}(\theta) = \lambda_1\big\|\mathcal{F}(\hat{u};\Sigma) - g\big\|_{\Omega}^{2} + \lambda_2\big\|\hat{u} - u_{\mathrm{b}}\big\|_{\partial\Omega}^{2} + \lambda_3\big\|\hat{u}(\cdot,0) - u_0\big\|_{\Omega}^{2} + \lambda_4\big\|\mathcal{C}(\hat{u})\big\|_{\Omega}^{2} + \lambda_5\big\|\hat{u} - u\big\|_{\mathrm{data}}^{2}, \qquad (3)$$

where the norms denote mean squared errors over the respective sets of sample points.
We distinguish the neural network approximation $\hat{u}(\boldsymbol{x},t)$ from the training data $u(\boldsymbol{x}_i,t_i)$ sampled at the space-time points $(\boldsymbol{x}_i,t_i)$ in the interior of the solution domain Ω, points on the boundary ∂Ω, and samples of the initial state at time t = 0. The λ1, λ2, and λ3 weight the residuals of the physical equation $\mathcal{F}$ with right-hand side g and coefficients Σ, the boundary condition $u_{\mathrm{b}}$, and the initial condition $u_0$, respectively. The objective with weight λ4 imposes an additional constraint $\mathcal{C}(\hat{u})$ on the state variable u, such as, e.g. divergence-freeness ($\nabla\cdot\boldsymbol{u}=0$) of the velocity field of an incompressible flow. The weight λ5 penalizes deviations of the neural network approximation from the available measurement data.
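As an illustration, the following minimal PyTorch sketch composes such a scalarized loss from mean-squared-error terms. The network, the residual callable pde_residual, and the batch layout are illustrative assumptions, not the implementation used for the experiments reported below.

```python
import torch

def scalarized_pinn_loss(model, pde_residual, weights, batch):
    """Weighted sum of the five PINN objectives in equation (3).

    `model`, `pde_residual`, and the entries of `batch` are placeholders:
    batch["interior"] -> collocation points inside the domain,
    batch["boundary"], batch["initial"], batch["data"] -> (points, target values).
    """
    lam = weights  # dict with keys "pde", "bc", "ic", "constraint", "data"
    mse = torch.nn.functional.mse_loss

    x_int = batch["interior"].requires_grad_(True)
    u_int = model(x_int)
    res, div = pde_residual(u_int, x_int)      # equation residual and, e.g., divergence

    x_b, u_b = batch["boundary"]
    x_0, u_0 = batch["initial"]
    x_d, u_d = batch["data"]

    losses = {
        "pde":        mse(res, torch.zeros_like(res)),
        "bc":         mse(model(x_b), u_b),
        "ic":         mse(model(x_0), u_0),
        "constraint": mse(div, torch.zeros_like(div)),
        "data":       mse(model(x_d), u_d),
    }
    total = sum(lam[k] * losses[k] for k in losses)
    return total, losses
```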
PINNs can be used to solve both forward and inverse modeling problems of ordinary differential equations (ODEs) and partial differential equations (PDEs). In the forward problem, a numerical approximation to the solution of an ODE or PDE is to be learned. For this, $\theta^{\mathrm{sh}}$ are the network's parameters and the set of task-specific parameters is empty. In the inverse problem, the coefficients of an ODE or PDE are to be learned from data such that the equation describes the dynamics in the data. Then, the task-specific parameters $\theta^{k}$ correspond to the coefficients Σ of the physical equation that shall be inferred. During training of a PINN, the parameters $(\theta^{\mathrm{sh}},\theta^{1},\ldots,\theta^{K})$ are determined by minimizing the total loss $\mathcal{L}$ for given data.
2.1. Sobolev training is a special case of PINN training
Like most deep neural networks, PINNs are usually trained by first-order parameter updates based on (approximate) loss gradients. The update rule can be interpreted as a discretization of the gradient flow of the total loss, i.e.

$$\theta^{(\tau+1)} = \theta^{(\tau)} - \eta(\tau)\sum_{k=1}^{K} \lambda_k\, \nabla_{\theta}\mathcal{L}_{k}\big(\theta^{(\tau)}\big). \qquad (4)$$
The learning rate $\eta(\tau)$ depends on the training epoch τ, reflecting the adaptive step size used in optimizers like Adam [38]. The gradients are batch gradients $\nabla_{\theta}\mathcal{L}_k = \sum_{i\in\mathcal{P}} \nabla_{\theta}\mathcal{L}_k^{\,i}$, where $\mathcal{P}$ is a mini-batch created by randomly splitting the N training data points into distinct partitions, and $\mathcal{L}_k^{\,i}$ is one summand (for one data point i) of the loss corresponding to the kth objective. For batch size $|\mathcal{P}| = 1$, mini-batch gradient descent reduces to the stochastic gradient descent algorithm.
For approximating a function from data $u_i$, adding derivatives $\mathrm{D}^{1}_{\boldsymbol{x}}u,\ldots,\mathrm{D}^{K}_{\boldsymbol{x}}u$ of the state variable into the training loss has been shown to improve data efficiency and generalization power of neural networks [7]. Minimizing the resulting loss

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\bigg[\lambda_0\big\|\hat{u}(\boldsymbol{x}_i;\theta)-u_i\big\|_2^{2} + \sum_{k=1}^{K}\lambda_k\big\|\mathrm{D}^{k}_{\boldsymbol{x}}\hat{u}(\boldsymbol{x}_i;\theta)-\mathrm{D}^{k}_{\boldsymbol{x}}u_i\big\|_2^{2}\bigg] \qquad (5)$$

using the update rule in equation (4) can be interpreted as a finite approximation to the Sobolev gradient flow of the loss functional. Without loss of generality, we drop the dependence of the state variable on time, writing $\hat{u}(\boldsymbol{x};\theta)$. The neural network is then trained with access to the (K + 2)-tuples $(\boldsymbol{x}_i, u_i, \mathrm{D}^{1}_{\boldsymbol{x}}u_i, \ldots, \mathrm{D}^{K}_{\boldsymbol{x}}u_i)$ instead of the usual training set $(\boldsymbol{x}_i, u_i)$. Such Sobolev training is an instance of scalarized multi-objective optimization with each objective based on the approximation of a specific derivative. The weights are usually chosen as $\lambda_k = 1$ [7].
The additional information from the derivatives introduces an inductive bias, which has been found to improve data efficiency and network generalization [7]. This is akin to PINNs, which use derivatives from the physical equation model to achieve inductive bias. Indeed, the Sobolev loss in equation (5) is a special case of the PINN loss from equation (3), with each objective based on a trivial 'PDE' of the form $\mathrm{D}^{k}_{\boldsymbol{x}}u = g_k$ with right-hand side $g_k = \mathrm{D}^{k}_{\boldsymbol{x}}u_i$, $k = 1,\ldots,K$.
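The following sketch illustrates equation (5) for K = 1 and a scalar function of one variable, computing the derivative of the network output by automatic differentiation; the architecture and the uniform weights are assumptions for illustration.

```python
import torch

def sobolev_loss(net, x, u, du_dx, lam0=1.0, lam1=1.0):
    """Sobolev loss of equation (5) with K = 1: fit u and its first derivative."""
    x = x.requires_grad_(True)
    u_hat = net(x)
    # derivative of the network output with respect to its input, via autograd
    du_hat = torch.autograd.grad(u_hat, x, grad_outputs=torch.ones_like(u_hat),
                                 create_graph=True)[0]
    return lam0 * ((u_hat - u) ** 2).mean() + lam1 * ((du_hat - du_dx) ** 2).mean()

# toy usage on u(x) = sin(x)
net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
x = torch.linspace(0.0, 3.14, 128).reshape(-1, 1)
loss = sobolev_loss(net, x, torch.sin(x), torch.cos(x))
loss.backward()  # gradients w.r.t. the network parameters
```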
2.2. Vanishing task-specific gradients are a failure mode of PINNs
The update rule in equation (4) is known to lead to stiff learning dynamics for loss functionals whose Hessian matrix has large positive eigenvalues [34]. This is the case for Sobolev training, even with K = 1, when the function is highly oscillatory, leading us to the following proposition, proven in appendix B.1.

Proposition 2.2.1 (Stiff learning dynamics in first-order parameter update). For Sobolev training on the tuples $(\boldsymbol{x}_i, u_i, \mathrm{D}^{1}_{\boldsymbol{x}}u_i, \ldots, \mathrm{D}^{K}_{\boldsymbol{x}}u_i)$, with total loss $\mathcal{L} = \sum_k \lambda_k \mathcal{L}_k$, the rate of change of the residual of the dominant Fourier mode $k_0$ using first-order parameter update is:
where $\mathcal{O}$ is the Bachmann–Landau big-O symbol. For large $k_0$, this leads to training dynamics with a slow time scale for the low-order objectives and a fast time scale, growing with a power of $k_0$ set by the derivative order m, for the high-order objectives.
This proposition suggests that losses containing high orders of derivatives m, dominant high frequencies $k_0$, or a combination of both exhibit dominant gradient statistics. This can lead to biased optimization of the dominant objectives with large gradients at the expense of the other objectives with relatively smaller gradients. This is akin to the well-known vanishing gradients phenomenon across the layers of a neural network [39], albeit now across different objectives of a multi-objective optimization problem. We therefore call this phenomenon vanishing task-specific gradients.
Proposition 2.2.1 also admits a qualitative comparison with the phenomenon of numerical stiffness in PDEs of the form

$$\frac{\partial u}{\partial t} = L u$$

with a linear spatial differential operator L. If the real parts of the eigenvalues $\mu_i$ of L span widely different magnitudes, then the solution of the above PDE has fast modes, associated with large $|\mathrm{Re}(\mu_i)|$, and slow modes, associated with small $|\mathrm{Re}(\mu_i)|$. Therefore, high frequency modes evolve on shorter time scales than low frequency modes. The rate amplitude for spatial derivatives of order m is $\mathcal{O}(k^m)$ for the wavenumber k [40]. Due to this analogy, we call the learning dynamics of a neural network with vanishing task-specific gradients stiff. Stiff learning dynamics leads to a discrepancy in the convergence rates of different objectives in a loss function.
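For intuition, consider the stereotypical linear example (not taken from this work) $\partial_t u = \partial_x^2 u - \partial_x^4 u$ on a periodic domain, whose Fourier modes evolve as

$$\hat{u}_k(t) = \hat{u}_k(0)\, e^{-(k^2 + k^4)\,t}.$$

The mode k = 1 decays on a time scale of order one, while the mode k = 10 decays roughly 5 × 10³ times faster; an explicit integrator must resolve the fastest mode even if only the slow ones are of interest. The update rule in equation (4) inherits the same kind of scale disparity when one objective carries high derivatives or dominant high frequencies.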
Taken together, vanishing task-specific gradients constitute a likely failure mode of PINNs, since their training amounts to a generalized version of Sobolev training.
3. Strategies for balanced PINN training
Since PINN training minimizes a weighted sum of multiple objectives (equation (3)), the result depends on appropriately chosen weights λk. The ratio $\lambda_i/\lambda_j$, i ≠ j, between any two weights defines the relative importance of the two objectives during training. Suitable weighting is therefore essential for finding a Pareto-optimal solution that balances all objectives.
3.1. Weighting based on mean gradient statistics
While ad hoc manual tuning is still commonplace, it has recently been proposed to determine the weights as moving averages over the inverse gradient magnitude statistics of a given objective as [34]:

$$\hat{\lambda}_k^{(\tau)} = \frac{\max\big(\big|\nabla_{\theta}\mathcal{L}_1(\tau)\big|\big)}{\overline{\big|\nabla_{\theta}\mathcal{L}_k(\tau)\big|}}, \qquad \lambda_k^{(\tau+1)} = \alpha\,\hat{\lambda}_k^{(\tau)} + (1-\alpha)\,\lambda_k^{(\tau)}, \qquad (7)$$

where $\nabla_{\theta}\mathcal{L}_1$ is the gradient of the equation residual, i.e. of the first objective in equation (3), and α is a user-defined learning rate (here always α = 0.5). All gradients are taken with respect to the parameters θ shared across tasks, and $|\cdot|$ is the component-wise absolute value. The overbar signifies the algebraic mean over the vector components. The maximum in the numerator is over all components of the vector. While this max/avg weighting has been shown to improve PINN performance [34], it does not completely alleviate the problem of vanishing task-specific gradients.
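A minimal sketch of this max/avg update for a list of per-objective scalar losses and a PyTorch model is given below; the helper flat_grad, the convention of keeping the first (residual) objective at unit weight, and the numerical safeguard are illustrative choices, not the reference implementation of [34].

```python
import torch

def flat_grad(loss, params):
    """Flattened gradient of one objective w.r.t. the shared parameters."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([g.reshape(-1) if g is not None else torch.zeros_like(p).reshape(-1)
                      for g, p in zip(grads, params)])

def max_avg_weights(losses, params, lam_old, alpha=0.5):
    """Equation (7): lambda_hat_k = max|grad L_1| / mean|grad L_k|, then moving average."""
    g1 = flat_grad(losses[0], params).abs()
    lam_new = []
    for k, (L_k, lam_k) in enumerate(zip(losses, lam_old)):
        if k == 0:
            lam_new.append(1.0)  # convention: keep the residual objective at unit weight
            continue
        gk = flat_grad(L_k, params).abs()
        lam_hat = (g1.max() / (gk.mean() + 1e-12)).item()
        lam_new.append(alpha * lam_hat + (1 - alpha) * lam_k)
    return lam_new
```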
3.2. Inverse Dirichlet weighting
Instead of determining the loss weights proportional to the inverse average gradient magnitude, we here propose to use weights based on the gradient variance. In particular, we propose to use weights for which the variances over the components of the back-propagated weighted gradients become equal across all objectives, thus directly preventing vanishing task-specific gradients. This can be achieved in any (stochastic) gradient-descent optimizer by using the update rule in equation (4) with weights

$$\hat{\lambda}_k^{(\tau)} = \frac{\gamma(\tau)}{\operatorname{std}\big(\nabla_{\theta}\mathcal{L}_k(\tau)\big)}, \qquad \lambda_k^{(\tau+1)} = \alpha\,\hat{\lambda}_k^{(\tau)} + (1-\alpha)\,\lambda_k^{(\tau)}, \qquad (8)$$
where $\gamma(\tau) = \max_k \big\{\operatorname{std}\big(\nabla_{\theta}\mathcal{L}_k(\tau)\big)\big\}$, and $\operatorname{std}(\cdot)$ is the empirical (sample) standard deviation over the vector components. Under the assumption that the back-propagated gradients are normally distributed, the weighting strategy in equation (8) leads to balanced gradient distributions with variance $\gamma^2(\tau)$, i.e.

$$\nabla_{\theta}\big(\lambda_k \mathcal{L}_k\big) \sim \mathcal{N}\big(\mu_k, \gamma^2(\tau)\big) \quad \text{for all } k = 1,\ldots,K,$$

where $\mathcal{N}(\mu_k, \gamma^2)$ is the normal distribution with mean $\mu_k$ and the given variance.
For optimizers that are invariant to diagonal scaling of the gradients, such as the popular Adam optimizer [38], the weights in equation (8) can be efficiently computed as the inverse of the square root of the Dirichlet energy of each objective, i.e.

$$\lambda_k^{(\tau)} \propto \Bigg(\frac{1}{P}\sum_{j=1}^{P}\bigg(\frac{\partial \mathcal{L}_k}{\partial \theta_j}\bigg)^{\!2}\Bigg)^{-1/2},$$

where P is the number of shared network parameters.
We therefore refer to this weighting strategy as inverse Dirichlet weighting.
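The inverse Dirichlet update can be implemented in a few lines; the sketch below follows the description above (per-objective gradient standard deviation over the parameter components, balanced to the largest one, smoothed by a moving average with α = 0.5), with variable names and the small numerical safeguard being illustrative choices.

```python
import torch

def flat_grad(loss, params):
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([g.reshape(-1) if g is not None else torch.zeros_like(p).reshape(-1)
                      for g, p in zip(grads, params)])

def inverse_dirichlet_weights(losses, params, lam_old, alpha=0.5, eps=1e-12):
    """Equation (8): balance the standard deviations of the back-propagated gradients."""
    stds = [flat_grad(L_k, params).std() for L_k in losses]
    gamma = torch.stack(stds).max()            # largest gradient spread across objectives
    lam_new = []
    for std_k, lam_k in zip(stds, lam_old):
        lam_hat = (gamma / (std_k + eps)).item()
        lam_new.append(alpha * lam_hat + (1 - alpha) * lam_k)  # moving average, alpha = 0.5
    return lam_new

# usage inside a training step (losses = [L_1, ..., L_K] on the current batch):
# lam = inverse_dirichlet_weights(losses, list(model.parameters()), lam)
# total = sum(l * L for l, L in zip(lam, losses)); total.backward()
```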
The weighting strategies in equations (7) and (8) are fundamentally different. Equation (7) weights objectives in inverse proportion to certainty as quantified by the average gradient magnitude. Equation (8) uses weights that are inversely proportional to uncertainty as quantified by the variance of the loss gradients. Equation (8) is also different from previous uncertainty-weighted approaches [31] because it measures the training uncertainty stemming from variance in the loss gradients rather than the uncertainty from the model's observational noise.
3.3. Gradient-based multi-objective optimization
We compare weighting-based scalarization approaches with gradient-based multi-objective approaches using Karush-Kuhn-Tucker (KKT) local optimality conditions to find a descent direction that decreases all objectives. Specifically, we adapt the multiple gradient descent algorithm (MGDA) [33] to PINNs. This algorithm leverages the KKT conditions

$$\sum_{k=1}^{K} \lambda_k\, \nabla_{\theta^{\mathrm{sh}}}\mathcal{L}_k\big(\theta^{\mathrm{sh}},\theta^{k}\big) = 0, \qquad \lambda_k \geqslant 0 \;\;\forall k, \qquad \sum_{k=1}^{K}\lambda_k = 1 \qquad (9)$$
to solve an MTL problem. The MGDA is guaranteed to converge to a Pareto-stationary point [41]. Given the gradients of all objectives with respect to the shared parameters, we find weights λk that satisfy the above KKT conditions. The resulting λ leads to a descent direction that improves all objectives [33], see also figure A.1. We adapt the MGDA to PINNs as described in appendix A.2. We refer to this adapted algorithm as pinn-MGDA.
4. Results
As we posit above, PINNs are expected to fail for problems with dominant high frequencies, i.e. with high $k_0$ or high derivatives m, as is typical for multi-scale PDE models. We therefore present results of numerical experiments for such cases and compare the approximation properties of PINNs with different weighting schemes. This showcases training failure modes originating from vanishing task-specific gradients.
We first consider Sobolev training of neural networks, which is a special case of PINN training, as discussed in section 2.1. Due to the linear nature of the Sobolev loss (equation (5)), this test case allows us to compute ε-optimal analytical weights (see appendix A.1), providing a baseline for our comparison. Second, we consider a real-world example of a nonlinear PDE, using PINNs to model active turbulence.
In the first example, we specifically consider the Sobolev training problem with target function derivatives up to fourth order, i.e. K = 4, and the corresponding total loss, equation (10).
The neural network is trained on the data and their derivatives. To mimic inverse-modeling scenarios, in which unknown coefficients of a PDE model need to be inferred from data, the neural network is tasked to infer unknown scalar pre-factors with known ground-truth values in addition to producing an accurate estimate of the function. For this loss, ε-optimal weights can be analytically determined (see appendix A.1) such that all objectives are minimized in an unbiased fashion [42]. This provides the baseline against which we benchmark the different weighting schemes.
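A loss of this type, consistent with the description above, can be sketched as follows; the placement of the pre-factors $c_k$ in front of the derivative terms is an assumption for illustration, not the exact form of equation (10):

$$\mathcal{L}(\theta, c) = \frac{1}{N}\sum_{i=1}^{N}\bigg[\lambda_0\,\big\|\hat{u}(\boldsymbol{x}_i;\theta)-u_i\big\|_2^{2} + \sum_{k=1}^{4}\lambda_k\,\big\|c_k\,\mathrm{D}^{k}_{\boldsymbol{x}}\hat{u}(\boldsymbol{x}_i;\theta) - \mathrm{D}^{k}_{\boldsymbol{x}}u_i\big\|_2^{2}\bigg].$$

Here the $c_k$ play the role of the task-specific parameters $\theta^{k}$ in equation (1), while θ are the shared network parameters.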
4.1. Inverse Dirichlet weighting avoids vanishing task-specific gradients
We characterize the phenomenon of vanishing task-specific gradients, demonstrate its impact on the accuracy of a learned function approximation, and empirically validate proposition 2.2.1. For this, we consider a generic neural network with five layers and 64 neurons per layer, tasked with learning a 2D function that contains a wide spectrum of frequencies and unknown coefficients using Sobolev training with the loss from equation (10). The details of the test problem and the training setup are given in appendix B.2.
We first confirm that in this test case vanishing task-specific gradients occur when using uniform weights, i.e. when setting λk = 1 for all k. For this, we plot the gradient histogram of the five loss terms in figure 1(A). The histograms clearly show that training is biased in this case to mainly optimize the objective containing the highest derivative k = 4. All other objectives are neglected, as predicted by proposition 2.2.1. This leads to a uniformly high approximation error on the test data (training vs. test data split 50:50, see appendix B.2).
This is confirmed when looking at the power spectrum of the function approximation learned by the neural network in figure 2(D). Inverse Dirichlet weighting leads to results that are as good as those achieved by optimal weights in this test case, in particular when approximating high frequencies, as predicted by proposition 2.2.1. A closer look at the power spectra learned by the different weighting strategies also reveals that the max/avg strategy fails to capture high frequencies, whereas the pinn-MGDA performs surprisingly well (figure 2(D)) despite its unbalanced gradients (figure 1(B)). This may be problem-specific, as the weight trajectories taken by pinn-MGDA during training (figure B.1) differ fundamentally from the behavior of the other methods, possibly because the pinn-MGDA follows an altogether different strategy, purely based on satisfying the Pareto-stationarity conditions (equation (9)). As shown in figure B.2, this leads to a rapid attenuation of the Fourier spectrum of the residual over the first few learning epochs in this test case, allowing pinn-MGDA to escape the stiffness defined in proposition 2.2.1.
In summary, we observe stark differences between the weighting strategies. Of the strategies compared, only the inverse Dirichlet weighting proposed here is able to avoid vanishing task-specific gradients, performing on par with the ε-optimal solution. The pinn-MGDA seems to be able to escape, rather than avoid, the stiffness caused by vanishing task-specific gradients. We also observe a clear correlation between having balanced task-specific gradient distributions and achieving good approximation and estimation accuracy.
4.2. Inverse Dirichlet weighting enables multi-scale modeling using PINNs
We investigate the use of PINNs in modeling multi-scale dynamics in space and time, where vanishing task-specific gradients are a common problem. As a real-world test problem, we consider meso-scale active turbulence as it occurs in living fluids, such as bacterial suspensions [43, 44] and multi-cellular tissues [45]. Unlike inertial turbulence, active turbulence has energy injected at small length scales comparable to the size of the actively moving entities, like bacteria or individual motor proteins. The spatio-temporal dynamics of active turbulence can be modeled using the generalized incompressible Navier–Stokes equation [43, 44]:

$$\frac{\partial \boldsymbol{u}}{\partial t} + \xi_0\,(\boldsymbol{u}\cdot\nabla)\boldsymbol{u} = -\nabla p + \xi_1 \nabla|\boldsymbol{u}|^2 - \alpha\boldsymbol{u} - \beta|\boldsymbol{u}|^2\boldsymbol{u} + \Gamma_0\nabla^2\boldsymbol{u} - \Gamma_2\nabla^4\boldsymbol{u}, \qquad \nabla\cdot\boldsymbol{u} = 0, \qquad (11)$$
where $\boldsymbol{u}(\boldsymbol{x},t)$ is the mean-field flow velocity and $p(\boldsymbol{x},t)$ the pressure field. The coefficient ξ0 scales the contribution from advection and ξ1 the active pressure. The α and β terms correspond to a quartic Landau-type velocity potential. The higher-order terms with coefficients Γ0 and Γ2 model passive and active stresses arising from hydrodynamic and steric interactions, respectively [44]. Figure C.2 illustrates the rich multi-scale dynamics observed when numerically solving this equation in a square domain for the parameter regime of reference [43] (see appendix C).
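Evaluating the residual of equation (11) inside a PINN requires nesting automatic differentiation up to fourth order. The sketch below shows this for the x-component of the velocity, assuming a network with outputs (u, v, p); all names are illustrative and only the terms of equation (11) are assembled.

```python
import torch

def d(f, x):
    """First derivative of f w.r.t. coordinate tensor x (same shape as f)."""
    return torch.autograd.grad(f, x, grad_outputs=torch.ones_like(f), create_graph=True)[0]

def residual_x(net, x, y, t, xi0, xi1, alpha, beta, g0, g2):
    x, y, t = (c.requires_grad_(True) for c in (x, y, t))
    out = net(torch.cat([x, y, t], dim=1))
    u, v, p = out[:, :1], out[:, 1:2], out[:, 2:3]

    u_t = d(u, t)
    u_x, u_y = d(u, x), d(u, y)
    lap_u = d(u_x, x) + d(u_y, y)                    # Laplacian of u
    bilap_u = d(d(lap_u, x), x) + d(d(lap_u, y), y)  # biharmonic of u
    sq = u ** 2 + v ** 2

    # x-component of the residual of equation (11), all terms moved to the left-hand side
    return (u_t + xi0 * (u * u_x + v * u_y) + d(p, x) - xi1 * d(sq, x)
            + alpha * u + beta * sq * u - g0 * lap_u + g2 * bilap_u)
```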
We demonstrate these training pathologies by applying PINNs to approximate the forward solution of equation (11) in 2D for the parameter regime explored in reference [43] in both square (figure C.2) and annular (figure 3(E)) domains. The details of the simulation and training setups are given in appendix C.
Figures 3(A) and (B) show the average (over different initializations of the neural network weights) relative $L_2$ errors of the vorticity field over training epochs when using different PINN weighting strategies. As predicted by proposition 2.2.1, the uniformly weighted PINN fails completely, both in the square (figure 3(A)) and in the annular geometry (figure 3(B)). Optimization is entirely biased towards minimizing the PDE residual, neglecting the initial, boundary, and incompressibility conditions. This imbalance is also clearly visible in the loss gradient distributions shown in figure C.5(A). Comparing the approximation accuracy after 7000 training epochs confirms the findings from the previous section, with inverse Dirichlet weighting outperforming the other methods, followed by pinn-MGDA and max/avg weighting. Inverse Dirichlet weighting is found to assign adequate weights to the initial and boundary conditions relative to the equation residual (see figures C.3(A) and (D)).
In figure 3(C), we compare the convergence of the learned approximation to a ground-truth numerical solution for different spatial resolutions h of the discretization grid. The errors scale as $\mathcal{O}(h^r)$ with convergence orders r as identified by the dashed lines. While the pinn-MGDA and max/avg weighting achieve linear convergence (r = 1), the inverse Dirichlet weighted PINN converges with order r = 4. The uniformly weighted PINN does not converge at all and is therefore not shown in the plot. The evolution of the Fourier spectra of the residuals of both velocity components, shown in figure C.4, confirms the rapid solution convergence achieved by inverse Dirichlet weighted PINNs.
Inverse Dirichlet weighting also perfectly recovers the power spectrum E(k) of the active turbulence velocity field, as shown in figure 3(D). This confirms that inverse Dirichlet weighting enables PINNs to capture high frequencies in the PDE solution, corresponding to fine structures in the fields, by avoiding the stiffness problem from proposition 2.2.1. Indeed, the PINNs trained with the other weighting methods fail to predict small structures in the flow field, as shown by the spatial distributions of the prediction errors in figures 3(F)–(I) compared to the ground-truth numerical solution in figure 3(E).
Additional results for a 2D Poisson problem and a curvature-driven level-set flow are presented in appendices D and E, respectively.
In summary, these results suggest that avoiding vanishing task-specific gradients can enable the use of PINNs in multi-scale problems previously not amenable to neural-network modeling. Moreover, we observe that seemingly small changes in the weighting strategy can lead to significant differences in the convergence order of the learned approximation.
4.3. Inverse Dirichlet weighting protects against catastrophic forgetting
Artificial neural networks tend to 'forget' their parameterization when sequentially training for multiple tasks [46], a phenomenon referred to as catastrophic forgetting. Sequential training is often used, e.g. when learning to de-noise data and estimate parameters at the same time, like combined image segmentation and denoising [47], or when learning solutions of fluid mechanics models with additional constraints like divergence-freeness of the flow velocity field [16].
In PINNs, catastrophic forgetting can occur, e.g. in the frequently practiced procedure of first training on measurement data alone and adding the PDE model only at later epochs [48]. Such sequential training can be motivated by the computational cost of evaluating the derivatives in the PDE using automatic differentiation over the network: it seems prudent to first pre-condition the network on data fitting and to introduce the residual later, in order to then regress the remaining network parameters.
Using the active turbulence test case described above, we investigate how the different weighting schemes compared here influence the phenomenon of catastrophic forgetting. For this, we consider the inverse problem of inferring unknown model parameters and latent fields, like the effective pressure p, from flow velocity data $u_i$. We train the PINNs in three sequential steps: (1) first fitting the flow velocity data, (2) later including the additional objective for the incompressibility constraint $\nabla\cdot\boldsymbol{u}=0$, and (3) finally also including the objective for the PDE model residual. This creates a setup in which catastrophic forgetting occurs when using uniform PINN weights, as confirmed in figure 4(A) (arrows). As soon as additional objectives (divergence-freeness at epoch 1000 and equation residual at epoch 2000) are introduced, the PINN loses the parameterization learned on the previous objectives, leading to a large and increasing test error for the field $\boldsymbol{u}$.
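A compressed sketch of such a three-stage schedule with dynamic re-weighting is shown below; the epoch counts, the loss factory make_losses, and the reweight callback (which could wrap, e.g., the inverse Dirichlet update sketched in section 3.2) are placeholders for illustration.

```python
import torch

def train_sequential(model, make_losses, reweight, epochs=(1000, 1000, 5000), lr=1e-3):
    """Introduce objectives one by one: data fit, then divergence, then PDE residual."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    stages = [("data",), ("data", "div"), ("data", "div", "pde")]
    for n_epochs, active in zip(epochs, stages):
        lam = {k: 1.0 for k in active}              # reset weights when a new task appears
        for epoch in range(n_epochs):
            losses = make_losses(model, active)     # dict of scalar losses for active tasks
            if epoch % 5 == 0:                      # dynamic re-weighting every 5th epoch
                lam = reweight(losses, list(model.parameters()), lam)
            total = sum(lam[k] * losses[k] for k in active)
            opt.zero_grad()
            total.backward()
            opt.step()
    return model
```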
Dynamic weighting strategies can protect PINNs against catastrophic forgetting by adequately adjusting the weights whenever an additional objective is introduced. This is confirmed in figure 4(A), where inverse Dirichlet weighting and pinn-MGDA are only minimally affected by catastrophic forgetting. The max/avg weighting strategy [34], however, does not protect the network quite as well. When looking at the test errors for the learned model coefficients (figure 4(B)) and the accuracy of the estimated latent pressure field (figure 4(C)), for which no training data is given, the inverse Dirichlet weighting proposed here outperforms the other approaches. Snapshots of the reconstructed pressure fields using the different strategies are shown in figure C.7. Results for inference from noisy data are shown in figures C.6 and C.8.
In summary, inverse Dirichlet weighting can protect a PINN against catastrophic forgetting, making sequential training a viable option. In our tests, it offered the best protection of all compared weighting schemes.
5. Conclusion and discussion
PINNs have rapidly been adopted by the scientific community for diverse applications in forward and inverse modeling. It is straightforward to show that the training of a PINN amounts to a MTL problem, which is sensitive to the choice of regularization weights. Trivial uniform weights are known to lead to failure of PINNs, a failure that is empirically found to be exacerbated for physical models with disparate scales or high-order derivatives.
As we have shown here, this failure can be explained by analogy to numerical stiffness of PDEs when understanding Sobolev training of neural networks [7] as a special case of PINN training. This connection enabled us to prove a proposition stating that PINN learning pathologies are caused by dominant high-frequency components or high derivatives in the physics prior. The asymptotic arguments in our proposition inspired a novel dynamic adaptation strategy for the regularization weights of a PINN during training, which we termed inverse Dirichlet weighting.
We empirically compared inverse Dirichlet weighting with the recently proposed max/avg weighting for PINNs [34]. Moreover, we looked at PINN training from the perspective of multi-objective optimization, which avoids a-priori choices of regularization weights. We therefore adapted to PINNs the state-of-the-art MGDA for training multi-objective artificial neural networks, which has previously been proven to converge to a Pareto-stationary solution [33]. This provided another baseline to compare with. Finally, we compared against analytically derived ε-optimal static weights in a simple linear test problem where those can be derived, providing an absolute baseline. In all comparisons, and across a wide range of problems from 2D Poisson to multi-scale active turbulence and curvature-driven level-set flow, inverse Dirichlet weighting empirically performed best. We also found that all weighting strategies are in general agnostic to the choice of the activation function, with inverse Dirichlet weighting outperforming the other strategies irrespective of the activation function chosen.
The inverse Dirichlet weighting proposed here can be interpreted as an uncertainty weighting with training uncertainty quantified by the batch variance of the loss gradients. This is different from previous approaches [31] that used weights quantifying the model uncertainty, rather than the training uncertainty. The proposition proven here provides an explanation for why the loss gradient variance may be a good choice, and our results confirmed that weighting based on the Dirichlet energy of the loss objectives outperforms other weighting heuristics, as well as Pareto-front seeking algorithms like pinn-MGDA. The proposed inverse Dirichlet weighting also performed best at protecting a PINN against catastrophic forgetting in sequential training, where loss objectives are introduced one-by-one, making sequential training of PINNs a viable option in practice.
While we have focused on PINNs here, our analysis equally applies to Sobolev training of other neural networks, and to other multi-objective losses, as long as training is done using an explicit first-order (stochastic) gradient descent optimizer, of which Adam [38] is the most commonly used. In all our examples, we found that the Adam optimizer leads to better convergence than other first-order update rules like RMSprop or Adagrad (see figure D.2). However, hybrid strategies can also be envisioned. For instance, when the optimizer has converged close to a local optimum, the uncertainty in the gradients is small relative to the mean gradient component [49]. In such scenarios, inverse Dirichlet weighting can lead to large optimization step sizes that cause the learning to plateau (see figure E.1). A hybrid strategy could then be beneficial, combining initial inverse Dirichlet weighting with, e.g. LBFGS-based fine-tuning close to a local optimum [16].
Taken together, we have presented a connection between PINNs and Sobolev training in order to explain a likely failure mode of PINNs, and we have proposed a simple yet effective strategy to avoid training failure. The proposed inverse Dirichlet weighting strategy is easily included into any first-order optimizer at almost no additional computational cost (see figure C.9). It has the potential to improve the accuracy and convergence of PINNs by orders of magnitude, to enable sequential training with less risk of catastrophic forgetting, and to extend the application realm of PINNs to multi-scale problems that were previously not amenable to data-driven modeling.
Acknowledgments
This work was supported by the German Research Foundation (DFG) – EXC-2068, Cluster of Excellence 'Physics of Life', and by the Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, funded by the Federal Ministry of Education and Research (BMBF). The present work was also partly funded by the Center for Advanced Systems Understanding (CASUS), which is financed by Germany's Federal Ministry of Education and Research (BMBF) and by the Saxon Ministry for Science, Culture and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament.
Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/mosaic-group/inverse-dirichlet-pinn.
Appendix A
A.1. Analytically optimal weights
For Sobolev loss functions based on elliptic PDEs, optimal weights λk that minimize all objectives in an unbiased way can be determined by ε-closeness analysis [42].

Definition 1 (ε-closeness). A candidate solution $\hat{u}$ is called ε-close to the true solution u if it satisfies

$$\big\|\mathrm{D}^{\alpha}\hat{u} - \mathrm{D}^{\alpha}u\big\| \leqslant \varepsilon$$

for any multi-index α with $|\alpha| \leqslant K$.
For the loss function typically used in Sobolev training of neural networks, equation (5), ε-closeness analysis can be used to bound each objective term and, in turn, the total loss from above (equation (A4)).
This bound can be used to determine the weights that result in the smallest ε.
The weights that lead to the tightest upper bound on the total loss in equation (A4) can be found by solving the maximization-minimization problem described in the following corollary.

Corollary A.1.1. The maximization-minimization problem can be transformed into a piecewise linear convex optimization problem by introducing an auxiliary variable z, such that
The analytical solution to this problem exists in closed form.
For a Sobolev training problem with loss given by equation (5), the weights that lead to ε-optimal solutions can be computed accordingly.
A.2. PINN multiple gradient descent algorithm (pinn-MGDA)
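As a minimal illustration of the idea behind pinn-MGDA: for K = 2, the Frank–Wolfe-type solver used by MGDA [33] reduces to a closed-form min-norm combination of the two task gradients. The sketch below covers only this two-objective special case and is not the full pinn-MGDA implementation.

```python
import torch

def min_norm_coeff(g1, g2):
    """Closed-form solution of min_a ||a*g1 + (1-a)*g2||^2 for a in [0, 1] (two-task MGDA)."""
    diff = g1 - g2
    a = torch.dot(g2 - g1, g2) / (torch.dot(diff, diff) + 1e-12)
    return float(a.clamp(0.0, 1.0))

def mgda_direction(losses, params):
    """Min-norm convex combination of the two task gradients; stepping along its negative
    decreases both objectives unless the current point is already Pareto-stationary."""
    grads = []
    for L in losses:  # exactly two objectives in this sketch
        g = torch.autograd.grad(L, params, retain_graph=True, allow_unused=True)
        grads.append(torch.cat([gi.reshape(-1) if gi is not None else torch.zeros_like(p).reshape(-1)
                                for gi, p in zip(g, params)]))
    a = min_norm_coeff(grads[0], grads[1])
    return a * grads[0] + (1.0 - a) * grads[1], (a, 1.0 - a)
```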
Appendix B: Sobolev training
B.1. Proof of proposition 2.2.1
Let us assume the Fourier spectrum of the signal is a delta function, i.e. concentrated at the dominant wavenumber $k_0$. The residual at training time t is given as $r(\boldsymbol{x},t) = \hat{u}(\boldsymbol{x},t;\theta) - u(\boldsymbol{x})$, and its corresponding Fourier coefficient for wavenumber k is denoted by $\hat{r}_k(t)$. The rate of change of the residual at the dominant frequency is given in equation (B1).

Equation (B1) follows from the gradient flow of the parameters as described in equation (4). From Parseval's theorem [50], we can compute the magnitude of the gradients of both objectives with respect to the network parameters θ.
The second step in the above calculation involves the triangle inequality and the assumption that the Fourier spectrum of the signal is a delta function. In a similar fashion, we can repeat the same analysis for the Sobolev part of the objective function as follows:
By substituting equations (B2) and (B3) into equation (B1), we obtain the result of proposition 2.2.1.
Here, we use the fact that the spectral rate of the network function residual is inversely proportional to the wavenumber, i.e. $\big|\partial \hat{r}_k/\partial t\big| \propto k^{-1}$ [26].
B.2. Sobolev training setup
We use a neural network with five layers, 64 neurons per layer, and a nonlinear activation function. The target function is a sinusoidal forcing term composed of M = 20 randomly parameterized modes: the parameters of the modes are drawn independently and uniformly at random, and the parameters which set the local length scales are sampled uniformly from a discrete set. The target function is given on a 2D spatial grid, and the grid data is then divided into a train and test split, which corresponds to 8192 training points used. We train the neural network for 20 000 epochs using the Adam optimizer [38]. Each epoch is composed of two mini-batch updates. The initial learning rate is decreased by a factor of 10 after 10 000 and 15 000 epochs.
Appendix C: Forward and inverse modeling in active turbulence
The ground truth solution of the 2D incompressible active fluid model is computed in the vorticity formulation.
The simulation is conducted with a pseudo-spectral solver, where the derivatives are computed in Fourier space and the time integration is done using the integrating factor method [40, 51]. The equations are solved on a square domain with a time-step dt = 0.0001. The incompressibility is enforced by projecting the intermediate velocities onto the divergence-free manifold. The simulations in the square domain are performed with periodic boundary conditions, with the values of the parameters chosen as in reference [43].
C.1. Forward problem setup
We study the forward solution of the active turbulence problem using PINNs. We employ a neural network with seven layers and 100 neurons per layer and sin-activation functions. For choosing the activation function, we conducted extensive numerical experiments using different activation functions on a simple function reconstruction problem.

In figure C.1 we clearly see that the sin activation function leads to the best approximation accuracy for the velocities and their associated derivatives. The vorticity ω is directly computed from the network outputs using automatic differentiation. For the solution in the square domain, we select a sub-domain with 50 time steps (dt = 0.01), amounting to a total time of 0.5 units in simulation time. We sample points for the initial condition, points on the boundary, and residual points in the interior of the square domain. The network is trained for 7000 epochs using the Adam optimizer, with each epoch consisting of 49 mini-batch updates. The velocities are provided at the boundaries of the training domain.
For the forward simulation in the annular geometry, the domain is chosen as an annulus with fixed inner and outer radii. We sample points for the initial condition, points on the inner and outer boundaries, and residual points in the interior of the domain. The network training is again performed over 7000 epochs using Adam, with each epoch consisting of 41 mini-batch updates. Dirichlet boundary conditions are enforced at the inner and outer rims of the annular geometry by providing the velocities there. The network is optimized subject to a loss function with four objectives, whose weights correspond to the boundary condition, the equation residual, the divergence constraint, and the initial condition, respectively.
C.2. Inverse problem setup
For studying the inverse problem, we resort to the primitive formulation of the active turbulence model, with p as the effective pressure. We used a neural network with five layers and 100 neurons per layer and sin-activation functions. Given the measurement data of flow velocities at discrete points, we infer the effective pressure along with the unknown model parameters of equation (11). The points are sampled from the same domain as for the forward solution in the square domain and are divided into a train and test split. The network is trained for a total of 7000 epochs using the Adam optimizer, with each epoch consisting of 35 mini-batch updates. We optimize the model subject to a loss function with three objectives, whose weights correspond to the data-fidelity term, the equation residual, and the objective that penalizes non-zero divergence of the velocities, respectively. The training is done in a three-step approach, with the first 1000 epochs used for fitting the measurement data (pre-training), the next 1000 epochs training with an additional constraint on the divergence, finally followed by the introduction of the equation residual. At every checkpoint introducing a new task, we reset all λk used in the computation of the max/avg weights (see equation (7) of the main text) to their initial values. For both the forward and the inverse problem, we choose an initial learning rate that is decreased by a factor of 10 after 3000 and 6000 epochs (see figure C.3).
Appendix D: 2D Poisson problem
We consider the problem of solving the 2D Poisson equation, with the parameter ω controlling the frequency of the oscillatory analytical solution. The analytical solution is computed on a square domain on a regular grid, and the boundary values are prescribed on the domain boundary. We set up the PINN with two objective functions, one for handling the boundary conditions and the other for the equation residual.

We sample boundary points uniformly on the four boundaries and randomly sample residual points from the interior of the domain. We choose a network with five layers and 50 nodes per layer, and the tanh activation function is used. The neural network is trained for 30 000 epochs using the Adam optimizer. We choose the initial learning rate and decrease it by a factor of 10 after 10 000, 20 000 and 30 000 epochs. In figure D.1 we show the relative $L_2$ error for different weighting strategies across changing wavenumber ω. We notice a very similar trend as observed in the previous problems, with the inverse Dirichlet scheme performing on par with the analytical ε-optimal weighting strategy, followed by max/avg and pinn-MGDA. Using the Poisson example, we also compare the effect of the choice of the optimizer on the performance of the weighting strategies. In figure D.2, we show the performance of the weighting strategies using the RMSprop optimizer (left) and Adagrad (right). We find that the inverse Dirichlet strategy outperforms the other strategies in both cases. Especially for Adagrad, we find our weighting strategy to be the most accurate by an order of magnitude. Interestingly, static ε-optimal weights fail to capture the solution for high wavenumbers when using Adagrad (see figure D.2, right).
Appendix E: Curvature-driven level set
The level-set method [52] is a popular method for the implicit representation of surfaces. In contrast to explicit methods for interface capturing, level-set methods offer a natural means to handle complex changes in the interface topology. For this reason, level-set methods have found extensive applications in computer vision [53], multiphase and computational fluid dynamics [54], and for solving PDEs on surfaces [55]. For this example, we consider the curvature-driven flow problem that evolves the topology of a complex interface based on its local curvature, and can be tailored for image processing tasks like image smoothing and enhancement [53, 56]. Specifically, we look at the problem of a wound spiral which relaxes under its own curvature [52]. From the parametrization of the initial spiral, the point locations $\boldsymbol{x}_p$ of the zero contour can be computed.

The level-set function φ is then computed using the distance function $d(\boldsymbol{x}, \boldsymbol{x}_p)$, with w being the width of the spiral. The distance function measures the shortest distance between two points $\boldsymbol{x}$ and $\boldsymbol{x}_p$. Given the level-set function φ, we can compute the 2D curvature κ as

$$\kappa = \nabla\cdot\bigg(\frac{\nabla\varphi}{|\nabla\varphi|}\bigg) = \frac{\varphi_{xx}\varphi_y^2 - 2\varphi_x\varphi_y\varphi_{xy} + \varphi_{yy}\varphi_x^2}{\big(\varphi_x^2 + \varphi_y^2\big)^{3/2}}.$$

The evolution of the level-set function is driven by the local curvature κ and is computed using the PDE

$$\frac{\partial\varphi}{\partial t} = \kappa\,|\nabla\varphi|.$$
The simulation was conducted on a regular spatial grid covering the domain, using the finite-difference method with reinitialization as necessary to preserve the signed-distance property of the level-set function φ. The spiral parameters are chosen as in [57]. Time integration is performed using a second-order Runge–Kutta scheme with a step size of dt = 0.001.
E.1. Training setup
The training data is generated by uniform sampling in the space-time domain, with 80 time snapshots separated by a time-step dt = 0.001. The network is trained for 2500 epochs using the Adam optimizer, with each epoch consisting of 187 mini-batch updates. A neural network with five layers and 64 nodes per layer is chosen for this problem. We found that using the ELU activation function resulted in the best performance. The total loss is a weighted sum of the PINN objectives for this problem, analogous to equation (3).

We choose the initial learning rate and decrease it by a factor of 10 after 1500 and 2000 epochs.
Appendix F: Network initialization and parameterization
Let $X \in \mathbb{R}^{N\times d}$ be the set of all input coordinates used to train the neural network, where N denotes the total number of data points and d the dimension of the input data, i.e. the spatial and, where applicable, temporal coordinates. Moreover, $\mathcal{P} \subset X$ will denote a single batch, obtained by randomly subsampling X into subsets. We then define an n-layer deep neural network with parameters θ as a composite function $f_{\theta} = T_n \circ \sigma \circ T_{n-1} \circ \cdots \circ \sigma \circ T_1$, mapping inputs in $\mathbb{R}^{d}$ to outputs in $\mathbb{R}^{D}$, where D corresponds to the dimension of the output variable. The output is usually computed as

$$f_{\theta}(\boldsymbol{x}) = W_n\,\sigma\big(W_{n-1}\,\sigma\big(\cdots\,\sigma\big(W_1\boldsymbol{x} + b_1\big)\cdots\big) + b_{n-1}\big) + b_n$$

with the affine linear maps $T_i(\boldsymbol{z}) = W_i\boldsymbol{z} + b_i$, $i = 1,\ldots,n$, and output dimension $q_n = D$. Here, $W_i \in \mathbb{R}^{q_i\times q_{i-1}}$ describes the weights of the incoming edges of layer i with $q_i$ neurons, $b_i \in \mathbb{R}^{q_i}$ contains a scalar bias for each neuron, and σ is a non-linear activation function. The network parameters are initialized using normal Xavier initialization [39] with $W_i \sim \mathcal{N}(0, \sigma_W^2)$, where $\mathcal{N}$ is a Gaussian distribution with zero mean and standard deviation

$$\sigma_W = g\,\sqrt{\frac{2}{q_{i-1} + q_i}},$$

and $b_i = 0$. The gain g was chosen to be g = 1 for all activation functions except for tanh, with a recommended gain of g = 5/3. Before feeding an input $\boldsymbol{x}$ into the first layer of the neural network, we normalize it with respect to the training data as

$$\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x} - \mu_X}{\sigma_X},$$

where $\mu_X$ and $\sigma_X$ are the column-wise mean and standard deviations of the training data X.
F.1. Weights computation
We initialize all weights to λk = 1. For all experiments, we update the dynamic weights at the first batch of every fifth epoch. The moving average is then performed with α = 0.5. The mini-batches are randomized after each epoch, such that for every update of the weights, the λk are computed with respect to a different subset of all training points X. Furthermore, batching is only performed over the interior points Nint. As the numbers of initial and boundary points are much smaller than Nint, these points do not require batching and remain fixed throughout the training.
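Putting the pieces together, the schematic epoch loop below updates the dynamic weights on the first mini-batch of every fifth epoch and keeps them fixed otherwise; all callables are placeholders, and the reweight callback is assumed to follow the signature of the inverse Dirichlet sketch in section 3.2.

```python
import torch

def train(model, batches_per_epoch, make_batch_losses, reweight, n_epochs, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    lam = None
    for epoch in range(n_epochs):
        for b in range(batches_per_epoch):
            losses = make_batch_losses(model, epoch, b)      # list of per-objective losses
            if lam is None or (epoch % 5 == 0 and b == 0):   # first batch of every fifth epoch
                lam = reweight(losses, list(model.parameters()), lam or [1.0] * len(losses))
            total = sum(l * L for l, L in zip(lam, losses))
            opt.zero_grad()
            total.backward()
            opt.step()
    return model
```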